Copyright 1996-2001 by Donald R. Tveter, commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html

Theory says that most normal functions can be approximated using a single hidden layer with a non-polynomial activation function, linear output units and N-1 hidden layer units, where N is the number of training patterns. (There is lots of fine print.) On the other hand, it is desirable to use as few units as possible, and then there is the issue of whether one or two hidden layers will give the best results. The available results are as follows:
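To see why N-1 hidden units plus an output bias are enough in principle: the linear output layer then has N free weights, so matching N patterns is an N x N linear system, which generically has an exact solution. The theorem's actual construction is different; the sketch below, in plain Python, just fixes random tanh hidden weights and solves for the linear output weights directly instead of training them (the choice of y = sin(2x) and five patterns is arbitrary illustration):

```python
import math, random

random.seed(0)

# N training patterns: points sampled from y = sin(2x) (an arbitrary choice).
xs = [0.0, 0.5, 1.0, 1.5, 2.0]            # N = 5 input patterns
ys = [math.sin(2 * x) for x in xs]        # targets

N = len(xs)
H = N - 1                                 # N-1 tanh hidden units

# Fixed random hidden-layer weights and biases.
w = [random.uniform(-2, 2) for _ in range(H)]
b = [random.uniform(-2, 2) for _ in range(H)]

def hidden(x):
    # Hidden activations plus a constant 1 for the linear output's bias.
    return [math.tanh(w[j] * x + b[j]) for j in range(H)] + [1.0]

# The linear output layer must satisfy the N x N system A * ow = ys.
# Solve it by Gaussian elimination with partial pivoting.
A = [hidden(x) for x in xs]
rhs = ys[:]
for col in range(N):
    p = max(range(col, N), key=lambda r: abs(A[r][col]))
    A[col], A[p] = A[p], A[col]
    rhs[col], rhs[p] = rhs[p], rhs[col]
    for r in range(col + 1, N):
        f = A[r][col] / A[col][col]
        for cc in range(col, N):
            A[r][cc] -= f * A[col][cc]
        rhs[r] -= f * rhs[col]
ow = [0.0] * N
for col in reversed(range(N)):
    s = rhs[col] - sum(A[col][cc] * ow[cc] for cc in range(col + 1, N))
    ow[col] = s / A[col][col]

def net(x):
    # One hidden layer of tanh units feeding a single linear output.
    return sum(o * hj for o, hj in zip(ow, hidden(x)))

# The network reproduces all N patterns (up to rounding error).
max_err = max(abs(net(x) - y) for x, y in zip(xs, ys))
```

This is only the counting argument made concrete: it memorizes the patterns exactly, which is exactly why the fine print matters, as a network this size says nothing about behavior between the training points.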

There is an online article by Eduardo Sontag and another article by Daniel L. Chester that give reasons why two hidden layers may be necessary or worthwhile. There are also a few experimental results showing that two hidden layers can be better than one; see the article by Patrick M. Shea and Felix Liu and the article by Alvin J. Surkan and J. Clay Singleton. On the other hand, an experiment by Jacques de Villiers and Etienne Barnard showed that two-hidden-layer networks are more prone to fall into bad local minima. A paper by Tamura and Tateishi shows that most normal functions can be approximated with two hidden layers using only N/2+3 hidden layer units, implying that maybe two hidden layers are better.

My advice is to play around with a single hidden layer and then if necessary try an additional hidden layer.

From time to time someone recommends a formula for the best number of hidden layer units to use; however, my opinion, and the opinion of the posters in comp.ai.neural-nets, is that so far all such formulas are garbage. For problems where there are not many bends in the curves or surfaces you only need a small number of hidden layer units, but for complex problems like the twin spirals problem you need quite a few. If you know ahead of time how complicated the geometry of your problem is (unlikely) then you can guess either "few" or "many", but that is it. At present the only sure way to know is trial and error. My advice is to start small, because small networks train faster.
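Trial and error over the number of hidden units can at least be automated. A minimal pure-Python sketch, where the task (fitting y = x^2 on nine points, a curve with a single bend), the candidate sizes, the learning rate and the epoch count are all made-up illustrative choices:

```python
import math, random

def train(h, xs, ys, epochs=2000, lr=0.05):
    """Train a 1-input, h-hidden-unit (tanh), 1-linear-output net with
    plain per-pattern gradient descent; return (initial_mse, final_mse)."""
    random.seed(h)                                   # reproducible per size
    w = [random.uniform(-1, 1) for _ in range(h)]    # input -> hidden
    b = [random.uniform(-1, 1) for _ in range(h)]    # hidden biases
    v = [random.uniform(-1, 1) for _ in range(h)]    # hidden -> output
    c = 0.0                                          # output bias

    def forward(x):
        hid = [math.tanh(w[j] * x + b[j]) for j in range(h)]
        return hid, sum(v[j] * hid[j] for j in range(h)) + c

    def mse():
        return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

    first = mse()
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            hid, out = forward(x)
            d = out - y                              # output error
            for j in range(h):
                g = d * v[j] * (1 - hid[j] ** 2)     # backprop through tanh
                v[j] -= lr * d * hid[j]
                w[j] -= lr * g * x
                b[j] -= lr * g
            c -= lr * d
    return first, mse()

# A curve with one bend: a handful of hidden units should be enough.
xs = [i / 4 - 1 for i in range(9)]                   # 9 points in [-1, 1]
ys = [x * x for x in xs]

# Trial and error, smallest networks first.
results = {h: train(h, xs, ys) for h in (1, 2, 4, 8)}
```

In practice you would pick the smallest size whose error (ideally on held-out data, not the training error used here) is acceptable, which also follows the start-small advice since the small candidates train fastest.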

Another option is to use an algorithm that adds hidden layer units one by one as necessary when the error fails to decrease by a certain amount. The original experiment was by Timur Ash; another well-known method is cascade correlation by Scott Fahlman, but there are many others.
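The growth idea can be sketched in a few lines. The following toy rule is not Ash's dynamic node creation or Fahlman's cascade correlation as published (cascade correlation, for instance, trains candidate units to correlate with the residual error and then freezes their input weights); it is just the basic loop, with made-up thresholds, a made-up task, and a new unit's output weight initialized to zero so that adding it does not disturb what the network has already learned:

```python
import math, random

random.seed(1)

xs = [i / 4 - 1 for i in range(9)]        # toy task: fit y = x^2 on 9 points
ys = [x * x for x in xs]

w, b, v = [], [], []                      # hidden weights, biases, output weights
c = 0.0                                   # output bias

def add_unit():
    # A new unit starts with a zero output weight, so adding it leaves
    # the network's current input-output mapping unchanged.
    w.append(random.uniform(-1, 1))
    b.append(random.uniform(-1, 1))
    v.append(0.0)

def forward(x):
    hid = [math.tanh(wj * x + bj) for wj, bj in zip(w, b)]
    return hid, sum(vj * hj for vj, hj in zip(v, hid)) + c

def mse():
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train_chunk(epochs=200, lr=0.05):
    # Plain per-pattern gradient descent for a fixed number of epochs.
    global c
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            hid, out = forward(x)
            d = out - y
            for j in range(len(w)):
                g = d * v[j] * (1 - hid[j] ** 2)
                v[j] -= lr * d * hid[j]
                w[j] -= lr * g * x
                b[j] -= lr * g
            c -= lr * d

add_unit()                                # start with a single hidden unit
err = mse()
for _ in range(20):                       # up to 20 training chunks
    train_chunk()
    new_err = mse()
    # Grow when the error stalls (less than a 1% relative improvement)
    # but is still above the target, up to a cap of 8 units.
    if err - new_err < 0.01 * err and new_err > 1e-3 and len(w) < 8:
        add_unit()
    err = new_err
```

The stall threshold, error target and unit cap are exactly the kind of knobs the published methods spend their effort choosing well; the point of the sketch is only the control flow of train, stall, add a unit, repeat.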