Plain backprop is terribly slow and everyone wants it to go faster. There are a series of things that can be done to speed up learning:
A summary of these and other methods can be found in the online article: Efficient Backprop.
The first major improvement to backprop is extremely simple: you can fudge the derivative term in the output layer. If you are using the usual backprop activation function:
the derivative is:
where s is the activation value of the output unit and most often D = 1. The derivative is largest at s = 1/2 and it is here that you will get the largest weight changes. Unfortunately as you near the values 0 or 1 the derivative term gets close to 0 and the weight changes become very small. In fact if the network's response is 1 and the target is 0, that is the network is off by quite a lot, you end up with very small weight changes. It can take a VERY long time for the training process to correct this. More than likely you will get tired of waiting. Fahlman's solution was to add 0.1 to the derivative term making it:
The solution of Chen and Mars was to drop the derivative term altogether, in effect the derivative was 1. This method passes back much larger error quotas to the lower layer, so large that a smaller eta must be used there. In their experiments on the 10-5-10 encoder problem they found the best results came when that eta was 0.1 times the upper level eta, hence they called their method the "differential step size" method. One tenth is not always the best value so you must experiment with both the upper and lower level etas to get the best results. Besides that, the eta you use for the upper layer must also be much smaller than the eta you use without this method. I almost always use this method of fudging the derivative term. I had been worrying that this biases results however in an online article by Joost and Schiffmann they show that with the cross entropy error measure (appropriate for classification problems) the net effect is that the output layer derivative is 1.
Adding direct connections from the input layer to the output layer can often speed up training. It is supposed to work best when the function to be approximated is almost linear and it only needs a small amount of adjustment from nonlinear hidden layer units. This method can also cut down on the number of hidden layer units you need. It is not recommended when there are a large number of output units because then you add more free parameters to the net and possibly hurt generalization. One very mathematical online article by Eduardo Sontag discusses the virtue of the direct connections.
In an article by Izui and Pentland they show that training time can be decreased by increasing the sharpness or gain (D) in the standard sigmoid:
In fact they show that the training time goes as 1 / D for training without momentum and 1 / sqrt(D) for networks with momentum. This is not a perfect speed-up scheme since when D is too large you run the risk of becoming trapped in a local minimum. Sometimes the best value for D is less than 1. This can be used in combination with all other training algorithms.
Everyone wants faster training and there are many variations on backprop that will speed up training times enormously, however at times I worry about these acceleration algorithms: do they give the best possible results on the test set? From what I can tell between my own experiments and postings in cainn it may be that the very slow online update method will sometimes give the best results so what you get from these acceleration algorithms may be slightly less than what you can achieve with this pitifully slow method. People have observed this with the sonar data: the best results come from one pattern at a time updates. Having said that, in most cases the acceleration algorithms work so much faster than either online or batch backprop you should try them first and then if you want slightly better results you can try the slower online update method. Given that online updating is so slow you may give up on it long before you get a network trained.
First there is a whole collection of algorithms where you assign a different learning rate (usually called eta) to each weight. As the training proceeds you increase eta in some way if you keep going downhill in terms of the error. When the weight change gets too large you end up on the other side of the valley and for this you must decrease the learning rate in some way. Algorithms vary on both the speeding up and slowing down details. Second there is a set of algorithms known as conjugate gradient methods. Third there are the methods that build up the network one hidden unit at a time. Here are some of common algorithms:
Rprop may be one of the best training algorithms and there are a number of online articles available on it: rprop, While rprop may be the best choice in more cases than any other it is not the best in every case, from time to time I run into a problem where rprop behaves poorly.
Quickprop is also one of the better training algorithms and is loosely based on Newton's method for finding the root of a quadratic function. The article on this by Scott Fahlman is available online.
The Super self-adjusting back-propagation algorithm (SuperSAB) was developed by Tom Tollenaere. My experience with it so far shows that it is not as fast as quickprop or rprop with standard problems however at this stage in neural networking research this is not the final verdict, sometimes it probably will be the best. I did see a posting on the net that SuperSAB was "among the best" algorithms in another research report. To get good results you may have to limit each eta to some relatively small value, say 1 to 5 or so.
Cascade correlation by Fahlman and Lebiere may be the fastest available training method and it constructs the network by adding hidden units one at a time. For a description see the online article. One of the reasons it is so fast is that only the new weights are trained, the rest of the weights stay fixed. A recent paper by Lutz Prechelt compares the original algorithm with 5 variations.
Recurrent Cascade correlation (RCC) adapts CC to recurrent networks and the RCC article is available online. However RCC as it was originally proposed is incapable of representing and thus learning certain types of temporal sequences (certain automata for example), for a proof, explanation and improvements see the online article by C.L. Giles and others.
There is a collection of methods called "conjugate gradient methods" for training networks that are reportedly very fast. A very detailed tutorial on the CG approach is a 50+ page online postscript article by Jonathan Shewchuk. Another online source is a section of the book, Numerical Recipes in C where one section covers Conjugate Gradient Methods in Multidimensions 420. A collection of online articles by Patrick van der Smagt gives some results on xor, fitting sin(x)*cos(2*x) and tan(x). CG performed especially well on the function fitting problems yet these were data sets without noise and normally people use real world and therefore noisy data. In a test by Alpsan and others with real world data CG did not perform any better than less sophisticated methods.
There are many other ways to speed up training, for an online article that lists quite a few of them see the article by Schiffmann and others. A recent offline article by Dogan Alpsan and others gives interesting results using various algorithms on a medical classification problem.