|Copyright 1996-2001 by Donald R. Tveter, commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html|
* "Efficient Backprop" by Yann LeCun, Leon Bottou, Genevieve B. Orr, Klaus-Robert Muller from AT&T Research. This is a fairly easy-to-read summary of the most useful methods for improving backprop, with an explanation of why they work.
* "Bagging Predictors" by Leo Breiman from University of California at Berkeley. I'm including this paper because the technique is easy and I saw a neural network paper where, with conjugate gradient, it produced improved results. On the other hand, on every problem I've tried it on with non-conjugate-gradient techniques (rprop, quickprop, SuperSAB, delta-bar-delta and plain backprop) I have NOT gotten better results. It would be nice if someone would research this and find out whether it is only better with conjugate gradient.
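For readers who want to see how little machinery bagging needs, here is a minimal sketch of the procedure Breiman describes: draw bootstrap resamples of the training set, fit one predictor per resample, and average their outputs. The tiny least-squares "model" and the toy data are invented for illustration; in the papers above the base predictor would be a backprop network.

```python
import random

def fit_line(xs, ys):
    # Closed-form least-squares fit of y = a*x + b (stand-in for any base learner).
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:               # degenerate resample: fall back to a constant
        return 0.0, my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return a, my - a * mx

def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    # Bagging: one model per bootstrap resample, prediction = average of models.
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample n points WITH replacement
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a * x_new + b)
    return sum(preds) / n_models

# Noisy data around y = 2x + 1 (made up for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
print(bagged_predict(xs, ys, 6.0))
```

For regression the averaging is direct as above; for classification Breiman's version takes a plurality vote over the resampled models instead.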
* "Life in the Fast Lane: The Evolution of an Adaptive Vehicle Control System" by Todd Jochem and Dean Pomerleau in the Summer 1996 AI Magazine, pp. 11-50.
* "Expert Prediction, Symbolic Learning and Neural Networks: An Experiment in Greyhound Racing" by Hsinchun Chen, Peter Buntin Rinde, Linlin She, Siunie Sutjahjo, Chris Sommer and Daryl Neely in the December 1994 IEEE Expert, pp. 21-27. In this test backprop beat ID3 and ID3 beat three human experts in terms of payoff. It is also available in an online version from The University of Arizona.
* "Speeding up Backpropagation Algorithms by using Cross-Entropy combined with Pattern Normalization" by Merten Joost and Wolfram Schiffmann, from University of Koblenz-Landau, Germany. The authors show that using the cross-entropy error measure on classification problems, rather than the traditional sum-squared error, has the net effect of simply skipping the derivative term in the traditional formulation. Apparently this skip-the-derivative-term trick originated with the Differential Step Size method of Chen and Mars. I've been using it for years and worrying that I was biasing the training in a bad way, but now, thanks to this paper, I know I don't have to worry. The other point in the paper is another obvious one: training goes faster when the inputs are translated and scaled by the mean and standard deviation. Finally the authors challenge everyone to produce better results on their test problem, where the update algorithm is rprop.
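To make the two points concrete, here is a small sketch with invented numbers. For a logistic output unit, the gradient of the cross-entropy error with respect to the unit's net input works out to just (output - target), while the sum-squared-error gradient keeps the extra logistic-derivative factor output*(1 - output) that shrinks the update when the unit saturates. The second function is the mean/standard-deviation normalization of an input column.

```python
import math

def logistic(net):
    return 1.0 / (1.0 + math.exp(-net))

net, target = 0.3, 1.0
out = logistic(net)

# Sum-squared error E = (out - target)^2 / 2:
#   dE/dnet = (out - target) * out * (1 - out)   <- derivative term included
sse_grad = (out - target) * out * (1.0 - out)

# Cross-entropy E = -[t*ln(out) + (1-t)*ln(1-out)]:
#   dE/dnet = out - target                       <- derivative term cancels out
ce_grad = out - target

def standardize(column):
    # Translate and scale one input variable by its mean and standard deviation.
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    return [(x - mean) / std for x in column]

print(sse_grad, ce_grad)
print(standardize([10.0, 20.0, 30.0, 40.0]))
```

Note that the cross-entropy gradient is larger in magnitude whenever the unit is wrong, which is exactly why skipping the derivative term speeds training on classification problems.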
* "Investigation of the CasCor Family of Learning Algorithms" by Lutz Prechelt in Neural Networks, volume 10, number 5, July 1997, pp. 885-896. Cascade correlation and five other variations, including Fahlman's unpublished Cascade 2 algorithm, are described and compared on 42 different data sets. Some results: it is slightly better not to cascade the hidden-layer units, and error-minimization candidate training (as in Cascade 2) is better for regression problems while it may be a little worse for classification problems. Available online from The University of Karlsruhe, Germany.
* "Noisy Time Series Prediction using Symbolic Representation and Recurrent Neural Network Grammatical Inference" by Steve Lawrence, Ah Chung Tsoi and C. Lee Giles is available from NEC in New Jersey. This is a fairly easy-to-follow paper that shows the transformations that have to be done for financial data (currency fluctuations), including the use of a self-organizing map. Note how, after all the work they go to, the networks give results only somewhat better than the flip of a coin. I'd guess that better predictions could be made with additional data like budget deficits, inflation rates, unemployment rates and so on. A short version, "Rule Inference for Financial Prediction using Recurrent Neural Networks," from Proceedings of the IEEE/IAFE Conf. on Computational Intelligence for Financial Engineering, p. 253, IEEE Press, 1997, can also be found at NEC in New Jersey.