Copyright 1996-2001 by Donald R. Tveter, commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html

* "Efficient Backprop" by Yann LeCun, Leon Bottou, Genevieve B. Orr and Klaus-Robert Muller from AT&T Research. This is a fairly easy-to-read summary of the most useful methods for improving backprop, with an explanation of why they work.

* "Bagging Predictors" by Leo Breiman from the University of California at Berkeley. I'm including this paper because the technique is easy and I saw a neural network paper where, with conjugate gradient, it produced improved results. On the other hand, on every problem I've tried it on using non-conjugate-gradient techniques (rprop, quickprop, SuperSAB, Delta-Bar-Delta and plain backprop) I have NOT gotten better results. It would be nice if someone would research it and find out if it's only better with conjugate gradient.
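
The idea really is easy: train several copies of the predictor on bootstrap samples of the training set and average their outputs. A minimal sketch in Python (the toy `fit_mean` "model" and the function names are my own illustration, not from Breiman's paper; in practice each `fit` call would train a network):

```python
import random

def bootstrap_sample(data):
    """Draw a bootstrap sample: same size, drawn with replacement."""
    return [random.choice(data) for _ in data]

def bagged_predict(train, test_x, fit, n_models=25):
    """Fit n_models predictors on bootstrap samples and average their
    predictions (averaging for regression; classification would use a
    majority vote instead)."""
    models = [fit(bootstrap_sample(train)) for _ in range(n_models)]
    preds = [m(test_x) for m in models]
    return sum(preds) / len(preds)

# Toy stand-in for network training: each "model" just predicts the
# mean target of its bootstrap sample, ignoring the input.
random.seed(0)
train = [(x, 2.0 * x) for x in range(10)]
fit_mean = lambda sample: (lambda x: sum(y for _, y in sample) / len(sample))
print(bagged_predict(train, 5, fit_mean))
```

The averaging mainly helps when the individual predictors are unstable, which may be part of why conjugate gradient runs (which can overfit each sample differently) benefit while the other update methods don't.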

*
"Life in the Fast Lane: The Evolution of an Adaptive Vehicle
Control System" by Todd Jochem and Dean Pomerleau in the Summer
1996 *AI Magazine*, pp. 11-50.

*
"Expert Prediction, Symbolic Learning and Neural Networks: An
Experiment in Greyhound Racing" by Hsinchun Chen, Peter Buntin
Rinde, Linlin She, Siunie Sutjahjo, Chris Sommer and Daryl
Neely in the December 1994 *IEEE Expert*, pp. 21-27. In this
test backprop beat ID3 and ID3 beat three human experts in
terms of payoff. It is also available in an online version from
The University of Arizona.

* "Speeding up Backpropagation Algorithms by using Cross-Entropy combined with Pattern Normalization" by Merten Joost and Wolfram Schiffmann, from the University of Koblenz-Landau, Germany. The authors show that using the cross-entropy error measure on classification problems, rather than the traditional sum-squared error, has the net effect of simply skipping the derivative term in the traditional formulation. Apparently this skip-the-derivative trick originated with the Differential Step Size method of Chen and Mars. I've been using it for years and worrying that I was biasing the training in a bad way, but now, thanks to this paper, I know I don't have to worry. The other point in the paper is another obvious one: training goes faster when the inputs are translated and scaled by the mean and standard deviation. Finally, the authors challenge everyone to produce better results on their test problem, where the update algorithm is rprop.
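
Both points can be shown in a few lines. For a sigmoid output unit, the sum-squared-error delta carries the factor o*(1-o), which goes to zero as the unit saturates, while the cross-entropy delta is just o - t, exactly the skip-the-derivative rule. A sketch (the function names are mine, for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Sum-squared-error delta at an output unit: the sigmoid derivative
# o*(1-o) shrinks the delta toward zero as o saturates near 0 or 1.
def delta_sse(o, t):
    return (o - t) * o * (1.0 - o)

# Cross-entropy delta: the derivative term cancels, leaving o - t,
# i.e. the skip-the-derivative rule.
def delta_ce(o, t):
    return o - t

# A badly wrong, saturated unit: target 0 but output near 1. The
# sum-squared-error delta is tiny while the cross-entropy delta is
# large, so cross-entropy escapes the flat spot much faster.
o, t = sigmoid(4.0), 0.0
print(delta_sse(o, t), delta_ce(o, t))

# The paper's other point: translate and scale each input by its
# mean and standard deviation before training.
def normalize(xs):
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

print(normalize([1.0, 2.0, 3.0, 4.0, 5.0]))
```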

*
"Investigation of the CasCor Family of Learning Algorithms"
by Lutz Prechelt in *Neural Networks*, volume 10, number
5, July 1997, pp. 885-896. Cascade correlation and five other
variations, including the unpublished Cascade 2 algorithm by
Fahlman, are described and compared on 42 different data sets.
Some results: it is slightly better not to cascade the hidden
layer units, and error-minimization candidate training (as in
Cascade 2) is better for regression problems while it may be a
little worse for classification problems. Available online from
The University of Karlsruhe, Germany.

*
"Noisy Time Series Prediction using Symbolic Representation and
Recurrent Neural Network Grammatical Inference" by Steve
Lawrence, Ah Chung Tsoi and C. Lee Giles is available from
NEC in New Jersey.
This is a fairly easy-to-follow paper that shows the
transformations that have to be done for financial data
(currency fluctuations) including the use of a self-organizing
map. Note that, after all the work they go to, the networks
give results only somewhat better than the flip of a coin. I'd
guess that better predictions could be made with additional
data such as budget deficits, inflation rates and
unemployment rates. A short version, "Rule
Inference for Financial Prediction using Recurrent Neural
Networks," for *Proceedings of the IEEE/IAFE Conf. on
Computational Intelligence for Financial Engineering*,
p. 253, IEEE Press, 1997, can be found at:
NEC in New Jersey.
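
The core transformation is easy to sketch: turn the raw series into returns, then map each return onto a small alphabet of symbols for the recurrent network to do grammatical inference on. The quantizer below uses simple equal-frequency bins as a crude stand-in for the one-dimensional self-organizing map the paper uses (an assumption for illustration; the real SOM learns the bin centers rather than fixing them by rank):

```python
def returns(prices):
    """Convert a price series into one-step fractional changes."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def quantize(values, n_symbols=4):
    """Map each value to one of n_symbols labels via equal-frequency
    bins -- a stand-in for the SOM-based symbolic encoding in the
    paper."""
    order = sorted(values)
    bounds = [order[len(order) * i // n_symbols]
              for i in range(1, n_symbols)]
    def symbol(v):
        # Number of bin boundaries at or below v gives the label 0..3.
        return sum(v >= b for b in bounds)
    return [symbol(v) for v in values]

# Toy exchange-rate series (made up for illustration).
prices = [1.00, 1.02, 1.01, 1.05, 1.04, 1.06, 1.03, 1.07]
print(quantize(returns(prices)))
```

The symbol sequence, rather than the noisy raw values, is what the recurrent network is trained on, which is much of what makes the prediction problem tractable at all.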