Backpropagator's Review

Copyright 1996-2001 by Donald R. Tveter; commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way, but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html


The Articles Part II

Last Change to This File: April 16, 1999

* "A Neural Network Model for the Gold Market" by Peter J. McCann and Barry L. Kalman by WWW from Washington University in St. Louis. This paper shows how a recurrent network learned to predict the tops and bottoms of the gold market.

* "PROBEN1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules" by Lutz Prechelt from The University of Karlsruhe, Germany. Beginners will be surprised by how careful you have to be in assessing the performance of a backprop network. Of course, as the title says this paper also includes a number of real world data sets, available from Carnegie-Mellon and the University of Karlsruhe, Germany. Note that this is a 2MB file and when gunzipped comes to about 15MB.

* "Neural Networks and Statistical Models" by Warren S. Sarle from SAS Institute Inc., North Carolina . There is very little here that will improve your backprop "game" yet it is good for beginners to see how neural network models relate to statistical models.

* "Neural Network Implementations in SAS Software" by Warren S. Sarle from the SAS Institute Inc., North Carolina. This is another useful article for beginners and includes the "kangaroo" analogy for backprop training.

* "On the Analysis of Pyrolysis Mass Spectra Using Artificial Neural Networks: Individual Input Scaling Leads to Rapid Learning" by Mark J. Neal, Royston Goodacre and Douglas B. Kell, from the University of Wales in the United Kingdom. The application in this paper is hardly as glamorous as predicting the gold market however it includes an experiment on scaling the input data. In one case the data was scaled over the entire input set and in the other case it was scaled for each individual input. This later method was 100 times faster with plain backprop than the other method. I was a little confused by their method because it came up on the net in the context of a different scaling approach. Douglas Kell made this clear for me by writing the following:

We just normalised each value by scaling to the range, i.e. if the smallest value in a column is xmin and the largest is xmax, then after scaling any value x goes to x' = (x - xmin)/(xmax - xmin). If one wants headroom (i.e. largest and smallest values not equal to exactly 1 and 0), make xmax some percentage larger than it really is and xmin smaller than it is (but not less than 0).

(By the way, the other method was to subtract the mean and divide by the standard deviation, a method that appears to me to be even better; a small sketch of both methods is given below.)
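
Here is a minimal Python sketch of the two scaling methods discussed above (my illustration, not the authors' code): minmax_scale follows the Kell quotation, column by column, and zscore_scale is the subtract-the-mean alternative.

    import numpy as np

    def minmax_scale(X, headroom=0.0):
        # Scale each column (each individual input) to roughly [0, 1],
        # as described in the quotation above.  With headroom > 0 the
        # extremes are padded so no scaled value sits exactly at 0 or 1.
        xmin = X.min(axis=0)
        xmax = X.max(axis=0)
        if headroom > 0.0:
            span = xmax - xmin
            xmin = np.maximum(xmin - headroom * span, 0.0)  # "but not less than 0"
            xmax = xmax + headroom * span
        return (X - xmin) / (xmax - xmin)

    def zscore_scale(X):
        # The other method: subtract each column's mean and divide by
        # its standard deviation.
        return (X - X.mean(axis=0)) / X.std(axis=0)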

* "How Biased is your Multi-Layer Perceptron?" by Martin Brown, P.C. An, C. J. Harris and H. Wang. An abstract is available from Southampton University, United Kingdom. To get a copy of the paper see Southampton University, United Kingdom. This was also published in volume 3 of the 1993 World Congress on Neural Networks, pp507-511. It was online until the lawyers at Southampton University found out about it and had it removed because having it publicly available was or could be a copyright violation. The authors here present arguments to show that learning is faster when the input values of a two-layer network with a linear output function are in the range -b to b rather than 0 to b. This is something people have noticed in 3-layer networks but in this article there is a theoretical explanation for it as well. Using the same reasoning they propose that using tanh for the activation function on the hidden layer of a three-layer network is also better. From time to time people have reported than tanh is better than the logistic but I had never found a case where this was so. But I went ahead and tried again with a problem I had not tried running this way before: sin(x) with the periodic update method and adjusting the etas in each layer and the gain in the hidden layer. I was surprised to find that networks converged twice as fast with tanh as with the logistic. Further testing showed tanh worked better with Rprop, SuperSAB and Quickprop with the inputs on the interval -pi to pi AND from 0 to 2 pi. Of course the fact than tanh was better here does not mean that tanh is ALWAYS better. If you try to follow the theory in this paper note that tau defined in equation 10 is used to fit the equation: exp(-t/tau) and not exp(-tau*t) as I had expected. (I asked Martin Brown about that.)

* "When Networks Disagree: Ensemble Methods for Hybrid Neural Networks" by Michael P. Perrone and Leon N. Cooper from the Neuroprose Archive at Ohio State. This paper proves that under certain conditions that averaging the output units of a number of networks will give you a better estimate than the best of these networks by itself. Unfortunately this only works under certain conditions and in my experiments with the sonar data the method fails but then maybe you'll have better luck with it on your problem. Now I do have a customer trying to predict the stock market who says that averaging a number of networks, including different size networks, ALWAYS gives a result better than the best network. The paper also includes a useful insight into why it works when it works. For a simple example where averaging outputs does work get the led problem (a C program to generate data) in the Machine Learning Database from the University of California at Irving.

* "Eigenvalues of Covariance Matrices: Application to Neural Network Learning" by Y. Le Cun, I Kanter and S. A. Solla in Physical Review Letters vol. 66, nr. 18 (1991) pp. 2396-2399. This recommendation came up in cainn from Morten With Pedersen, Message-ID: <487fda$fru@news.uni-c.dk>, however I have never taken the time to look it up. (If its really there would someone let me know?) The article explains why having inputs with zero means helps.

* "On the Recognition Capabilities of Feedforward Nets" by Eduardo D. Sontag available by ftp from the Neuroprose Archive at Ohio State. This is one very mathematical article and is definitely not recommended for those with Math Anxiety.

* "Feedback Stabilization Using Two-Hidden Layer Nets" by Eduardo D. Sontag available by ftp from the Neuroprose Archive at Ohio State. This is one very mathematical article and is definitely not recommended for those with Math Anxiety. But the bottom line, quoted from the abstract is: "A general result is given showing that nonlinear control systems can be stabilized using four layers, but not in general using three layers."

* "Speeding up Back-Propagation" by Yoshio Izui and Alex Pentland, in IJCNN-90-WASH-DC, volume 1, pp. 639-642, Lawrence Erlbaum, 1990. The result here, that changing the sharpness/gain of the activation function can help, is important, but there is no way to predict ahead of time what value will be best. Thus, there is no need to look up this article unless you want to see the mathematical proof of the time-to-convergence estimates.
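
For reference, the sharpness/gain is just an extra multiplier inside the activation function, as in this small sketch (mine, not the paper's notation):

    import numpy as np

    def logistic_with_gain(x, gain=1.0):
        # logistic activation with an adjustable sharpness/gain parameter;
        # gain > 1 steepens the transition, gain < 1 flattens it
        return 1.0 / (1.0 + np.exp(-gain * x))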

* "Multilayer Feedforward Networks With a Nonpolynomial Activation Function Can Approximate Any Function" by Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus and Shimon Schocken Neural Networks volume 6, number 6, 1993, pp 861-867.

* "Improving Model Selection by Nonconvergent Methods" by William Finnoff, Ferdinand Hergert and Hans Georg Zimmermann in Neural Networks vol. 6, no. 6, pp771-783, 1993. This article gives experimental results for many algorithms that have been proposed to improve the results of training.

  • To Part 3 of the Articles

    If you have any questions or comments, write me.

    To Don's Home Page