Copyright 1996-2001 by Donald R. Tveter, commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html

* "Constructive Learning of Recurrent Neural Networks: Limitations of Recurrent Cascade Correlation and a Simple Solution" by C. Lee Giles, Dong Chen, Hsing-Hen Chen, Yee-Chung Lee and Mark W. Goudreau available from NEC Research Institute, New Jersey. The solution to the problems of conventional RCC is to feed the output(s) back into all previously frozen hidden layer unit(s). While this fixes the problem with RCC, it can slow down convergence for large networks.

* "Learning Long-Term Dependencies is not as Difficult with NARX Recurrent Neural Networks" by Tsungnan Lin, Bill G. Horne, Peter Tino and C. Lee Giles available from NEC Research Institute, New Jersey. If you've already got a recurrent network program then making the changes to get a NARX network will probably be easy. I know of someone who tried this on a stock market prediction problem where it did not improve results, but as with all backprop methods you never know when they will work. If you try this and can say anything good or bad please let me know.
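For readers who have not seen one, the structure of a NARX model is simple enough to sketch: the next output is computed from tapped delay lines of both the inputs and the network's own past outputs. What follows is only my illustrative sketch (the tap counts and weights are made up), not the authors' code:

```python
import numpy as np

def narx_step(w, x_hist, y_hist):
    """One step of a minimal NARX model: the next output is a nonlinear
    function of recent INPUTS and of the model's own recent OUTPUTS."""
    z = np.concatenate([x_hist, y_hist, [1.0]])  # delay lines plus a bias
    return np.tanh(w @ z)

rng = np.random.default_rng(0)
n_in_taps, n_out_taps = 3, 2          # made-up tap counts
w = rng.normal(0.0, 0.5, n_in_taps + n_out_taps + 1)

u = rng.normal(size=10)               # an arbitrary input sequence
x_hist = np.zeros(n_in_taps)
y_hist = np.zeros(n_out_taps)
outputs = []
for t in range(len(u)):
    x_hist = np.roll(x_hist, 1); x_hist[0] = u[t]   # shift in the new input
    y = narx_step(w, x_hist, y_hist)
    y_hist = np.roll(y_hist, 1); y_hist[0] = y      # feed the output back
    outputs.append(float(y))
```

Training such a model is essentially backprop over the delay lines, which is why converting an existing recurrent network program is usually easy.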

*
"SuperSAB: Fast Adaptive Back Propagation with Good Scaling Properties"
by Tom Tollenaere in *Neural Networks*, volume 3, pages 561-573.
CAUTION: there were some obvious typographical errors in the original
algorithm.
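Because of those typographical errors it may help to state the usual rules in code: each weight gets its own learning rate, which grows while the gradient keeps its sign and shrinks (with the last step backed out) when the sign flips. This is my sketch of the commonly cited form, run on a toy quadratic, not a transcription of the paper:

```python
def supersab_minimize(grad, w, n_steps=200, eta0=0.1, up=1.05, down=0.5):
    """Per-weight adaptive learning rates in the SuperSAB spirit:
    grow the rate while the gradient keeps its sign, shrink it (and
    back out the last step) when the sign flips.  A sketch of the
    commonly cited rules, not a transcription of the paper."""
    n = len(w)
    eta = [eta0] * n
    prev_g = [0.0] * n
    prev_dw = [0.0] * n
    for _ in range(n_steps):
        g = grad(w)
        for i in range(n):
            if g[i] * prev_g[i] > 0:        # same direction: speed up
                eta[i] *= up
            elif g[i] * prev_g[i] < 0:      # overshoot: undo and slow down
                w[i] -= prev_dw[i]
                eta[i] *= down
                prev_g[i] = 0.0             # take no step this iteration
                prev_dw[i] = 0.0
                continue
            dw = -eta[i] * g[i]
            w[i] += dw
            prev_dw[i] = dw
            prev_g[i] = g[i]
    return w

# minimize f(w) = w0^2 + 10 * w1^2
w = supersab_minimize(lambda w: [2 * w[0], 20 * w[1]], [3.0, -2.0])
```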

*
"Speed Improvement of the Back-Propagation on Current Generation
Workstations" by D. Anguita, G. Parodi and R. Zunino in *Proceedings
of the World Congress on Neural Networking*, Portland, Oregon, 1993,
volume 1 pages 165-168, Lawrence Erlbaum/INNS Press, 1993.

*
"Why Two Hidden Layers are Better than One" by Daniel L. Chester in
*IJCNN-90-WASH-DC*, Lawrence Erlbaum, 1990, volume 1, pp265-268.
The bottom line here is:

The problem with a single hidden layer is that the neurons interact with each other globally, making it difficult to improve an approximation at one point without worsening it elsewhere....

(With 2 hidden layers) the effects of the neurons are isolated and the approximations in different regions can be adjusted independently of each other, much as is done in the Finite Element Method for solving partial differential equations or the spline technique for fitting curves.

*
"Operational Experience with a Neural Network in the Detection of
Explosives in Checked Airline Luggage" by Patrick M. Shea and Felix
Liu in *IJCNN San Diego*, June 17-21 1990, IEEE Press, volume 2,
pp 175-178. The authors report slightly better results with two
hidden layers but it took much longer to train the network.

*
"Neural Networks for Bond Rating Improved by Multiple Hidden Layers"
by Alvin J. Surkan and J. Clay Singleton in *IJCNN San Diego* June
17-21, 1990, volume 2, IEEE Press, pp 157-162.

*
"Backpropagation Neural Networks with One and Two Hidden Layers" by
Jacques de Villiers and Etienne Barnard in *IEEE Transactions on
Neural Networks*, vol 4, no 1, January 1993, pp 136-141. The bottom
line here was:

The above points lead us to conclude that there seems to be no reason to use four layer networks in preference to three layer nets in all but the most esoteric applications.

* "A Better Activation Function for Artificial Neural Networks" by David Elliott, available by ftp from the University of Maryland.
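Elliott's function is f(x) = x / (1 + |x|), attractive because it needs no call to exp() yet is bounded in (-1, 1) like tanh; a minimal implementation:

```python
def elliott(x):
    """Elliott's sigmoid-like activation: cheap (no exp()), output in (-1, 1)."""
    return x / (1.0 + abs(x))

def elliott_deriv(x):
    """Its derivative, also exp()-free."""
    return 1.0 / (1.0 + abs(x)) ** 2
```

It approaches its asymptotes only polynomially fast, much more slowly than tanh, so networks using it can need more iterations to saturate their units; the saving is per-iteration cost.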

* "A Tree-Structured Neural Network for Real-Time Adaptive Control" by Alois P. Heinz available by http from the University of Freiburg, Germany. Note that while the paper mentions the approximation to the sigmoid, it otherwise has nothing to do with activation functions.

* "An Introduction to the Conjugate Gradient Method Without the Agonizing Pain" by Jonathan Richard Shewchuk. The entire uncompressed postscript file is about 1.7M and is available by http from Carnegie-Mellon. It is also available by ftp from Carnegie-Mellon, and in four parts ( part 1, part 2, part 3, part 4 ) by ftp from Carnegie-Mellon. This account is fairly readable but you still need to know calculus and some elementary linear algebra.
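The algorithm the tutorial builds up to is compact; for the linear (quadratic-form) case it is only a few lines. A sketch for a symmetric positive-definite system, with made-up example data:

```python
import numpy as np

def conjugate_gradient(A, b, x=None, tol=1e-10):
    """Linear conjugate gradient for a symmetric positive-definite A,
    the building block Shewchuk's tutorial develops step by step."""
    x = np.zeros_like(b) if x is None else x
    r = b - A @ x              # residual = negative gradient of the quadratic form
    d = r.copy()               # first search direction is steepest descent
    rs = r @ r
    for _ in range(len(b)):
        Ad = A @ d
        alpha = rs / (d @ Ad)  # exact line search along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d  # new direction, A-conjugate to the old ones
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)
```

For network training the quadratic form is replaced by the error surface and the exact line search by an approximate one, which is where most of the tutorial's care goes.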

*
Patrick van der Smagt has a couple of articles online that are much the
same as his article, "Minimisation Methods for Training Feedforward
Neural Networks" in *Neural Networks*, volume 7, number 1, pp 1-11,
1994. One version is available from the
German Aerospace Research Establishment
but contains less theory than the *Neural Networks* article. A
newer version is chapter 2 of his thesis from the
German Aerospace Research Establishment or from the
University of Amsterdam.
The references of the latter paper are not included, but are available
in another chapter of his thesis from the
German Aerospace Research Establishment
and his thesis from the
University of Amsterdam.
His C software is available from the
University of Amsterdam. I have
never tried it; if it is any good please let me know. As to the
*NN* article I noticed one test mentioned toward the end that
could be redone. Patrick used an adaptive learning rate algorithm by
Silva and Almeida on the sin(x) * cos(2*x) problem and reported that
it took on the order of 2 million function evaluations to meet the
tolerance. A problem with that result is that it used data that ran
from 0 to 2 * pi, a range that certainly does hurt convergence. When
I tried this problem with rprop and data symmetric around 0 it
converged fairly reliably with 180,000 function evaluations. This is
still MUCH WORSE than the result for the CG method but it does go to
show how careful you have to be with comparing methods.
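Since rprop comes up in that comparison, its update rules are worth sketching: each weight keeps a private step size adapted from the SIGN of the gradient alone, which is what makes it insensitive to error-surface scaling. A sketch of the common rules (increase factor 1.2, decrease factor 0.5) on a toy quadratic; details vary between published descriptions:

```python
def rprop_minimize(grad, w, n_steps=200, d0=0.1, dmin=1e-6, dmax=50.0):
    """Rprop in outline: per-weight step sizes adapted only from the
    sign of the gradient.  A sketch of the common rules, not any one
    paper's exact variant."""
    n = len(w)
    delta = [d0] * n
    prev_g = [0.0] * n
    for _ in range(n_steps):
        g = grad(w)
        for i in range(n):
            s = g[i] * prev_g[i]
            if s > 0:                               # same sign: lengthen the step
                delta[i] = min(delta[i] * 1.2, dmax)
            elif s < 0:                             # sign flip: shorten, skip a step
                delta[i] = max(delta[i] * 0.5, dmin)
                prev_g[i] = 0.0
                continue
            if g[i] > 0:
                w[i] -= delta[i]
            elif g[i] < 0:
                w[i] += delta[i]
            prev_g[i] = g[i]
    return w

# minimize f(w) = w0^2 + 10 * w1^2; note rprop never uses the gradient magnitude
w = rprop_minimize(lambda w: [2 * w[0], 20 * w[1]], [3.0, -2.0])
```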

*
"Dynamic Node Creation in Backpropagation Networks" by Timur Ash in *
Connection Science* volume 1, pages 365-375, 1989.
My experience with this is that it will get you out of a local minimum
in artificial problems like xor, but it does not seem to be useful in
real-world problems and in fact it may hurt. Moreover, it tends to
degenerate to just adding a hidden node at some regular interval.
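The trigger is the interesting part of such schemes: a new hidden unit is added when the training error curve flattens out. A hypothetical plateau test in that spirit (the window and threshold are made-up parameters, and this is not Ash's exact criterion):

```python
def should_add_node(error_history, window=50, min_drop=0.01):
    """Hypothetical plateau test in the spirit of dynamic node creation:
    add a hidden unit when error has dropped by less than min_drop
    (relative) over the last `window` epochs.  Not Ash's exact rule."""
    if len(error_history) < window:
        return False
    old, new = error_history[-window], error_history[-1]
    return (old - new) / max(old, 1e-12) < min_drop
```

You can see how the degenerate behavior arises: once error barely moves, a test like this fires every time the window refills, i.e. at a regular interval.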

*
"Efficiency of Modified Backpropagation and Optimization Methods on a
Real-world Medical Problem" by Dogan Alpsan, Michael Towsey, Ozcan
Ozdamar, Ah Chung Tsoi and Dhanjoo N. Ghista in *Neural Networks,*
volume 8, number 6, pp 945-962. The authors tried various methods to
speed up training and improve generalization on their problem. One set
of experiments simply trained the networks to within a given tolerance
while in the second set they trained to a local minimum. In the first
set plain backprop did very well. In the second set the more
sophisticated algorithms found local minima that backprop could not,
yet ultimately this did not improve the performance on the test set.

* "The Interchangeability of Learning Rate and Gain in Backpropagation Neural Networks" by G. Thimm, P. Moerland and E. Fiesler available from Dalle Molle Institute for Perceptive Artificial Intelligence, Switzerland. This little article shows how gain is related to the learning rate and size of the initial weights.
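The relationship is easy to check numerically: a sigmoid unit with gain g and weights w behaves exactly like a gain-1 unit with weights g*w, and a gradient step with learning rate eta at gain g matches a step with rate eta*g*g on the scaled weights. A small check of that identity on a single unit (my notation, not the paper's):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def step_with_gain(w, x, target, gain, eta):
    """One delta-rule step on a single sigmoid unit with an explicit gain.
    E = 0.5 * (out - target)^2, out = sigmoid(gain * w * x)."""
    out = sigmoid(gain * w * x)
    grad = (out - target) * out * (1 - out) * gain * x   # dE/dw
    return w - eta * grad

def step_without_gain(v, x, target, eta):
    """The same step with the gain folded into the weight (v = gain * w)."""
    out = sigmoid(v * x)
    grad = (out - target) * out * (1 - out) * x          # dE/dv
    return v - eta * grad

g, w0, x, t, eta = 2.0, 0.7, 1.3, 1.0, 0.1
w_new = step_with_gain(w0, x, t, g, eta)
v_new = step_without_gain(g * w0, x, t, eta * g * g)
# g * w_new and v_new agree: gain g at rate eta == gain 1 at rate eta * g^2
```

So a larger gain is interchangeable with larger initial weights plus a learning rate scaled by the square of the gain, which is the article's point.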

* "What Size Neural Network Gives Optimal Generalization? Convergence Properties of Backpropagation" by Steve Lawrence, C. Lee Giles and Ah Chung Tsoi available from NEC Research Institute, New Jersey. This is a 19 page technical report containing experiments on networks showing that minimizing the number of hidden units does not always lead to the best generalization. It includes pointers to other such articles and explanations of why this is so.

* "An Alternative Choice of Output in Neural Network for the Generation of Trading Signals in a Financial Market" by Charles Lam and Lam Kin at The University of Hong Kong. This html paper with gifs explores another method for predicting the ups and downs of an individual stock. It includes references to many other papers on the subject and to Charles Lam's thesis. His online thesis is available in HTML and Microsoft Word format; unfortunately it is not available in postscript.

* "Back Propagation Family Album" by Jondarr Gibb, from Macquarie University, Australia, is a 72 page postscript file describing variations on backprop. Ultimately there are only 48 pages of text; the rest consists of references, appendices, title page, etc. The text is fairly easy reading and the appendices include pseudo-code for quickprop and the scaled conjugate gradient algorithms.

*
"Capabilities of a Four-Layered Feedforward Neural Network:
Four Layers Versus Three" by Shin'ichi Tamura and Masahiko
Tateishi in *IEEE Transactions on Neural Networks*, Vol 8,
No 2, March 1997. This paper gives a proof that most normal
functions can be approximated as closely as you like with
two hidden layers using only N/2+3 hidden layer units where
N is the number of patterns. This is an improvement over the
result for a single hidden layer where N-1 units are needed.
The implication is that 4 layer networks MAY give better results
than 3 layer networks because fewer units and therefore fewer
connections are needed. This paper is only a proof and no
experimental evidence is provided. The paper is available
online for IEEE members.