Professional Basis of AI Backprop Hypertext Documentation

Copyright (c) 1990-97 by Donald R. Tveter

Gradient Descent

Gradient descent is plain backpropagation without momentum or any other means of speeding up the training; the goal is simply to move the training downhill along the slope, or gradient, of the error. In practice gradient descent is so slow it is rarely used. The first basic improvement was to use momentum to speed up the training, and while that is not strictly gradient descent I've lumped it in under the title "Gradient Descent" anyway.
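
To make the idea concrete, here is a minimal C sketch of one plain gradient descent step for a single weight. The variable names and numbers are invented for illustration and are not taken from the program; the slope is assumed to already carry the sign that moves the error downhill.

#include <stdio.h>

/* A minimal sketch of one plain gradient descent step for a single
   weight: the weight moves downhill along the error slope.
   The names w, slope and eta are illustrative, not from the program. */
int main(void)
{
    double w = 0.3;      /* current weight                              */
    double slope = 0.8;  /* slope for this weight, already signed so    */
                         /* that adding it moves the error downhill     */
    double eta = 0.1;    /* learning rate                               */

    double deltaw = eta * slope;  /* deltaw = eta * slope               */
    w = w + deltaw;

    printf("new weight = %f\n", w);  /* prints 0.380000 */
    return 0;
}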

The Etas

In most derivations of backprop eta is the learning rate, so that is the name used here. There is one entry box for the upper level of a three or more layer network and another entry box for the lower level(s). One typed command sets both learning rates, so if you want the upper level eta to be 0.5 and the lower level eta to be 0.05 the typed command is:

e 0.5 0.05

With the differential step size derivative the original recommendation was to make the lower level eta much smaller (1/10th) than the upper level eta, but you can never be sure what the best values are; you must experiment.
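
For illustration only, here is a small C sketch of the idea of using two etas, one for the upper (hidden to output) weights and a smaller one for the lower (input to hidden) weights, as set by "e 0.5 0.05". The network size, weight values and slope values are all invented; this is not the program's source code.

#include <stdio.h>

#define IN  2   /* input units  */
#define HID 3   /* hidden units */
#define OUT 1   /* output units */

/* Sketch only: separate learning rates for the two levels of a
   three-layer network.  eta_upper applies to hidden-to-output
   weights, eta_lower to input-to-hidden weights.  All the arrays
   and numbers below are invented for illustration.               */
int main(void)
{
    double eta_upper = 0.5, eta_lower = 0.05;
    double w_upper[OUT][HID] = {{0.1, -0.2, 0.3}};
    double w_lower[HID][IN]  = {{0.1, 0.2}, {-0.1, 0.0}, {0.3, -0.3}};
    double slope_upper[OUT][HID] = {{0.02, 0.01, -0.03}};
    double slope_lower[HID][IN]  = {{0.004, -0.002}, {0.001, 0.0},
                                    {-0.005, 0.003}};
    int i, j;

    for (i = 0; i < OUT; i++)          /* upper level gets the big eta  */
        for (j = 0; j < HID; j++)
            w_upper[i][j] += eta_upper * slope_upper[i][j];

    for (i = 0; i < HID; i++)          /* lower level gets the small eta */
        for (j = 0; j < IN; j++)
            w_lower[i][j] += eta_lower * slope_lower[i][j];

    printf("first upper weight is now %f\n", w_upper[0][0]);
    return 0;
}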

Alpha

Alpha is the name most often given to the momentum value. If the last weight change was olddeltaw and the current weight change will be deltaw (deltaw = eta * slope) then alpha is used like so:

new weight change = deltaw + alpha * olddeltaw

Reasonable values for alpha are 0 to 0.999999; most often it is 0.9, but you never know what value is best except by trial and error. Alpha can be entered in the entry box, or the typed command is:

a 0.9   * set the momentum to 0.9
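
As a sketch of the momentum rule above in code (the names and numbers are invented and this is not the program's source):

#include <stdio.h>

/* Sketch of the momentum rule for one weight:
   new weight change = deltaw + alpha * olddeltaw
   where deltaw = eta * slope.  Everything here is invented
   for illustration.                                          */
int main(void)
{
    double w = 0.3;
    double olddeltaw = 0.05;   /* the previous weight change         */
    double eta = 0.1, alpha = 0.9;
    double slope = 0.8;        /* assumed to point downhill already  */

    double deltaw = eta * slope;
    double change = deltaw + alpha * olddeltaw;

    w = w + change;
    olddeltaw = change;        /* remember it for the next step      */

    printf("weight = %f, last change = %f\n", w, olddeltaw);
    return 0;
}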

The Update Methods

In the periodic update method (alias the batch update method) all the weight changes are summed up and made once after all the patterns have been run through the network. In this case the more patterns you have the larger the weight changes will be, but weight changes that are too large will ruin the training. To get around this, a reasonable value to choose for eta is 1/n or 2/n, where n is the number of training set patterns.
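
Here is a small C sketch of the periodic idea for a single weight, using an eta of 1/n. The slope values are invented and the real program of course handles whole weight matrices.

#include <stdio.h>

#define NPATTERNS 4   /* n, the number of training set patterns */

/* Sketch of periodic (batch) updating for one weight: the slopes
   from every pattern are summed and a single change is made after
   the whole training set has been run.  With more patterns the sum
   grows, so eta is scaled as 1/n (or 2/n).  Numbers are invented.  */
int main(void)
{
    double slopes[NPATTERNS] = {0.8, -0.2, 0.5, 0.1}; /* one slope per pattern */
    double w = 0.3;
    double eta = 1.0 / NPATTERNS;
    double sum = 0.0;
    int p;

    for (p = 0; p < NPATTERNS; p++)   /* run all the patterns first */
        sum += slopes[p];

    w = w + eta * sum;                /* one weight change per pass */

    printf("weight after one pass = %f\n", w);
    return 0;
}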

In the "wrong" continuous update method you move across the output units, at each unit you update the weights and then pass an error quota back to the hidden layer units. This is normally regarded as wrong, the right way to do it is to calculate all the weight changes but don't make them just yet, first pass back the error quotas and then make the changes. Even though one is "right" and the other is "wrong" (I programmed the wrong one first :)) they both work and sometimes the "wrong" one is a little faster. In the right and wrong continuous update methods the value for eta is up for grabs and it does not depend on how many patterns are in the training set.

You can make your choice of these methods by clicking the appropriate button or the typed commands are:

a up   * update is periodic
a uC   * update is the right continuous
a uc   * update is the wrong continuous