The items in this menu window deal with parameters that affect the numerical operations of the training process. A number of different variations on the original back-propagation algorithm have been proposed to speed up convergence, and some of these are included in this program.
There are a number of activation functions available and each one has its special virtues. The available functions are:
a  for an approximate version of t
g  for the Gaussian: exp(-(D*x)**2)
l  for the linear activation function: D * x
p  for the piece-wise linear version of s
s  for the traditional function: 1.0 / (1.0 + exp(-D * x))
t  for the function: tanh(D*x)
x  for the sigmoid: D * x / (1 + |D * x|)
y  for the sigmoid: (D * x / 2) / (1 + |D * x|) + 0.5
z  gives (D*x)**2 for x >= 0 and -(D*x)**2 for x < 0
where x is the sum of the inputs and D is the sharpness or gain (see the section below for a description of this parameter).
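As an illustration only (the program itself is written in C++, and the exact approximations it uses for a and p are implementation details not given here), the formulas above can be sketched like this:

```python
import math

# Sketch of the listed activation functions; x is the summed input
# and D is the sharpness/gain. The a and p approximations are omitted
# because their exact forms are implementation specific.
def act(kind, x, D=1.0):
    v = D * x
    if kind == 's':                      # traditional sigmoid, 0..1
        return 1.0 / (1.0 + math.exp(-v))
    if kind == 't':                      # tanh, -1..1
        return math.tanh(v)
    if kind == 'g':                      # Gaussian, peaks at 1 when x = 0
        return math.exp(-v * v)
    if kind == 'l':                      # linear
        return v
    if kind == 'x':                      # fast sigmoid, -1..1
        return v / (1.0 + abs(v))
    if kind == 'y':                      # fast sigmoid shifted to 0..1
        return (v / 2.0) / (1.0 + abs(v)) + 0.5
    if kind == 'z':                      # signed square
        return v * v if x >= 0 else -(v * v)
    raise ValueError(kind)

print(act('s', 0.0))   # 0.5
print(act('g', 0.0))   # 1.0
```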
The typed commands to set activation functions are:
a a <char>   * to set the activation function for all layers to <char>
a ah <char>  * to set the hidden layer(s) function to <char>
a ao <char>  * to set the output layer function to <char>
The s function is the standard smooth activation function originally used by researchers and it is still the most commonly used one. It runs from 0 to 1, which is usually inappropriate for the output layer units in numerical approximation problems.
The piece-wise activation function, designated by p, is a piece-wise linear approximation to the traditional function, and using it will normally save some CPU time even though it may increase the number of iterations needed to solve the problem. For example, on a 486/33, using p rather than s can take 20% less CPU time even while requiring more iterations. When the input to a node is 7.625 or more, or -7.625 or less, with D = 1, the derivative term becomes 0 and this may prevent learning. To beat this possible zero-derivative problem use one of the fudged derivative terms described in the derivatives section below.
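The zero-derivative problem can be seen in a sketch. The breakpoints below are an assumption, not the program's actual table: this stand-in simply clamps a linear ramp at the |x| = 7.625 points where the text says the derivative becomes zero (with D = 1).

```python
# Hedged sketch of a clamped-linear stand-in for the p function; the
# program's real piece-wise approximation may use different segments.
LIMIT = 7.625

def pl(x):
    if x >= LIMIT:  return 1.0
    if x <= -LIMIT: return 0.0
    return 0.5 + x / (2.0 * LIMIT)      # linear ramp through (0, 0.5)

def pl_deriv(x):
    # Outside the ramp the slope is exactly 0, so the weight-change
    # formula multiplies by 0 and the unit stops learning.
    return 0.0 if abs(x) >= LIMIT else 1.0 / (2.0 * LIMIT)

print(pl(8.0), pl_deriv(8.0))   # 1.0 0.0
```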
The tanh function is popular with many people because it runs from -1 to 1. Sometimes tanh is better than the standard sigmoid. The a function is a numerical approximation to tanh that is faster to compute and its values run from -0.96016 to +0.96016.
The functions x and y are other functions that people have tried. The x function runs from -1 to 1 and y runs from 0 to 1. Computing these functions seems to be about as fast as the piece-wise linear function, but they may sometimes give poor results in terms of the number of iterations required to solve a problem, especially with recurrent networks.
The linear function is usually used in the output layer of a network when the problem is a function evaluation type problem and the range of the outputs goes beyond 0 to 1 or -1 to 1. There is no point in using it in the hidden layer. Sometimes a linear output function will ruin the calculations when used with plain backprop, producing output values of "NaN" (not a number); in these cases use one of the self-adjusting algorithms such as delta-bar-delta, quickprop, rprop or SuperSAB.
The z function can also be used to produce outputs across a wide range. With this function the calculations often overflow, however, and in that case a value of D < 1 can help. The z function works well with the algorithms that automatically adjust the learning rates, but it is prone to fail with fixed learning rates.
The Gaussian function, g, seems to work especially well with the xor problem and in fact with this function the problem can be learned using only a 2-1 network. It is the traditional bell shaped function that runs from 0 at an input of minus infinity to 1 when the input is 0 and then approaches 0 as x approaches plus infinity.
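A quick sketch of why a 2-1 network is enough for xor with the Gaussian: the single output unit only needs its net input to be 0 on the (0,1) and (1,0) patterns and far from 0 on the other two. The weights below are hand-picked for the illustration, not values the program produced.

```python
import math

# One output unit with Gaussian activation exp(-(D*x)**2): a 2-1 net.
# w1 = w2 = 5 and bias = -5 put the net input at 0 exactly when one
# input is on, and at +/-5 (where the Gaussian is ~0) otherwise.
w1, w2, bias, D = 5.0, 5.0, -5.0, 1.0

def xor_net(i1, i2):
    x = w1 * i1 + w2 * i2 + bias
    return math.exp(-(D * x) ** 2)

for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(i1, i2, round(xor_net(i1, i2), 3))
# prints 0.0 for (0,0) and (1,1), 1.0 for (0,1) and (1,0)
```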
After you sum the inputs to a unit you can multiply the sum by a factor called the sharpness or gain. Normally the value is 1, but by increasing or decreasing it you can sometimes speed up convergence. It's also a cheap way of scaling down large input values (using the scale program is a better idea). Large input values can hurt training, and making the sharpness/gain less than 1 can help. There are two entry boxes here, one for all the hidden layer units and one for the output layer. The typed commands look like:
a Dh 2    * sets hidden layer sharpness to 2
a Do 0.9  * sets the output layer sharpness to 0.9
Implementation note: strictly speaking the value of D should be used in computing the derivatives for the weight changes; however, this implementation does not do so for any of the activation functions, on the grounds that it requires an extra multiplication. Normally you'll want to use one of the algorithms that self-adjust the learning rate anyway, and then multiplying by D is pointless. If you really want the extra D taken into account you can set the flag EXACTDERIV in the file real.cpp and recompile.
The correct derivative for the standard activation function is s(1-s), where s is the activation value of a unit; however, when s is near 0 or 1 this term permits only very small weight changes during the learning process. To counter this problem Scott Fahlman proposed the following derivative for the output layer:
0.1 + s(1-s)
and it produces faster training almost all the time.
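A small numeric comparison (a Python illustration, not the program's code) shows why the constant helps:

```python
# Near s = 0 or s = 1 the true derivative s*(1 - s) almost vanishes,
# so weight updates stall; Fahlman's 0.1 + s*(1 - s) keeps a floor.
def true_deriv(s):    return s * (1.0 - s)
def fahlman_deriv(s): return 0.1 + s * (1.0 - s)

for s in (0.01, 0.5, 0.99):
    print(s, true_deriv(s), fahlman_deriv(s))
# at s = 0.99 the true derivative is roughly 0.0099 while
# Fahlman's is roughly 0.1099, about 11 times larger
```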
Besides Fahlman's derivative and the original one, the differential step size method (a bad name) by Chen and Mars drops the derivative term in the formula for the output layer and uses the correct derivative term for all other layers. The name "differential step size" comes from the fact that the hidden layer learning rate should be much smaller than the output layer learning rate.
A paper by Schiffmann and Joost shows that simply dropping the derivative term in the output layer is appropriate for classification problems since that is what you get if you use the cross entropy error function rather than the squared difference error function.
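This can be checked numerically (a sketch, not code from the paper or the program): with a sigmoid output and the cross entropy error, the gradient with respect to the net input is just (o - t), i.e. the s(1-s) term cancels, which is the same as dropping it.

```python
import math

# Finite-difference check that d(cross-entropy)/d(net) = o - t for a
# sigmoid output unit, so the s*(1 - s) derivative term cancels.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cross_entropy(net, t):
    o = sigmoid(net)
    return -(t * math.log(o) + (1.0 - t) * math.log(1.0 - o))

net, t, h = 0.7, 1.0, 1e-6
numeric = (cross_entropy(net + h, t) - cross_entropy(net - h, t)) / (2 * h)
analytic = sigmoid(net) - t
print(numeric, analytic)   # both about -0.3318
```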
The typed commands are:
a dc  * use the correct derivative for whatever function
a dd  * use the differential step size derivative (default)
a df  * use Fahlman's derivative in only the output layer
a do  * use the original derivative (same as `c' above)
Hidden layer units can be added to the network as the training goes on. This has been shown to be an effective way of getting a network out of a local minimum. The possible choices here are no network building, using the Ash algorithm or using the pseudo-cascade correlation network structure when adding a hidden layer unit to a three layer network. The typed commands are:
a n -  * no network building
a n a  * Ash algorithm
a n p  * pseudo cascade correlation
The parameters that affect network building can be found in the Network (N) menu window.
The choices are the periodic (batch) method, the "right" continuous (online) method, the "wrong" continuous (online) method, delta-bar-delta, quickprop, rprop and SuperSAB. There are details on these methods in the D, G, Q, R and S menu windows. In many cases rprop will give the fastest training. The typed commands to set the update method are:
a uC  * for the "right" continuous update method
a uc  * for the "wrong" continuous update method
a ud  * for the delta-bar-delta method
a up  * for the original periodic update method (default)
a uq  * for the quickprop algorithm
a ur  * for the rprop algorithm
a us  * for SuperSAB
One way to improve generalization is to use weight decay. In this procedure each weight in the network is decreased by a small factor at the end of each training cycle. If the weight is w and the weight decay term is 0.0001 then the decrease is given by:
w = w - 0.0001 * w

Reasonable values to try for weight decay are 0.001 or less. There is one report that the best time to start weight decay is when the network reaches a minimum on the test set, but it is difficult to decide when the network has reached a minimum. There is no automatic way of turning on weight decay at a certain point, although it's one of those things I should add. The typed command is: "a wd 0.0005" to set the weight decay factor to 0.0005.
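One decay step as described above can be sketched in a few lines (a Python illustration with a hypothetical helper name, not the program's code):

```python
# Apply one weight-decay step: each weight shrinks by the decay
# factor at the end of a training cycle, w = w - decay * w.
def decay_weights(weights, decay):
    return [w - decay * w for w in weights]

weights = [1.0, -2.0, 0.5]
weights = decay_weights(weights, 0.0001)
print(weights)   # each weight shrunk by 0.01%
```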
The program will stop training when the output values are close enough to the target values. Close enough is defined by the 'tpu' command as in:
tpu 0.1  * tolerance per unit is 0.1

where every output unit for every pattern must be within 0.1 of its target value. This can be a little undesirable in that to meet this standard the network may be overtrained. Another, looser standard for convergence is the tolerance overall described below. In the A menu window there is an entry box where you can type in a new tolerance value.
Older versions of the software used the 't' command which only allowed the tolerance to be set to values in the range [0..1) but 'tpu' allows any positive value.
A looser stopping criterion is when the average error of all the output unit values over every pattern falls below a certain value. This means that many patterns will not be completely learned, but with real world data it is not often that all the patterns can be learned, especially without overfitting. The typed version of the command to set the tolerance overall to 0.2 is:
t 0.2

In the A menu window there is an entry box where you can type in a new tolerance value.
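The difference between the two stopping tests can be sketched as follows (hypothetical helper names; this is an illustration of the criteria described above, not the program's code):

```python
# tpu: every output of every pattern must be within tol of its target.
# overall: only the average absolute error must fall below tol.
def tpu_met(outputs, targets, tol):
    return all(abs(o - t) <= tol
               for out, tgt in zip(outputs, targets)
               for o, t in zip(out, tgt))

def overall_met(outputs, targets, tol):
    errs = [abs(o - t)
            for out, tgt in zip(outputs, targets)
            for o, t in zip(out, tgt)]
    return sum(errs) / len(errs) <= tol

outs = [[0.05], [0.85]]
tgts = [[0.0],  [1.0]]
print(tpu_met(outs, tgts, 0.1))      # False: 0.85 misses 1.0 by 0.15
print(overall_met(outs, tgts, 0.2))  # True: the average error is 0.1
```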