Backpropagator's Review

Copyright 1996-2001 by Donald R. Tveter; commercial use is prohibited. Short quotations are permitted if proper attribution is given. This material CAN be posted elsewhere on the net if the posted files are not altered in any way, but please let me know where it is posted. The main location is: http://dontveter.com/bpr/bpr.html

Activation Functions

Last Change to This File: November 18, 1998

Neural network theory shows that backprop networks with linear output units and a single hidden layer of non-polynomial units can approximate most reasonable functions as closely as you like; for instance, see the article by Leshno, Lin, Pinkus and Schocken. There are, however, many activation functions that you can choose from, and each one has its own special virtues.

The original network activation function was the linear activation function:

y = D*x

where x is the input to the neuron, y is the final value of the neuron and usually D = 1. Linear functions are inadequate to approximate most functions; some non-linear functions are needed.

The standard sigmoid (or logistic) runs from 0 to 1 and it is:

y = 1 / (1 + exp (- D * x))

where x is the input to the neuron and most often D = 1, but see the gain entry. The derivative (for D = 1) is: y * (1 - y). There is some theory and there are a few experiments showing that hidden layer unit activation values centered around 0 will speed up training, so in some cases people subtract 0.5 from the above.
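
For concreteness, here is a minimal C sketch of this function and its derivative for D = 1; the function names are my own:

    #include <math.h>

    /* standard sigmoid (logistic) activation, D = 1 */
    double sigmoid(double x)
    {
        return 1.0 / (1.0 + exp(-x));
    }

    /* derivative written in terms of the unit's output y = sigmoid(x) */
    double sigmoid_deriv(double y)
    {
        return y * (1.0 - y);
    }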

The standard sigmoid can be approximated using the function:

      x                | f(x)
      -------------------------------------
      x >= 1           | 1
      -1 < x < 1       | 0.5 + x * (1 - abs(x) / 2)
      x <= -1          | 0

where x = input / 4.1. The maximum absolute deviation between this function and the standard sigmoid is less than 0.02. This function is mentioned in an online article by Alois P. Heinz.
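
A minimal C sketch of this approximation, assuming "input" is the raw net input to the unit (the function name is my own):

    #include <math.h>

    /* piecewise approximation to the standard sigmoid */
    double approx_sigmoid(double input)
    {
        double x = input / 4.1;       /* scale the net input */
        if (x >= 1.0)
            return 1.0;
        if (x <= -1.0)
            return 0.0;
        return 0.5 + x * (1.0 - fabs(x) / 2.0);
    }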

Another popular function is tanh; it has outputs in the range -1 to 1 and can be written as:

y = 2 / (1 + exp(-2 * x)) - 1

The derivative is: 1 - y*y. Because its values are centered around 0 there is reason to believe that using tanh will result in faster training; for instance, see the article by Brown et al. and the online article by Kalman and Kwasny. My experiments show that sometimes tanh is better but sometimes it is not.
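
A minimal C sketch of tanh written this way, with the derivative in terms of the output y (function names are my own); the standard library tanh() gives the same values:

    #include <math.h>

    /* tanh activation, range -1 to 1 */
    double tanh_act(double x)
    {
        return 2.0 / (1.0 + exp(-2.0 * x)) - 1.0;
    }

    /* derivative written in terms of the unit's output y = tanh_act(x) */
    double tanh_deriv(double y)
    {
        return 1.0 - y * y;
    }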

A numerically simpler approximation to the tanh function is described in an article by Anguita, Parodi and Zunino. The approximation they proposed is:

      x                     | f(x)
      -----------------------------------------------------------
      x > 1.92033           | 0.96016
      0 < x <= 1.92033      | 0.96016 - 0.26037 * (x - 1.92033)^2
      -1.92033 < x < 0      | 0.26037 * (x + 1.92033)^2 - 0.96016
      x <= -1.92033         | -0.96016

Besides being faster to compute, this function has a derivative term that is never less than 0.0781, so you can avoid fudging the derivative term.
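
A minimal C sketch of the table above (the constants come from the table; the function name and the handling of x = 0 are my own):

    /* Anguita, Parodi and Zunino style approximation to tanh */
    double approx_tanh(double x)
    {
        const double a = 1.92033, b = 0.26037, c = 0.96016;

        if (x > a)
            return c;
        if (x >= 0.0)
            return c - b * (x - a) * (x - a);
        if (x > -a)
            return b * (x + a) * (x + a) - c;
        return -c;
    }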

You can save on CPU time by approximating the sigmoid functions with a series of straight lines, that is, a piecewise-linear function. Piecewise-linear functions may require more iterations to solve the problem, but even so you will probably save a little on CPU time.
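
As one possible example (my own sketch, not taken from any particular article), the 0 to 1 sigmoid can be replaced by a single straight line with the sigmoid's slope at 0 (0.25), clipped at 0 and 1; using more line segments gives a closer fit:

    /* a very simple piecewise-linear stand-in for the standard sigmoid */
    double pwl_sigmoid(double x)
    {
        double y = 0.25 * x + 0.5;
        if (y > 1.0) return 1.0;
        if (y < 0.0) return 0.0;
        return y;
    }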

Sometimes the Gaussian function:

y = exp(- x * x)

is used, and in some cases it can produce faster training; for instance, the 2-1-1 xor network with direct input-to-output connections trains faster, and the Gaussian also improves the performance of the 10-5-10 encoder problem. The derivative is: -2*x*y, where x is the input and y is the value of the neuron.
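
A minimal C sketch (function names are my own); note that the derivative needs the input x as well as the output y:

    #include <math.h>

    /* Gaussian activation */
    double gaussian(double x)
    {
        return exp(-x * x);
    }

    /* derivative in terms of the input x and the output y = gaussian(x) */
    double gaussian_deriv(double x, double y)
    {
        return -2.0 * x * y;
    }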

The following sigmoid is mentioned in a very short online article by David Elliott; it runs from -1 to 1 and is also faster to compute:

y = x / (1 + |x|)

The derivative is: 1/((1+|x|)*(1+|x|)).

This sigmoid runs from 0 to 1 and is also faster to compute:

y = (x / 2) / (1 + |x|) + 0.5

The derivative is: 1/(2*(1+|x|)*(1+|x|)).

Both of these sigmoids approach their extremes more slowly. This means that if you are trying to output numerical values it will take more iterations to reach your target values. But if you're doing a classification problem you really only need the correct output to be greater than the other outputs, and here these functions will save on CPU time without influencing the number of iterations required very much.
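
A minimal C sketch of both of these functions and their derivatives (function names are my own):

    #include <math.h>

    /* Elliott sigmoid, range -1 to 1 */
    double elliott(double x)
    {
        return x / (1.0 + fabs(x));
    }

    double elliott_deriv(double x)
    {
        double d = 1.0 + fabs(x);
        return 1.0 / (d * d);
    }

    /* shifted and scaled version, range 0 to 1 */
    double elliott01(double x)
    {
        return (x / 2.0) / (1.0 + fabs(x)) + 0.5;
    }

    double elliott01_deriv(double x)
    {
        double d = 1.0 + fabs(x);
        return 1.0 / (2.0 * d * d);
    }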

Theory says that backprop can approximate most normal functions if and only if the hidden layer activation function is non-polynomial; however, the following function can also work:

y = sign(x) * x * x

and it has the virtues of running from minus infinity to plus infinity and being fast to compute. The derivative is sign(x)*2*x. Be careful, however: quite often the calculations will go wild unless you use very small learning rates, or better still, an acceleration algorithm that will automatically control the learning rates.
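
A minimal C sketch (function names are my own); note that sign(x)*2*x is just 2*|x|:

    #include <math.h>

    /* signed-square activation, range minus infinity to plus infinity */
    double signed_square(double x)
    {
        return (x >= 0.0) ? x * x : -(x * x);
    }

    double signed_square_deriv(double x)
    {
        return 2.0 * fabs(x);
    }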

If you have any questions or comments, write me.
