Professional Basis of AI Backprop Hypertext Documentation

Copyright (c) 1990-97 by Donald R. Tveter

Delta-Bar-Delta

The delta-bar-delta method attempts to find a learning rate, eta, for each individual weight as training proceeds. The parameters are the initial value for the etas, the amount (kappa) by which to increase an eta that seems to be too small, the decay factor by which to decrease an eta that is too large, a maximum value for each eta, and a parameter (theta) used in keeping a running average of the slopes. Here are examples of setting these parameters:

d d 0.5    * sets the decay rate to 0.5
d e 0.1    * sets the initial etas to 0.1
d k 0.25   * sets the amount to increase etas by (kappa) to 0.25
d m 10     * sets the maximum eta to 10
d t 0.7    * sets the history parameter, theta, to 0.7
These settings can all be placed on one line:

d d 0.5  e 0.1  k 0.25  m 10  t 0.7
The version implemented here does not use momentum.

The idea behind the delta-bar-delta method is to let the program find its own learning rate for each weight. The `e' sub-command sets the initial value for each of these learning rates. When the program sees that the slope of the error surface for a particular weight averages out to be in the same direction for several iterations, it increases that weight's eta by the amount kappa given by the `k' parameter, so the network moves down that slope faster. When the program finds that the slope changes sign, the assumption is that it has stepped over to the other side of the minimum, so it cuts the learning rate down by the decay factor given by the `d' parameter. For instance, a d value of 0.5 cuts the learning rate for the weight in half. The `m' parameter specifies the maximum allowable value for an eta. The `t' parameter (theta) is used to compute a running average of the slope for each weight and must be in the range 0 <= t < 1. The running average at iteration i, a[i], is defined as:

a[i] = (1 - t) * slope[i] + t * a[i-1],

so small values of t make the most recent slope more important than the previous average of the slope. Determining the learning rates for back-propagation automatically is, of course, very desirable, and this method often speeds up convergence by quite a lot. Unfortunately, bad choices for the delta-bar-delta parameters give bad results, and a lot of experimentation may be necessary. If you have n patterns in the training set, try starting eta and kappa at around 1/n.
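
The following is a minimal sketch in C of the per-weight update rule described above. The type and function names (Wt, dbd_update) and the toy error surface in main are only illustrative assumptions and are not taken from the actual program source.

    #include <stdio.h>

    typedef struct {
        double weight;   /* the connection weight */
        double eta;      /* per-weight learning rate, set initially by `d e' */
        double avg;      /* running average of the slope, a[i-1] */
    } Wt;

    /* parameters normally set with the `d' command */
    static double decay   = 0.5;   /* d d: factor used to shrink an eta */
    static double kappa   = 0.25;  /* d k: amount added to an eta */
    static double max_eta = 10.0;  /* d m: ceiling on every eta */
    static double theta   = 0.7;   /* d t: history parameter */

    /* one delta-bar-delta step for one weight, given the current slope
       (the derivative of the error with respect to that weight) */
    void dbd_update(Wt *w, double slope)
    {
        /* compare the new slope with the running average of past slopes */
        if (slope * w->avg > 0.0)
            w->eta += kappa;        /* same direction: speed up */
        else if (slope * w->avg < 0.0)
            w->eta *= decay;        /* sign change: we overshot, slow down */
        if (w->eta > max_eta)
            w->eta = max_eta;

        /* plain gradient descent step; this version does not use momentum */
        w->weight -= w->eta * slope;

        /* a[i] = (1 - t) * slope[i] + t * a[i-1] */
        w->avg = (1.0 - theta) * slope + theta * w->avg;
    }

    int main(void)
    {
        Wt w = { 2.0, 0.1, 0.0 };          /* initial eta 0.1, as in `d e 0.1' */
        for (int i = 0; i < 20; i++) {
            /* toy error surface E = 0.5 * weight * weight, so slope = weight */
            dbd_update(&w, w.weight);
            printf("iter %2d  weight % .5f  eta %.4f\n", i + 1, w.weight, w.eta);
        }
        return 0;
    }

On this toy quadratic error surface the eta for the weight grows by kappa while the slope keeps the same sign, and is cut by the decay factor once the weight overshoots the minimum, which is the behavior described above.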