# Professional Basis of AI Backprop Hypertext Documentation

#### Copyright (c) 1990-97 by Donald R. Tveter

## Delta-Bar-Delta

The delta-bar-delta method
attempts to find a learning rate, eta, for each individual
weight as the training proceeds. The parameters are the
initial value for the etas, the amount by which to increase an
eta that seems to be too small, the rate at which to decrease
an eta that is too large, a maximum value for each eta, and
a parameter, theta, used to keep a running average of the slopes.
Here are examples of setting these parameters:

```
d d 0.5    * sets the decay rate to 0.5
d e 0.1    * sets the initial etas to 0.1
d k 0.25   * sets the amount to increase etas by (kappa) to 0.25
d m 10     * sets the maximum eta to 10
d t 0.7    * sets the history parameter, theta, to 0.7
```

These settings can all be placed on one line:

```
d d 0.5 e 0.1 k 0.25 m 10 t 0.7
```

The version implemented here does not use momentum.
The idea behind the delta-bar-delta method is to let the
program find its own learning rate for each weight. The `e'
sub-command sets the initial value for each of these learning
rates. When the program sees that the slope of the error
surface for a particular weight has averaged out to the same
direction over several iterations, it increases that weight's
eta by the amount, kappa, given by the `k' parameter.
The network will then move down this slope faster. When the
program finds that the slope has changed sign, the assumption
is that the search has stepped over to the other side of the
minimum, so it cuts the learning rate by the decay factor
given by the `d' parameter. For instance, a d value of
0.5 cuts the learning rate for the weight in half. The `m'
parameter specifies the maximum allowable value for an eta.
The `t' parameter (theta) is used to compute a running
average of the slope for each weight and must be in the range
*0 <= t < 1*. The running average at iteration i, a[i],
is defined as:

*a[i] = (1 - t) * slope[i] + t * a[i-1]*

Small values of t therefore make the most recent slope more
important than the previous average of the slope.
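
Below is a minimal sketch in C of the per-weight update rule
described above, using the parameter values from the examples.
The variable names (`eta', `kappa', `decay', `max_eta',
`theta', `bar_delta') are illustrative only and are not taken
from the program's source code.

```c
#include <stdio.h>

#define N_WEIGHTS 3

int main(void)
{
    double eta[N_WEIGHTS]       = {0.1, 0.1, 0.1}; /* d e 0.1  */
    double bar_delta[N_WEIGHTS] = {0.0, 0.0, 0.0}; /* running averages a[i] */
    double weight[N_WEIGHTS]    = {0.2, -0.4, 0.7};
    double kappa   = 0.25; /* d k 0.25 */
    double decay   = 0.5;  /* d d 0.5  */
    double max_eta = 10.0; /* d m 10   */
    double theta   = 0.7;  /* d t 0.7  */

    /* Slopes (dE/dw) for one training iteration; the values are
       made up purely for the example. */
    double slope[N_WEIGHTS] = {0.3, -0.1, 0.05};

    for (int j = 0; j < N_WEIGHTS; j++) {
        if (slope[j] * bar_delta[j] > 0.0) {
            /* The current slope agrees in sign with the running
               average, so the minimum is presumably still ahead:
               add kappa to eta, up to the maximum. */
            eta[j] += kappa;
            if (eta[j] > max_eta)
                eta[j] = max_eta;
        } else if (slope[j] * bar_delta[j] < 0.0) {
            /* The slope changed sign, so the search presumably
               stepped over the minimum: cut eta by the decay factor. */
            eta[j] *= decay;
        }

        /* a[i] = (1 - t) * slope[i] + t * a[i-1] */
        bar_delta[j] = (1.0 - theta) * slope[j] + theta * bar_delta[j];

        /* Plain gradient descent step with the per-weight eta;
           this version uses no momentum term. */
        weight[j] -= eta[j] * slope[j];

        printf("w[%d] = %8.5f  eta[%d] = %8.5f\n", j, weight[j], j, eta[j]);
    }
    return 0;
}
```

On the first iteration each running average is still zero, so no
eta changes; from the second iteration on, the sign comparison
drives each weight's eta up or down separately.
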
Determining the learning rate for back-propagation
automatically is, of course, very desirable, and this method
often speeds up convergence considerably. Unfortunately, bad
choices for the delta-bar-delta parameters give bad results,
and a lot of experimentation may be necessary. If you have n
patterns in the training set, try starting eta and kappa at
around 1/n; with 100 training patterns, for instance, that
means starting them at around 0.01.