Professional Version Basis of AI Backprop Hypertext Documentation

Copyright (c) 1990-97 by Donald R. Tveter

Recurrent Networks

Recurrent back-propagation networks take values from hidden layer and/or output layer units and copy them down to the input layer for use with the next input. The values that are copied down are a kind of coded record of what the recent inputs to the network have been, and this gives the network a simple kind of short-term memory, possibly a little like human short-term memory. There are many variations on what can be done with recurrent networks and all in all the subject is rather messy. The topics are:

A Memorization Example
Another Memorization Example
NARX Networks
Time Series: The Predict Command
Recurrent Classification Problems
Recurrent Network Warnings

A Memorization Example

One of the simplest things you can do with a recurrent network is memorize a sequence. Normally there is no real use for this, since you want a network to produce results on unknown data, but the memorization problems are a good way to get used to recurrent networks.

The first memorization problem is this: suppose you want a network to memorize the two short sequences, "acb" and "bcd". In the middle of both of these sequences is the letter "c". In the first case you want the network to take in "a" and output "c", then take in "c" and output "b". In the second case you want the network to take in "b" and output "c", then take in "c" and output "d". To do this the network needs a simple memory of what came before the "c".

Let the network be a 7-3-4 network where input units 1-4 and output units 1-4 stand for the letters a-d. So the codes are:

a: 1000
b: 0100
c: 0010
d: 0001
In action, the network needs to do the following. When a is input, c must be output:

             0010     <- output layer

             hhh      <- hidden layer

          1000 stm    <- input layer
In this context, when c is input, b should be output:

             0100

             hhh

          0010 stm
For the other string, when b is input, c is output:

             0010

             hhh

          0100 stm
and when c is input, d is output:

             0001

             hhh

          0010 stm
This is easy to do if the network keeps a short-term memory of what its most recent inputs have been. Suppose we input a and the output is c:

             0010     <- output layer

             hhh      <- hidden layer

          1000 stm    <- input layer
Placing a on the input layer generates some kind of code (like a hash code) on the 3 units in the hidden layer. On the other hand, placing b on the input units will generate a different code on the hidden units. All we need to do is save these hidden unit codes and input them with a c. In one case the network will output b and in the other case it will output d. In one particular run inputting a produced:

               0  0  1  0

            0.993 0.973 0.020

           1  0  0  0  0  0  0

When c is input the hidden layer units are copied down to input to give:

                  0  1  0  0

              0.006 0.999 0.461

          0  0  1  0  0.993 0.973 0.020
For the other pattern, inputting b gave:

                  0  0  1  0

                0.986 0.870 0.020

              0  1  0  0  0  0  0
Then the input of c gave:

                    0  0  0  1

                0.005 0.999 0.264

          0  0  1  0  0.986 0.870 0.020

This particular problem can be set up as follows:

m 7 3 4
s 7
ci
t 0.2
rt {
1000 xxx   0010
0010 hhh   0100

0100 xxx   0010
0010 hhh   0001
}
where the first four values on each line are the normal input. The next group of 3 values are the short-term memory units: an x stands for the value of the unknown, whose default value is 0.5, and an h means take the value from the next hidden layer unit. When the program reads the training and test files the `x' value is replaced by its current value. When an `h' value is loaded into the network the value is taken from the next unused hidden layer unit.

By the way this simple problem does not converge particularly fast and you may need to do a number of runs before you hit on initial values that will work quickly. It will work more reliably with more hidden units.
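
If it helps to see the copy-down mechanism outside the program, here is a minimal Python sketch of the same idea. The weights, the sigmoid choice and the function names are illustrative assumptions, not the program's internals; the point is only how the hidden unit codes get saved and fed back in with the next letter.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_sequence(letters, W_ih, W_ho, n_hidden=3, unknown=0.5):
    # Feed a sequence through a 4+3 -> 3 -> 4 network, copying the hidden
    # activations down to the last 3 input units for the next step.  The
    # first step uses the unknown value (x) for those units.
    stm = np.full(n_hidden, unknown)          # the short-term memory units
    outputs = []
    for letter in letters:                    # each letter is a one-hot vector
        full_input = np.concatenate([letter, stm])
        hidden = sigmoid(W_ih @ full_input)   # the 3-unit "hash code"
        output = sigmoid(W_ho @ hidden)
        stm = hidden                          # save the code for the next step
        outputs.append(output)
    return outputs

# Run the sequence "a", "c" through a randomly weighted (untrained) network.
rng = np.random.default_rng(0)
W_ih = rng.normal(size=(3, 7))
W_ho = rng.normal(size=(4, 3))
a = np.array([1.0, 0, 0, 0])
c = np.array([0, 0, 1.0, 0])
print(run_sequence([a, c], W_ih, W_ho))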

Another Memorization Example

Another recurrent network included in the examples is one designed to memorize two lines of poetry. The two lines were:

I the heir of all the ages in the foremost files of time
For I doubt not through all the ages one increasing purpose runs
but for the sake of making the problem simpler each word was shortened to at most 5 characters, giving:
i the heir of all the ages in the frmst files of
time for i doubt not thru the ages one incre purpo runs
The letters were coded by taking the last 5 bits of their ASCII codes. The net can learn this sequence rather quickly with a 45-20-25 network. However, once upon a time I was wondering what would happen if the poetry network learned its lines and then the program was given several words in the middle of a verse. Would it pick up the sequence and be able to complete it given 1 or 2 or 3 or n words? So given, for example, the short sequence "for i doubt", will it be able to "get on track" and finish the verse? To test for this there is an extra pair of commands, tr and trp. Given a test set (which should be the training set) they start at every possible place in the test set, input n words and then check to see if the net produces the right answer. For this example I tried n = 3, 4, 5, 6 and 7 with the following results:

[?!abcdefhiklmnopqrstuvw]? tr 3
 TOL:  81.82 %  ERROR: 0.022967
[?!abcdefhiklmnopqrstuvw]? tr 4
 TOL:  90.48 %  ERROR: 0.005672
[?!abcdefhiklmnopqrstuvw]? tr 5
 TOL:  90.00 %  ERROR: 0.005974
[?!abcdefhiklmnopqrstuvw]? tr 6
 TOL: 100.00 %  ERROR: 0.004256
[?!abcdefhiklmnopqrstuvw]? tr 7
 TOL: 100.00 %  ERROR: 0.004513
So after getting just 3 words the program was 81.82% right in predicting the next word to within the desired tolerance. Given 6 or 7 words it was getting them all right. The trp command does the same thing except it also prints the final output value for each of the tests made.
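
For reference, the letter coding used in this example is easy to reproduce. Here is a small Python sketch that takes the last 5 bits of each character's ASCII code; padding short words with blanks coded as all zeros is my own assumption, not necessarily what the example file does.

def code_word(word, width=5):
    # Code each letter as the last 5 bits of its ASCII value; short words
    # are padded with blanks, which are coded here as all zeros.
    word = word.ljust(width)[:width]
    bits = []
    for ch in word:
        value = 0 if ch == ' ' else ord(ch) & 0b11111   # last 5 bits
        bits.extend(int(b) for b in format(value, '05b'))
    return bits

print(code_word('heir'))   # 25 bits: up to 5 letters at 5 bits each
print(code_word('i'))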

NARX Networks

There is no provision to take values from any hidden layer units other than the ones in the first hidden layer, so if your network is made with:

m 8 5 3 2
there is no way to copy the three values in the second hidden layer down to the input layer. (There does not seem to be much need for this.) But you can copy output layer units down to the input with the following notation. Suppose the network is an 8-4-4 network and you want to copy the output layer values down to the input. Use the input pattern:

0.1 0.2 0.3 0.4  o1 o2 o3 o4
where the first four numbers are normal input values and `o1' means copy the value from output unit 1, `o2' means copy the value from output unit 2, and so on.

There is an additional provision to take values from the input layer and then "shift them along" the input layer as more patterns are added. For example, you may want to take a hidden unit value or an output unit value and keep it around for more than a single time step, in effect keeping a history of what the recent values have been. (This is supposed to help learning and generalization in recurrent networks; see the paper: Learning Long Term Dependencies is not as Difficult with NARX Recurrent Neural Networks. NARX comes from Nonlinear AutoRegressive models with eXogenous inputs.) Suppose the network is 5-5-1. Let the normal inputs be:

1 2 3 4 5 6 7 8 9
and let the targets at each of these steps be:

-1 -2 -3 -4 -5 -6 -7 -8 -9
and let the outputs at each of these steps be:

-1.1 -2.1 -3.1 -4.1 -5.1 -6.1 -7.1 -8.1 -9.1
So you could do this as a 1-5-1 problem, but you want to save outputs and shift them along the input units. Now suppose we're in the middle of this problem: the input was 5 and the output was -5.1. You want to keep that -5.1 around as a kind of memory of what the network has been producing. In fact you want to use it three more times before throwing it away completely. What is on the network NOW, from the pass where the input was 5 and the output was -5.1, is:

                  -5.1

                h h h h h

          5 -1.1 -2.1 -3.1 -4.1
The next pattern you set up could look like this:

6 i3 i4 i5 o1    -6
Now with these numbers loaded on the input the network goes and creates a new output which is -6.1. The network looks like:

                    -6.1

                  h h h h h

            6 -2.1 -3.1 -4.1 -5.1
The next pattern given to the net will be:

7 i3 i4 i5 o1    -7
Now with these numbers loaded on the input the network goes and creates a new output which is -7.1. The network looks like:

                     -7.1

                   h h h h h

             7 -3.1 -4.1 -5.1 -6.1
The next pattern given to the net will be:

8 i3 i4 i5 o1    -8
Now with these numbers loaded on the input the network goes and creates a new output which is -8.1. The network looks like:

                     -8.1

                   h h h h h

            8 -4.1 -5.1 -6.1 -7.1
And it keeps going on. Here are the inputs to the network over all these steps all lined up:

          5 -1.1 -2.1 -3.1 -4.1
          6 -2.1 -3.1 -4.1 -5.1
          7 -3.1 -4.1 -5.1 -6.1
          8 -4.1 -5.1 -6.1 -7.1
Notice how the latest output value arrives on the right and then it gets shifted one location to the left for each new pattern until the value finally "falls off" or "goes away".
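
Here is a minimal Python sketch of that shifting history, with the network replaced by a stand-in function since only the bookkeeping is the point:

from collections import deque

def narx_inputs(series, history=4, unknown=0.5):
    # Produce the 5-unit input vectors for the 5-5-1 setup above: the new
    # input on the left, the last 'history' outputs shifted along to the
    # right, starting out filled with the unknown value x.
    past = deque([unknown] * history, maxlen=history)
    for x in series:
        yield [x] + list(past)       # what gets loaded on the input layer
        y = -(x + 0.1)               # stand-in for the network's output
        past.append(y)               # the newest output arrives on the right;
                                     # the oldest value falls off the left

for vector in narx_inputs([1, 2, 3, 4, 5, 6, 7, 8]):
    print(vector)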

Now you're kind of stuck at the beginning because you don't have four saved output values to work with. This version of the program looks at the data for the letters i, o and h, and if it finds them it sets an internal flag; then when training is done or the command is given to evaluate the training set, the network is set to the current value of the unknown, x. So you can input the following patterns and the network will have initial values to work with:

1 i3 i4 i5 o1   -1
2 i3 i4 i5 o1   -2
3 i3 i4 i5 o1   -3
4 i3 i4 i5 o1   -4
5 i3 i4 i5 o1   -5
6 i3 i4 i5 o1   -6
7 i3 i4 i5 o1   -7
8 i3 i4 i5 o1   -8
9 i3 i4 i5 o1   -9
If you really wanted the new input (the 1, 2, 3, ...) to go on the right of the pattern instead of the left you could do this:

 i2 i3 i4 o1 1  -1
 i2 i3 i4 o1 2  -2
 i2 i3 i4 o1 3  -3
 i2 i3 i4 o1 4  -4
 i2 i3 i4 o1 5  -5
 i2 i3 i4 o1 6  -6
 i2 i3 i4 o1 7  -7
 i2 i3 i4 o1 8  -8
 i2 i3 i4 o1 9  -9
Now here's something you can't do:

1 o1 i2 i3 i4   -1
2 o1 i2 i3 i4   -2
3 o1 i2 i3 i4   -3
4 o1 i2 i3 i4   -4
5 o1 i2 i3 i4   -5
6 o1 i2 i3 i4   -6
7 o1 i2 i3 i4   -7
8 o1 i2 i3 i4   -8
9 o1 i2 i3 i4   -9
You can see why by taking a close look at pattern 5. Values are loaded onto the input layer from left to right, so by the time the `i2' value is read, input unit 2 has already been overwritten with the value copied from output unit 1, and `i3' and `i4' then pick up that same value in turn. Instead of a shifted history, every memory slot ends up holding the latest output. An `i' reference has to point at an input unit that has not been overwritten yet, which is why the references run toward the right in the working patterns above.

Remember there is a simple "command" in the program that lets you evaluate a pattern simply by typing it at the bottom of the W95 screen or in the Tcl/Tk entry box. So for the xor problem you could type:

0 1
and there is no problem. With recurrent networks, however, the pattern may start with letters (such as i, o and h) that are also legitimate commands, and if one of those came first the program would end up executing that command instead. To avoid this problem use a `\' at the start of the pattern, as in this example:

\ o1 hhhhh
THIS IS REQUIRED ONLY AT THE COMMAND LINE LEVEL. When the program has been told to read patterns (the rt, rx and tf commands) using the `\' will cause an error.

As noted under the `Making a Network' command every time the number of input or output units is changed the patterns are normally thrown away. To avoid this the size of the network can be specified as:

m 4+3 3 4
But another problem is that when you change the number of hidden or input units the patterns don't match anymore; that is, there are more or fewer `h' codes stored in the patterns than there should be. To get around this there is another notation for patterns that solves the problem. The simple memorization task given above can be coded as:

m 4+3 3 4
a dd us
ss M 5 m 0.1
s 7
ci 2
t 0.2
rt {
1000 X   0010
0010 H   0100

0100 X   0010
0010 H   0001
}
In this case the X stands for as many values as are necessary to fill up the rest of the input units. The value used is the value of the unknown (x) at the time the pattern is loaded onto the network, not the value at the time the file was read, so if you change the unknown after reading the patterns the X values change with it. Likewise the H stands for taking as many values from the first hidden layer as are needed to fill out the input layer. For example, with the 4+3 network above the pattern `1000 X   0010' expands to the same input layout as `1000 xxx   0010'. Because X and H stand for "as many as are necessary" to fill out the rest of the input pattern they should only be used at the right end of the input pattern.

Time Series: The Predict Command

Rather than using recurrent networks to memorize sequences of letters, they are probably more useful for predicting the value of some variable at time t+1 given its value at t, t-1, t-2, ... . A very simple example of this is to have the network give the value of sin(t+1) given a recent history of inputs. Given a value of sin(t) the curve may be going up or down and the net needs to keep track of this in order to correctly predict the next value. The following setup will do this:

m 1+1 1 1
f ir
a aol dd uq
qp e 0.02
ci
rt {
   0.00000  X   0.15636
   0.15636  H   0.30887
   0.30887  H   0.45378

   . . .

  -0.15950  H  -0.00319
  -0.00319  H   0.15321
}
and in fact it converges rather rapidly.
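
If you want to regenerate data in that format yourself, a few lines of Python will do it. The step size of 0.157 is inferred from the values in the listing above, and the file name is made up; the layout is the 1+1 pattern format used in the setup.

import math

# Write sin(t) -> sin(t + 0.157) pairs in the 1+1 pattern format above:
# the input value, an X or H marker for the saved hidden unit, and the target.
step = 0.157
with open('sin.dat', 'w') as f:          # hypothetical file name
    for k in range(41):
        t = k * step
        marker = 'X' if k == 0 else 'H'
        f.write('%9.5f  %s  %9.5f\n' % (math.sin(t), marker, math.sin(t + step)))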

When you use a recurrent network to predict the next element of a time series you run into a problem: you can normally only predict THE NEXT ELEMENT of the series and NO FURTHER. For instance, suppose you're trying to predict the stock market. Your inputs may be the current value of the market index (or the most recent change), the inflation index, the unemployment index, the prime rate, the money supply and so on. Or perhaps the best thing to do is to input changes or percent changes of these values; I shall leave that issue to the experts. At any rate you can train a network to predict the next value, and if you've used all the inputs you have (right up to today) you will get a prediction for the next value of the market index, but since you don't know what the next set of input values will be you can't go any farther. Now you could ASSUME that the input values (prime rate, inflation rate, etc.) will not change much and continue to input the last values you have, but that is risky.

To evaluate the performance of such a network you can do the following. Suppose you have a total of 81 patterns to work with. You can train the network on the first 61 patterns and then let it predict the next value of the index. This gives you only one market index value as a prediction. Now train another network on the first 62 patterns. This gives you a second output value. Now train a network on the first 63 patterns to give another output value. If you continue doing this you get 20 predictions; average the error and you get an estimate of how well your network has learned the ebb and flow of the market. Under this plan the number of training set patterns keeps increasing. Sometimes this may be a good idea and sometimes it is better to limit the number of training set patterns. If you think the series is, as the mathematicians say, a stationary process, a series where the underlying fundamentals do not change, then you should use all your data. On the other hand, if the underlying fundamentals do change with time you may be better off using only a block of recent data.
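
The bookkeeping for this evaluation scheme is simple; here is a Python sketch in which train_and_predict is only a stand-in for training a network on the given patterns and returning its prediction for the next one.

def walk_forward_error(series, first_train=61):
    # Train on the first n patterns and predict pattern n + 1, growing n
    # each time as described above; return the average absolute error.
    def train_and_predict(history):
        return history[-1]            # stand-in: just repeat the last value
    errors = []
    for n in range(first_train, len(series)):     # 61, 62, ..., 80
        prediction = train_and_predict(series[:n])
        errors.append(abs(prediction - series[n]))
    return sum(errors) / len(errors)

# With 81 patterns this produces the 20 predictions mentioned in the text.
print(walk_forward_error(list(range(81))))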

To make such testing easy there is an additional command called predict that takes one data file and writes out a training set file and a test set file containing a single pattern. It then uses the benchmarking function for training (allowing more than one network to train) and writes the results to a file called bresults.tmp; the predict command reads the results, sets up another run with a new data set, and so on until all the possible sets have been used. There are more details on this in the predict command documentation.

Recurrent Classification Problems

Another type of recurrent network problem is to take in a series of measurements over time, perhaps like an electrocardiogram, and then classify the series as being normal or abnormal. In this type of problem a whole series of "minor patterns" is used to give an answer for one "major pattern". This will be called a "recurrent classification problem." In order to deal with this type of problem the program must be told how many minor patterns there are within each major pattern. Suppose there are 128. Then the program must be told: "f r 128" and EVERY major pattern must contain 128 minor patterns. The classification is made by averaging the output units over all the minor patterns rather than by adding up votes contributed by each minor pattern. To get the run command to output statistics based on major patterns rather than minor patterns you must also do: "f sr". This prints the average value of the outputs over all the minor patterns in the series. In the normal use of the program it stops when all the patterns have been learned to within the desired tolerance or when the overall tolerance has been met. With recurrent classification problems the program DOES NOT stop when the major patterns meet the closeness criterion; it goes on until all the minor patterns meet that criterion or the overall tolerance has been met.
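
In outline, the averaging step works like the following Python sketch, where the per-minor-pattern output vectors are simply given as lists. This is an illustration of the idea, not the program's code.

def classify_major_pattern(minor_outputs):
    # Average the output units over all the minor patterns in the series
    # and pick the class with the largest average, rather than letting
    # each minor pattern cast a separate vote.
    count = len(minor_outputs)                    # e.g. 128 minor patterns
    n_out = len(minor_outputs[0])
    averages = [sum(out[j] for out in minor_outputs) / count for j in range(n_out)]
    return averages.index(max(averages)), averages

# Three minor patterns, two output units: the series averages out to class 0
# even though the middle minor pattern by itself points to class 1.
print(classify_major_pattern([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]]))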

A sample recurrent classification problem, sinclass.bp, is included in the set of examples. It simply has to learn the difference between the two patterns, sin(x) and sin(2x). It is set up as follows:

* In this recurrent network example there are two sine waves being
* presented to the network.  First for class 1 the inputs are x and
* sin(x).  Then for class 2 the inputs are x and sin(2x).  The network
* has to learn that only the second input value can be used to determine
* which class each sequence belongs to.
m 2+2 2 2
a dd us
ss e 0.05 M 5
f ir pc r 41
rt sinclass.dat
s 7
ci
The data looks like this:

 0.00000  0.00000 X  1  * the first pattern is class 1
 0.15700  0.15636 H  1
 0.31400  0.30887 H  1
   ...
 6.12300 -0.15950 H  1
 6.28000 -0.00319 H  1

 0.00000  0.00000 X  2  * the second pattern is class 2
 0.15700  0.30887 H  2
 0.31400  0.58753 H  2
   ...
 6.12300 -0.31492 H  2
 6.28000 -0.00637 H  2
This example converges rather rapidly and the network produces the right answers even when many minor patterns are still wrong.
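
For reference, data in the layout of sinclass.dat can be generated with a few lines of Python. The step of 0.157 and the 41 minor patterns per series come from the example; the comment lines shown above are omitted here.

import math

# Write the two sine-wave series, 41 minor patterns each, in the layout of
# sinclass.dat: x, sin(k * x), an X or H marker, and the class number.
with open('sinclass.dat', 'w') as f:
    for cls, k in ((1, 1), (2, 2)):              # class 1: sin(x), class 2: sin(2x)
        for i in range(41):
            x = i * 0.157
            marker = 'X' if i == 0 else 'H'
            f.write('%8.5f %8.5f %s  %d\n' % (x, math.sin(k * x), marker, cls))
        f.write('\n')                             # blank line between major patterns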

To get information on the training set patterns for a recurrent classification problem there is a line of buttons in the P menu-window, or the typed command is the u command. (Sorry, I've run out of great one and two letter command names! Maybe someday this will be changed.) At any rate the u command is the analog of the p command.

To get the information on the testing set patterns there is a line of buttons in the P menu-window or there is the v command.

Recurrent Network Warnings

NOTE: With a conventional network you can ask for the output for a particular pattern number, say pattern number 5 by using:

p 5
and the program loads pattern 5 onto the network and reports the output values. The program works the same way with a recurrent network, BUT the answer will not be right because it depends on hidden unit values that are not available when a single pattern is evaluated in isolation. To get the value for a single pattern within the series, use the p command, which DOES initialize the network correctly and DOES work through all the patterns in order. The same situation occurs if you want to evaluate a single test pattern with the t command.

NOTE: If your test set is a continuation of the training set then to get the correct values for the test set you must evaluate the training set first.