Error is propagated along recurrent connections only one time step
back.
Implementation
Training an SRN on prediction
The input and output layers represent a single sequence event.
During training, sequences of inputs are presented repeatedly.
On a single training trial, an event is presented to the input layer, and the network is run in the usual fashion, with the context layer treated as another input layer.
The target is the next event in the sequence. Error is back-propagated, and weights are updated using the backpropagation rule, with context-to-hidden weights treated exactly as input-to-hidden weights.
Finally the activations on the hidden layer are copied to the context layer.
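A single training trial as described above can be sketched in Python with NumPy. This is a minimal sketch, not Elman's original implementation: the class name SimpleRecurrentNetwork, the helper train_cycle, the sigmoid units, the squared-error loss, the layer sizes, the learning rate, and the three-event toy cycle are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleRecurrentNetwork:
    """Minimal Elman-style SRN (hypothetical class; sizes and init are illustrative)."""

    def __init__(self, n_in, n_hidden, n_out, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.5, (n_hidden, n_in))       # input -> hidden
        self.W_ctx = rng.normal(0.0, 0.5, (n_hidden, n_hidden))  # context -> hidden
        self.W_out = rng.normal(0.0, 0.5, (n_out, n_hidden))     # hidden -> output
        self.context = np.zeros(n_hidden)
        self.lr = lr

    def train_trial(self, event, next_event):
        # Forward pass: the context layer is treated as another input layer.
        hidden = sigmoid(self.W_in @ event + self.W_ctx @ self.context)
        output = sigmoid(self.W_out @ hidden)

        # Backward pass for squared error. Error stops at the context layer,
        # so it travels along recurrent connections only one time step back.
        delta_out = (output - next_event) * output * (1.0 - output)
        delta_hid = (self.W_out.T @ delta_out) * hidden * (1.0 - hidden)

        # Context-to-hidden weights are updated exactly like input-to-hidden weights.
        self.W_out -= self.lr * np.outer(delta_out, hidden)
        self.W_in -= self.lr * np.outer(delta_hid, event)
        self.W_ctx -= self.lr * np.outer(delta_hid, self.context)

        # Finally, copy the hidden activations to the context layer.
        self.context = hidden.copy()
        return 0.5 * np.sum((output - next_event) ** 2)

def train_cycle(epochs, seed=0):
    """Train on the toy cycle A -> B -> C -> A; return the summed error per epoch."""
    events = np.eye(3)  # one-hot events (assumed encoding)
    net = SimpleRecurrentNetwork(3, 5, 3, lr=0.5, seed=seed)
    history = []
    for _ in range(epochs):
        total = 0.0
        for i in range(3):
            total += net.train_trial(events[i], events[(i + 1) % 3])
        history.append(total)
    return history
```

Calling train_cycle(300) presents the sequence repeatedly and returns one error value per pass, which should fall as the network learns to predict the next event.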
Ideas
In many ways the input can be thought of as a function and the context as the function's argument.
A null argument means the function produces its baseline output: a point in multidimensional output space representing what would occur next, all else being equal.
A non-null argument shifts that output along one or more dimensions.
The range of outputs produced by all possible arguments to an input's function marks out the bounds, in multidimensional output space, of the partition that the input represents.
An Elman net is learning a set of functions, one function for each possible input.
These functions are then strung together by the sequence that is being processed.
The function mapping from hidden layer to output layer is constant once the weights are trained. This means we can look at the hidden layer values to identify the partitions that the various input functions map to.
Tasks
Prediction
For example, in Finding Structure in Time, one task involves predicting the next word in a sequence. Each word
belongs to a syntactic category and each sentence follows a syntactic template. The network learns the categories and templates and is able to produce a probability distribution
for the next word.
The context layer ensures that the same input word is not always treated the same way.
The information encoded in the hidden layer bears some similarity to our mental lexicon.
Different words map to different partitions in the hidden unit space, their exact position in that partition determined by the context.
Partitions end up grouped together into larger partitions based on syntactic category.
Similar words even end up near each other because they are used in similar situations.
A network that learns sequences of the form $(a^n b^n)^*$.
Output-layer recurrence: Jordan nets
Simple version: network outputs combined with decayed version of
context layer activations to yield new context layer
$c_i(t) = \mu\, c_i(t-1) + (1-\mu)\, y_i(t-1)$
where $c_i$ is the activation of the $i$th context unit, $y_i$ is the activation of the $i$th output unit, and
$\mu$ is a decay parameter between 0 and 1.
Because the input patterns change as the weights change, this is not
true gradient descent.
Teacher forcing version: use targets rather than output unit
activations for context layer
$c_i(t) = \mu\, c_i(t-1) + (1-\mu)\, \zeta_i(t-1)$
where $\zeta_i$ is a target for unit $i$.
The input patterns are constant, so this is true gradient descent, but
the result may not be as robust.
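The two context updates differ only in what is fed back, which a short NumPy sketch makes concrete. The function name update_context and the example numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def update_context(context, feedback, mu):
    """c(t) = mu * c(t-1) + (1 - mu) * feedback(t-1)."""
    return mu * context + (1.0 - mu) * feedback

mu = 0.5                         # decay parameter between 0 and 1
context = np.zeros(2)            # c(t-1)
outputs = np.array([0.8, 0.2])   # y(t-1): what the network actually produced
targets = np.array([1.0, 0.0])   # zeta(t-1): what it should have produced

context_from_outputs = update_context(context, outputs, mu)  # simple version
context_from_targets = update_context(context, targets, mu)  # teacher forcing
```

With outputs fed back, the context depends on the current weights; with targets fed back, it depends only on the training sequence.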
Time delay networks receive not only the input from the current time step but also the input from a set number of previous time steps.
Each of the connections has a different weight, so input from previous time steps can be weighted differently than the current input.
Generally used for sequence recognition tasks such as speech recognition.
One problem is deciding how many time steps back to receive input from.
Largely superseded by Elman nets, but they occasionally appear in combination with other types of recurrence in multi-recurrent networks.
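The input side of a time delay network can be sketched as a sliding window over the sequence. This is a minimal illustration, assuming zero padding before the start of the sequence; the function name delayed_inputs and the parameter n_delays are hypothetical.

```python
import numpy as np

def delayed_inputs(sequence, n_delays):
    """Stack each input with its n_delays predecessors (zero-padded at the start)."""
    sequence = np.asarray(sequence, dtype=float)
    T, n_in = sequence.shape
    padded = np.vstack([np.zeros((n_delays, n_in)), sequence])
    # Row t holds [x(t), x(t-1), ..., x(t - n_delays)] side by side.
    return np.hstack(
        [padded[n_delays - d : n_delays - d + T] for d in range(n_delays + 1)]
    )

windows = delayed_inputs([[1.0], [2.0], [3.0]], n_delays=2)
```

Each delayed copy occupies its own columns, so a weight matrix applied to a row gives every delay its own weights. The parameter n_delays is the quantity that has to be chosen in advance.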
Back-propagation with recurrence
Back-propagation through time
Gradient descent in a network with hidden-layer recurrence
The recurrent network is unfolded in time, with a separate copy of each unit for every time
step in the sequence to maintain the activations of the "real" units.
The weights in the unfolded network are constrained to be
independent of time.
Weight changes on connections are accumulated as sequence is
presented, and weights are updated at end of sequence.
Training can be done after each input or only after certain inputs. So, you can wait to train until the network has seen enough information to potentially be successful.
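The unfolding and weight accumulation can be sketched as follows. This is a minimal sketch under stated assumptions: tanh hidden units, linear outputs, squared error with a target at every time step, and hypothetical function names bptt_gradients and sequence_loss.

```python
import numpy as np

def bptt_gradients(W_in, W_rec, W_out, inputs, targets):
    """Gradients of summed squared error for h(t) = tanh(W_in x(t) + W_rec h(t-1))."""
    T = len(inputs)
    n_hidden = W_rec.shape[0]
    # Forward pass: keep one copy of the hidden activations per time step.
    hs = [np.zeros(n_hidden)]            # hs[t + 1] is h(t); hs[0] is the initial state
    ys = []
    for t in range(T):
        hs.append(np.tanh(W_in @ inputs[t] + W_rec @ hs[-1]))
        ys.append(W_out @ hs[-1])

    # Backward pass: the same weights are shared by every time step, so
    # their gradients are accumulated over the whole sequence.
    g_in, g_rec, g_out = (np.zeros_like(W) for W in (W_in, W_rec, W_out))
    dh_next = np.zeros(n_hidden)         # error arriving from step t + 1
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]
        g_out += np.outer(dy, hs[t + 1])
        dh = W_out.T @ dy + dh_next
        da = dh * (1.0 - hs[t + 1] ** 2) # back through tanh
        g_in += np.outer(da, inputs[t])
        g_rec += np.outer(da, hs[t])
        dh_next = W_rec.T @ da
    return g_in, g_rec, g_out

def sequence_loss(W_in, W_rec, W_out, inputs, targets):
    """Summed squared error over the sequence, used to check the gradients."""
    h = np.zeros(W_rec.shape[0])
    loss = 0.0
    for x, z in zip(inputs, targets):
        h = np.tanh(W_in @ x + W_rec @ h)
        loss += 0.5 * np.sum((W_out @ h - z) ** 2)
    return loss
```

The weights would then be updated once, at the end of the sequence, by subtracting a learning rate times each accumulated gradient. To train only after certain inputs, the error term dy would be set to zero at the steps that have no target.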
Real-time recurrent learning
Goal: gradient descent, on-line sequence learning in general
recurrent networks without duplicating units
Principle: Propagate gradient information forward instead of propagating error backwards
Activation rule
where ξi is an external input to unit i.
Error
where ζk is a target for unit k.
Learning rule
This relates the derivatives of activations with respect to weights
at time t to those at time t-1.
We can calculate them by iterating forward from the initial
condition
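One standard way to write the three rules named above, given as a sketch. Only $\xi_i$ and $\zeta_k$ are taken from the text; the activation function $g$, net input $h_i$, unit activations $V_i$, weights $w_{ij}$, learning rate $\eta$, and Kronecker delta $\delta_{ip}$ are assumed notation.

```latex
\begin{align}
V_i(t) &= g\bigl(h_i(t-1)\bigr), \qquad
  h_i(t) = \sum_j w_{ij} V_j(t) + \xi_i(t) \\
E_k(t) &= \begin{cases}
  \zeta_k(t) - V_k(t) & \text{if unit } k \text{ has a target at time } t \\
  0 & \text{otherwise}
\end{cases} \\
\Delta w_{pq}(t) &= \eta \sum_k E_k(t)\, \frac{\partial V_k(t)}{\partial w_{pq}} \\
\frac{\partial V_i(t)}{\partial w_{pq}} &= g'\bigl(h_i(t-1)\bigr)
  \Bigl[ \delta_{ip} V_q(t-1) + \sum_j w_{ij} \frac{\partial V_j(t-1)}{\partial w_{pq}} \Bigr] \\
\frac{\partial V_i(0)}{\partial w_{pq}} &= 0
\end{align}
```

The fourth line is the recursion that carries gradient information forward in time, and the last line is the initial condition it starts from.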
The derivatives of the output activations with respect to the weights represent the sensitivity of that output to changes in that weight.
At each time step
Update the activations of the units.
Calculate the error for each unit.
Calculate the derivatives, using the stored derivatives from the
previous time step.
Calculate the weight changes for the time step.
At the end of the whole time interval, update the weights using
the sum of the weight changes for each time step.
For N fully recurrent units, there are
$N^3$ derivatives to maintain (one for each of the N units with respect to each of the $N^2$ weights).
The algorithm is not local; updating each weight requires access to information about all of the other weights in the network.
Updating each derivative takes time proportional to N, so updating all of them takes time
proportional to $N^4$ per time step, but the computation is highly parallelizable.
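The per-time-step procedure and its storage cost can be sketched in NumPy. This is a minimal sketch under stated assumptions: tanh units, every unit receiving an external input, NaN marking a missing target, and hypothetical function names rtrl_sequence and sequence_error.

```python
import numpy as np

def rtrl_sequence(W, inputs, targets, eta=0.1):
    """One pass over a sequence; returns the updated weights.

    W[i, j] is the weight from unit j to unit i. targets[t][k] is NaN
    where unit k has no target at time t (an assumed convention).
    """
    N = W.shape[0]
    idx = np.arange(N)
    V = np.zeros(N)              # unit activations
    P = np.zeros((N, N, N))      # P[i, p, q] = dV_i / dW[p, q]: N^3 numbers
    dW = np.zeros_like(W)
    for xi, zeta in zip(inputs, targets):
        V_prev, V = V, np.tanh(W @ V + xi)           # 1. update activations
        E = np.where(np.isnan(zeta), 0.0, zeta - V)  # 2. error for each unit
        # 3. Derivatives from the stored ones; this contraction is the N^4 step.
        P_new = np.einsum('ij,jpq->ipq', W, P)
        P_new[idx, idx, :] += V_prev                 # the delta_ip * V_q term
        P = (1.0 - V ** 2)[:, None, None] * P_new    # tanh'(h) = 1 - tanh(h)^2
        dW += eta * np.einsum('k,kpq->pq', E, P)     # 4. weight changes
    return W + dW                # update at the end of the whole interval

def sequence_error(W, inputs, targets):
    """Summed squared error with fixed weights, used to check the gradient."""
    V = np.zeros(W.shape[0])
    total = 0.0
    for xi, zeta in zip(inputs, targets):
        V = np.tanh(W @ V + xi)
        E = np.where(np.isnan(zeta), 0.0, zeta - V)
        total += 0.5 * np.sum(E ** 2)
    return total
```

No copies of the units are kept across time steps; the only state carried forward is V and the derivative array P.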
Teacher forcing
Correct the activations of units which are wrong after computing
error and derivatives.
Set derivatives for all units that were forced to targets to 0 after weight
changes are computed.
Handling rhythmic patterns
Tasks
"Event" detection
Periodicity detection
Meter and downbeat detection
Learning sensitivity to multiple meters
Learning rhythmic patterns superimposed on meters
Oscillators: units with built-in periodicities and phases,
activated by an input to the extent that it is aligned with the unit's
pulse
Incremental activation function
Hierarchies of connected oscillators with periodicities that are multiples of one
another
Meter (hierarchical periodic structure) detection: simultaneously firing oscillators reinforce one another
Adaptive oscillators
Oscillators that adjust their phase angles toward an input beat
Oscillators that adjust their periods to agree with input events
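The two kinds of adaptation can be illustrated with a toy oscillator that nudges both its phase and its period toward incoming events. This is an illustrative sketch only, not a specific published model: the class name AdaptiveOscillator, the linear correction rule, and the gain values are all assumptions.

```python
class AdaptiveOscillator:
    """Toy beat tracker (hypothetical); the update rule and gains are illustrative."""

    def __init__(self, period, phase_gain=0.5, period_gain=0.3):
        self.period = period
        self.next_pulse = 0.0            # time of the next expected pulse
        self.phase_gain = phase_gain
        self.period_gain = period_gain

    def hear(self, event_time):
        # Signed distance from the event to the nearest expected pulse.
        half = self.period / 2.0
        error = (event_time - self.next_pulse + half) % self.period - half
        nearest_pulse = event_time - error
        # Adjust the period to agree with the input events.
        self.period += self.period_gain * error
        # Shift the phase toward the input beat, then schedule the next pulse.
        self.next_pulse = nearest_pulse + self.phase_gain * error + self.period
        return error

osc = AdaptiveOscillator(period=0.6)
for k in range(40):
    osc.hear(0.5 * k)                    # input beat every 0.5 time units
```

In this example the oscillator starts with a period of 0.6 and is driven by a beat with period 0.5; both corrections use the same timing error, scaled by separate gains.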