Shared hardware
- Suffers from interference between tasks demanding different solutions
- Positive and negative transfer
- Spatial crosstalk: output units provide conflicting
error information to hidden units
- This is a problem when a single hidden unit (HU) connects strongly to more than one output unit (OU)
- Temporal crosstalk: a unit receives conflicting error information at
different times during training
- After being trained on task 1, the network is trained on a significantly
different task 2;
or it is trained for many trials in one region of the input space, then for many
trials in another region
- Positive transfer only when task 2 benefits from task 1
- Catastrophic interference
- HUs that have become useful in solving task 1 should not be involved in task 2
- But gradient descent preferentially modifies the weights into HUs that
are already useful, because the error backpropagated to a HU is scaled by
the weights coming out of it, and these are large for useful HUs
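
The last point can be checked numerically. The sketch below assumes sigmoid hidden units and the standard backpropagation rule, in which the error assigned to a HU is its activation derivative times the sum of the output errors weighted by its outgoing weights. The function name `hidden_delta` and all numbers are illustrative, not taken from these notes.

```python
def hidden_delta(activation, outgoing_weights, output_deltas):
    """Backpropagated error for one sigmoid hidden unit (illustrative helper)."""
    # Derivative of the sigmoid expressed through the unit's activation.
    derivative = activation * (1.0 - activation)
    # Each output error is passed back through the weight leaving the hidden unit.
    backpropagated = sum(w * d for w, d in zip(outgoing_weights, output_deltas))
    return derivative * backpropagated


# Hypothetical output errors produced by the new task 2.
output_deltas = [0.2, -0.1]

# A HU that task 1 made useful has large outgoing weights.
useful_hu = hidden_delta(0.5, [3.0, -2.0], output_deltas)
# A HU that task 1 left unused has small outgoing weights.
unused_hu = hidden_delta(0.5, [0.1, -0.1], output_deltas)

print(useful_hu, unused_hu)
```

With identical output errors, the useful HU receives a much larger error signal, so its incoming weights change most, which is the opposite of what would protect task 1.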
Modularity
- Suitability of the architecture for solving the problem and generalizing
- Modularity: separate, interfering tasks solved with separate hardware
- Can take advantage of function decomposition
Example: absolute value function, which requires no HUs if learned in
two separate networks
- Can handle both positive and negative transfer
- Generalization is better when a complex function is decomposed
into several simpler functions; global and local generalization
- Results in more localized representations which may be easier for
researchers, and for other portions of the system, to
interpret
- Facilitates the use of attentional mechanisms, which presumably
enhance or suppress the activations of some units
but not others
- Facilitates incremental development of systems (for practical purposes)
- When different dimensions are represented in separate modules,
there is a saving in hardware:
in a coarse-coded system, the number of units required is
`N^k/D^(k-1)` (N is the number of just-noticeable differences in each dimension, k the number of dimensions, D the diameter of each unit's receptive field), so the count grows rapidly with k when N > D
- Hardwired modularity: claims of innateness (nativism) in cognitive science
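
The hardware saving from separating dimensions can be made concrete with the coarse-coding formula above. This is a rough sketch: it assumes that each separate module codes exactly one dimension (k = 1), which is one reading of the notes rather than something they state, and the values of N, k and D are made up for illustration.

```python
def units_required(n_jnd, k, diameter):
    """Units needed by a coarse-coded system: N^k / D^(k-1)."""
    return n_jnd ** k / diameter ** (k - 1)


N, K, D = 20, 3, 5  # illustrative values only

# One module that represents all K dimensions jointly.
joint_units = units_required(N, K, D)

# Assumption: K separate modules, each representing a single dimension.
separate_units = sum(units_required(N, 1, D) for _ in range(K))

print(joint_units, separate_units)
```

Under these assumptions the joint representation needs 320 units and the three separate modules need 60 in total.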
Adaptive modularity: temporal crosstalk
- Jacobs, Jordan, Barto; Jacobs, Jordan, Nowlan, Hinton
- Architecture
- Output and error (two types of networks)
- Type 1: output is a combination of the outputs of the experts
- Output of network (p_i is the gating network's output for expert i)
`vec y = sum_i p_i vec y_i`
- Error function for case c (i indexes the
expert networks, j the output units; t is the target)
`E^c = 1/2 sum_j (t_j^c - sum_i p_i^c y_(ji)^c)^2`
Usually other terms are added to the error function to encourage
only one network to be on at a time.
- Type 2: output of the network is the output of one of the experts.
- The gating network makes a stochastic decision about
which expert to use, using the gating outputs as probabilities.
- Error function
`E^c = 1/2 sum_i p_i^c sum_j(t_j^c - y_(ji)^c)^2`
Each expert is held responsible for the entire output for a given pattern,
which encourages competition.
- The gating network determines not only how much (or whether) each
expert contributes to the output for a given input, but
also how much each expert learns for a given input.
- The What and Where tasks
- What: categorize input object; Where: assign location to input object
- Training: What and Where trials either randomly interleaved or blocked
- Modular architecture (type 1) is superior to a single backprop network in the
random case
- Multi-speaker vowel recognition
- Modular architecture (type 2) learns faster than a backprop network
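
The two error functions in the architecture notes above can be compared on a single training case. The following is a minimal sketch using plain Python lists. The function names and the toy numbers are assumptions made for illustration, and the gating outputs `p` are simply given rather than computed by a gating network.

```python
def blended_output(p, expert_outputs):
    """Network output as the gate-weighted sum of expert outputs: y = sum_i p_i y_i."""
    n_out = len(expert_outputs[0])
    return [sum(p_i * y_i[j] for p_i, y_i in zip(p, expert_outputs))
            for j in range(n_out)]


def cooperative_error(target, p, expert_outputs):
    """First error function: compare the target with the blended output."""
    y = blended_output(p, expert_outputs)
    return 0.5 * sum((t - y_j) ** 2 for t, y_j in zip(target, y))


def competitive_error(target, p, expert_outputs):
    """Second error function: gate-weighted sum of each expert's own error."""
    return 0.5 * sum(
        p_i * sum((t - y_ij) ** 2 for t, y_ij in zip(target, y_i))
        for p_i, y_i in zip(p, expert_outputs)
    )


target = [1.0]
expert_outputs = [[0.0], [2.0]]  # each expert alone is wrong by 1
p = [0.5, 0.5]                   # gating outputs, assumed given

e_coop = cooperative_error(target, p, expert_outputs)
e_comp = competitive_error(target, p, expert_outputs)
print(e_coop, e_comp)
```

Here the blend of two wrong experts happens to hit the target, so the first error is 0 while the second is 0.5. The second function therefore pushes each expert to produce the whole output by itself, in line with the remark that it encourages competition.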
Adaptive modularity: spatial crosstalk
- Architecture
- Equivalent to multiple modular networks each with a single gating network
- Output and error functions of type 1
- When trained on simultaneous What and Where tasks, the experts specialized,
but performance was not superior to that of a single network
- An example from language acquisition (Gasser, 1994)