Vector quantization
- Goal: represent each pattern by the class it is in, defined in terms of
a prototype "codebook" vector
- Unsupervised VQ: the prototype vectors are the weights into the category units after learning
- Supervised VQ (learning vector quantization, LVQ):
update weights into the winner depending on whether it is the correct class;
if it is correct, move it closer to the input; otherwise, move it away from the input
- LVQ2.1 (Kohonen): a refinement applied after LVQ. When the input pattern is near the boundary between two weight vectors, one of the correct class and the other of a wrong class, move the correct one closer to the input and the other one away from it, using a small learning rate.
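The two supervised update rules above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the learning rates, and the `window` test for "near the boundary" (a common formulation based on the ratio of the two distances) are assumptions, since the notes do not specify them.

```python
import math

def dist(w, x):
    """Euclidean distance between a prototype w and an input x."""
    return math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))

def move(w, x, step):
    """Move w toward x (step > 0) or away from x (step < 0), in place."""
    for i in range(len(w)):
        w[i] += step * (x[i] - w[i])

def lvq_step(prototypes, labels, x, target, eta=0.1):
    """Basic LVQ: attract the winner if its class is correct, otherwise repel it."""
    j = min(range(len(prototypes)), key=lambda k: dist(prototypes[k], x))
    move(prototypes[j], x, eta if labels[j] == target else -eta)
    return j

def lvq21_step(prototypes, labels, x, target, eta=0.02, window=0.3):
    """LVQ2.1-style update of the two nearest prototypes (assumed window test)."""
    order = sorted(range(len(prototypes)), key=lambda k: dist(prototypes[k], x))
    a, b = order[0], order[1]
    da, db = dist(prototypes[a], x), dist(prototypes[b], x)
    # x counts as "near the boundary" when the two distances are similar.
    ratio = da / db if db > 0 else 1.0
    near_border = ratio > (1 - window) / (1 + window)
    # Exactly one of the two prototypes must carry the correct class.
    one_correct = (labels[a] == target) != (labels[b] == target)
    if near_border and one_correct:
        good, bad = (a, b) if labels[a] == target else (b, a)
        move(prototypes[good], x, eta)    # correct prototype moves closer
        move(prototypes[bad], x, -eta)    # wrong prototype moves away
        return True
    return False
```

Inputs far from any class boundary leave the prototypes untouched under `lvq21_step`, which is why it is applied only as a fine-tuning pass with a small learning rate.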
Using multiple sources of information
- Self-supervision: separate sensory modalities train each other
- The Minimizing-Disagreement algorithm (de Sa and Ballard)
- Three-layer network: separate hidden layers for each input modality;
single output layer for the "labels" for the hidden-layer codebook vectors
- First train hidden and output layers using competitive learning
- Disagreement error: number of patterns classified differently by the two modality
subnetworks
- Minimizing disagreement error (adaptation of LVQ2.1)
- Alternate teaching and taught modalities.
- For a given pattern, find the output label for the teaching modality only.
- Within the taught modality's hidden layer, find two winning units.
- If the input pattern is close to the border between the two winners in the taught modality,
and one agrees with the output winner,
move the codebook vector of the agreeing unit toward the input pattern, the other away from the input pattern.
- Increase the weight from the taught winning unit to the winning output unit.
- Normalize the hidden-to-output weight vectors.
- Tested on the problem of learning to recognize consonant-vowel utterances on the basis of visual and auditory information: performance is comparable to supervised LVQ2.1
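One update of the taught modality in the minimizing-disagreement procedure might look like the sketch below. Several details are assumptions made for illustration and are not given in the notes: the border test (an LVQ2.1-style distance-ratio window), reading a hidden unit's label as its strongest hidden-to-output weight, normalizing only the updated weight vector to unit length, and all names and parameter values.

```python
import math

def dist(w, x):
    """Euclidean distance between a codebook vector w and an input x."""
    return math.sqrt(sum((wi - xi) ** 2 for wi, xi in zip(w, x)))

def md_step(codebook, out_w, x, teach_label, eta=0.05, gain=0.1, window=0.3):
    """One update of the taught modality.

    codebook[j]  codebook vector of hidden unit j in the taught modality
    out_w[j][c]  weight from hidden unit j to output unit c
    teach_label  index of the output winner found from the teaching modality
    """
    def label(j):
        # Assumption: a hidden unit's label is its strongest output connection.
        return max(range(len(out_w[j])), key=lambda c: out_w[j][c])

    # Two winning hidden units in the taught modality.
    order = sorted(range(len(codebook)), key=lambda j: dist(codebook[j], x))
    a, b = order[0], order[1]
    da, db = dist(codebook[a], x), dist(codebook[b], x)
    ratio = da / db if db > 0 else 1.0
    near_border = ratio > (1 - window) / (1 + window)

    agree_a, agree_b = label(a) == teach_label, label(b) == teach_label
    if near_border and agree_a != agree_b:
        good, bad = (a, b) if agree_a else (b, a)
        for i in range(len(x)):
            codebook[good][i] += eta * (x[i] - codebook[good][i])  # toward x
            codebook[bad][i] -= eta * (x[i] - codebook[bad][i])    # away from x

    # Strengthen the link from the taught winner to the winning output unit,
    # then renormalize that unit's outgoing weight vector.
    out_w[a][teach_label] += gain
    norm = math.sqrt(sum(w * w for w in out_w[a]))
    out_w[a] = [w / norm for w in out_w[a]]
    return a
```

In a full training loop the two modalities would swap the teaching and taught roles, with `teach_label` computed from the other modality's subnetwork each time.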
- Stability-plasticity dilemma
- No guarantee of stability when learning is incremental (rather than
batch)
- We could reduce the learning rate during training to bring about
stability.
- But then the network loses the ability to respond to new data
(plasticity).
- Solution in ART: use output units only when needed; accept and
adapt a stored prototype only when input is similar enough
to it, dependent on vigilance (ρ). High
vigilance leads to many finely divided categories, low
vigilance to coarse categorization.
- When there is no prototype sufficiently similar, select an
"uncommitted" output unit if there are any left. If not,
give no response.
- ART1
- Input vectors and weights binary (0/1). Output units are
enabled or disabled.
Initially all weights are 1, and output units are uncommitted.
- For each presented pattern,
- Enable all output units.
- Find the winner among enabled output units (exit if none left),
the one for which the bits in the input pattern match the greatest proportion
of weights in the output unit, that is, the one for which
`bar(vec w)_j * vec x`
is largest, where `vec x` is the input pattern and
`bar(vec w)_j = (vec w_j) / (epsilon + sum_i w_{ji})`, that is, the normalized
weight vector into unit j.
The ε breaks ties.
- Test whether the match between `vec x` and `vec w_{j text(*)}`
is good enough:
`r = (vec w_{j text(*)} * vec x) / (sum_i x_i)`
(the fraction of bits in the input pattern that are in the weight vector).
If `r >= rho`, there is resonance; go on to the weight-adjustment step.
Otherwise, disable unit j* and return to the winner-selection step.
- Adjust the winning vector by deleting any bits in it that are not
also in `vec x` (masking).
- Plasticity as long as there are unused output units.
- Stability comes from the fact that the learning rule can only
remove bits from the prototype vector.
- Example (from Fausett)
- Pattern 1
- Pattern 2
- Pattern 3
- Pattern 4
- Pattern 4 with a different vigilance
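The ART1 procedure described in this section can be sketched in a few lines. The function and variable names, the value of `eps`, and the assumption that every input pattern has at least one bit set are illustrative choices, not part of the notes.

```python
def overlap(w, x):
    """Number of bits set in both binary vectors."""
    return sum(wi & xi for wi, xi in zip(w, x))

def art1(patterns, n_units, rho, eps=0.5):
    """Cluster binary patterns; returns (assignments, weights).

    An assignment of None means no unit resonated and none was left to try.
    Assumes every pattern contains at least one 1.
    """
    n = len(patterns[0])
    weights = [[1] * n for _ in range(n_units)]   # all-ones means uncommitted
    assignments = []
    for x in patterns:
        enabled = list(range(n_units))            # enable all output units
        chosen = None
        while enabled:
            # Winner: largest overlap with the normalized weight vector.
            j_star = max(
                enabled,
                key=lambda j: overlap(weights[j], x) / (eps + sum(weights[j])),
            )
            # Vigilance test: fraction of input bits present in the prototype.
            r = overlap(weights[j_star], x) / sum(x)
            if r >= rho:
                # Resonance: mask the prototype with the input.
                weights[j_star] = [wi & xi for wi, xi in zip(weights[j_star], x)]
                chosen = j_star
                break
            enabled.remove(j_star)                # disable and search again
        assignments.append(chosen)
    return assignments, weights

patterns = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
print(art1(patterns, n_units=3, rho=0.7)[0])   # [0, 1, 2]
print(art1(patterns, n_units=3, rho=0.5)[0])   # [0, 1, 0]
```

With the higher vigilance the third pattern fails the match test against the first prototype (2 of its 3 bits match) and claims a fresh unit; with the lower vigilance it is absorbed into the first category.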
Networks that learn principal components (Sanger)
- Find M orthogonal vectors in data space that account for
as much as possible of the data's variance. Projecting
the data from the original N-dimensional space
onto the M-dimensional subspace spanned by the
vectors performs dimensionality reduction.
- Linear output units
- Generalized Hebbian Algorithm (Sanger's Rule): similar to Oja's Rule, except that weighted outputs up to the
output unit in question are subtracted out.
`Delta w_(ji) = eta y_j(x_i - sum_(k=1)^j y_k w_(ki))`
- Weight vectors converge to orthogonal unit vectors, the first
M principal component directions
- Rule is not local, but a local implementation is possible (with
duplicate weights).
`Delta w_(ji) = eta y_j[(x_i - sum_(k=1)^(j-1) y_k w_(ki)) - y_jw_(ji)]`
- Encoding an image
- Trained on blocks in image corresponding to input unit size N.
- The weights into the M output units represent masks over the image blocks.
- For each image block, read off the values of the output units, and quantize these
with a number of bits proportional to the log of the variance of each output unit over
the whole image.
- Image represented in compressed form as quantized output unit values, M for each block,
giving a reduced number of bits per pixel overall.
- Test on untrained image.
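As a small illustration of Sanger's rule from this section, `Delta w_(ji) = eta y_j(x_i - sum_(k=1)^j y_k w_(ki))`, the sketch below trains two linear output units on synthetic two-dimensional data. The data generator, initial weights, learning rate, and function names are all assumptions chosen for the example; it does not reproduce the image-coding experiment.

```python
import random

def make_data(n, seed=0):
    """Zero-mean 2-D samples: std 2 along (1, 1)/sqrt(2), std 0.5 along (1, -1)/sqrt(2)."""
    rng = random.Random(seed)
    s = 2 ** -0.5
    data = []
    for _ in range(n):
        a, b = rng.gauss(0, 2.0), rng.gauss(0, 0.5)
        data.append([s * (a + b), s * (a - b)])
    return data

def gha_step(W, x, eta):
    """One Sanger's-rule update of W (M rows, each of length N), in place."""
    y = [sum(w * xi for w, xi in zip(row, x)) for row in W]   # linear outputs
    new_W = []
    for j, row in enumerate(W):
        new_row = []
        for i, xi in enumerate(x):
            # Subtract the weighted outputs of units 1..j (inclusive).
            back = sum(y[k] * W[k][i] for k in range(j + 1))
            new_row.append(row[i] + eta * y[j] * (xi - back))
        new_W.append(new_row)
    W[:] = new_W

def train_gha(data, eta=0.005):
    """Train two output units from a fixed, arbitrary starting point."""
    W = [[0.3, 0.1], [0.1, -0.2]]
    for x in data:
        gha_step(W, x, eta)
    return W

W = train_gha(make_data(40000))
print(W)   # rows should be close to +-(1, 1)/sqrt(2) and +-(1, -1)/sqrt(2)
```

All updates in a step are computed from the old weights before any are written back, so the result does not depend on the order in which the output units are processed.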
Auto-association
- Feedforward network with input and output layers the same size and a smaller hidden layer.
- Target is the same as the input pattern (perhaps a noisy version of it).
- After training,
- Hidden-layer patterns are compressed representations of the
inputs.
- The input-hidden layer subnetwork is a pattern encoder.
- The hidden-output layer subnetwork is a pattern decoder.
- With linear hidden units, auto-association implements PCA, except that the resulting dimensions may not be orthogonal.
- Application to emotion recognition (Dailey et al.): filtered images of faces are converted to a compressed representation using auto-association (or PCA), and a simple classifier is trained to classify these representations as one of the six basic emotions
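To make the encoder/decoder picture concrete, here is a minimal sketch of a linear auto-associator with two inputs, one hidden unit, and two outputs, trained by stochastic gradient descent on squared reconstruction error. The synthetic data, initial weights, learning rate, and function names are assumptions for illustration; it is unrelated to the face-image experiments.

```python
import random

def make_data(n, seed=0):
    """Zero-mean 2-D samples: std 2 along (1, 1)/sqrt(2), std 0.5 along (1, -1)/sqrt(2)."""
    rng = random.Random(seed)
    s = 2 ** -0.5
    data = []
    for _ in range(n):
        a, b = rng.gauss(0, 2.0), rng.gauss(0, 0.5)
        data.append([s * (a + b), s * (a - b)])
    return data

def mean_error(enc, dec, data):
    """Mean squared reconstruction error of the 2-1-2 linear network."""
    total = 0.0
    for x in data:
        h = sum(e * xi for e, xi in zip(enc, x))                 # encode
        total += sum((d * h - xi) ** 2 for d, xi in zip(dec, x)) # decode, compare
    return total / len(data)

def train_autoencoder(data, eta=0.002):
    """Train with the target equal to the input; returns (enc, dec) weights."""
    enc = [0.3, 0.1]   # input-to-hidden weights (arbitrary start)
    dec = [0.2, 0.1]   # hidden-to-output weights (arbitrary start)
    for x in data:
        h = sum(e * xi for e, xi in zip(enc, x))        # hidden activation
        err = [d * h - xi for d, xi in zip(dec, x)]     # output minus target
        back = sum(d * r for d, r in zip(dec, err))     # error at the hidden unit
        dec = [d - eta * 2 * r * h for d, r in zip(dec, err)]
        enc = [e - eta * 2 * back * xi for e, xi in zip(enc, x)]
    return enc, dec

data = make_data(30000)
enc, dec = train_autoencoder(data)
print(mean_error([0.3, 0.1], [0.2, 0.1], data), mean_error(enc, dec, data))
```

After training, the single hidden unit should capture most of the variance, so the remaining error is dominated by the low-variance direction, which is the sense in which a linear auto-associator recovers the principal subspace.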
Distributed representations and structure-sensitive operations (Chalmers)
- Ternary RAAM network trained to encode active and passive sentences.
- Feedforward network trained to associate distributed representations of active sentences with distributed representations of corresponding passive sentences.
- Feedforward network tested on novel sentences.
- With RAAM trained on all of the sentences, the feedforward network exhibited 100% generalization to novel sentences.