Principal Component Analysis
- Principal component analysis (PCA) is an orthogonal linear transformation that constructs a new coordinate system such that the dimensions are ordered by how much variance they account for in the data.
- Formally, each dimension in the PCA-transformed data is an eigenvector of the covariance matrix of the data in the original dimensions, and its eigenvalue is the variance of the data along that direction, that is, its contribution to the overall variance.
- Each eigenvector specifies a way to convert a point in the original space to a value along one of the new dimensions: take the dot product of the (mean-centered) point with that eigenvector.
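The description above can be checked numerically on a small example. The following is a minimal sketch, not a general PCA routine: it handles only 2-D data, using the closed-form eigendecomposition of a symmetric 2x2 matrix. The synthetic data set and the helper names (`covariance_2d`, `eig_sym_2x2`, `project`, `variance`) are assumptions made for illustration.

```python
import math
import random

def covariance_2d(points):
    """Sample mean and 2x2 sample covariance matrix of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), [[sxx, sxy], [sxy, syy]]

def eig_sym_2x2(c):
    """Eigenvalues (largest first) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = c[0][0], c[0][1], c[1][1]
    centre = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    theta = 0.5 * math.atan2(2.0 * b, a - d)  # angle of the leading eigenvector
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))
    return (centre + radius, centre - radius), (v1, v2)

def project(point, mean, axis):
    """Coordinate of a point along one new dimension: centred dot product."""
    return (point[0] - mean[0]) * axis[0] + (point[1] - mean[1]) * axis[1]

def variance(values):
    """Sample variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

# Synthetic data (an assumption): std 3 along a 30-degree axis, std 1 across it.
random.seed(0)
angle = math.radians(30)
points = []
for _ in range(2000):
    s, t = random.gauss(0, 3.0), random.gauss(0, 1.0)
    points.append((s * math.cos(angle) - t * math.sin(angle),
                   s * math.sin(angle) + t * math.cos(angle)))

mean, cov = covariance_2d(points)
(lam1, lam2), (v1, v2) = eig_sym_2x2(cov)
pc1 = [project(p, mean, v1) for p in points]  # values along the first new dimension
pc2 = [project(p, mean, v2) for p in points]  # values along the second new dimension

print("eigenvalues:", lam1, lam2)
print("variance along new dimensions:", variance(pc1), variance(pc2))
print("first axis angle (degrees):", math.degrees(math.atan2(v1[1], v1[0])))
```

The variance of the projected values along each new dimension should match the corresponding eigenvalue, and the two eigenvalues should sum to the total variance of the original data.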
Networks that learn the first principal component (Oja)
- Inputs drawn from some probability distribution.
Single output unit should tell us how well a given input conforms to the
distribution.
- Output unit is linear.
`y = sum_(i=1)^N w_i x_i`
- Simple Hebbian learning is the right idea (frequent input
patterns have the most influence and produce the largest
output), but the weights grow without bound.
`Delta w_i = eta yx_i`
- Adding weight decay proportional to `y^2` (the "forgetting" term)
moves the weights toward unit length (reverse delta rule;
Oja's rule).
`Delta w_i = eta y(x_i - yw_i)`
- Output is the component of the input along the direction of the
weight vector.
- The weight vector converges to the direction that maximizes the variance
of the output, i.e. the first principal component of the data.
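Oja's rule above can be tried out on synthetic data. This is a minimal sketch under stated assumptions: the data are zero-mean 2-D Gaussian samples with most of their variance along a 30-degree axis, and the learning rate, number of epochs, and the function name `oja_train` are illustrative choices, not values from the notes.

```python
import math
import random

def oja_train(samples, eta=0.002, epochs=20, seed=0):
    """Train one linear unit with Oja's rule and return its weight vector."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(samples[0]))]
    for _ in range(epochs):
        for x in samples:
            # Linear output: y = sum_i w_i x_i
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Oja's rule: Delta w_i = eta * y * (x_i - y * w_i)
            w = [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w

# Synthetic zero-mean data (an assumption): std 3 along a 30-degree axis, std 1 across it.
random.seed(1)
angle = math.radians(30)
samples = []
for _ in range(1000):
    s, t = random.gauss(0, 3.0), random.gauss(0, 1.0)
    samples.append((s * math.cos(angle) - t * math.sin(angle),
                    s * math.sin(angle) + t * math.cos(angle)))

w = oja_train(samples)
norm = math.sqrt(sum(wi * wi for wi in w))
# |cosine| between the learned weights and the true high-variance axis
alignment = abs(w[0] * math.cos(angle) + w[1] * math.sin(angle)) / norm

print("weights:", w)
print("weight norm:", norm)
print("alignment with principal axis:", alignment)
```

With a small learning rate the weight norm should settle near 1 and the alignment near 1. The sign of the weight vector is arbitrary, which is why the absolute value of the cosine is reported.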
Networks that learn principal components (Sanger)
- Find M orthogonal vectors in data space that account for
as much as possible of the data's variance. Projecting
the data from the original N-dimensional space
onto the M-dimensional subspace spanned by the
vectors performs dimensionality reduction.
- Linear output units
- Generalized Hebbian Algorithm (Sanger's Rule): similar to Oja's Rule, except that weighted outputs up to the
output unit in question are subtracted out.
`Delta w_(ji) = eta y_j(x_i - sum_(k=1)^j y_k w_(ki))`
- Weight vectors converge to orthogonal unit vectors: the first
M principal component directions, in order of decreasing variance.
- Rule is not local, but a local implementation is possible (with
duplicate weights).
`Delta w_(ji) = eta y_j[(x_i - sum_(k=1)^(j-1) y_k w_(ki)) - y_jw_(ji)]`
- Encoding an image
- Trained on blocks of the image, each block containing N pixels (one per input unit).
- The weights into the M output units represent masks over the image blocks.
- For each image block, read off the values of the output units, and quantize these
with a number of bits proportional to the log of the variance of each output unit over
the whole image.
- Image represented in compressed form as quantized output unit values, M for each block,
each with reduced number of bits per pixel.
- Test on an image that was not used in training.
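The Generalized Hebbian Algorithm described in this section can be sketched in a few lines. This is a minimal illustration using the non-local form of the update. It assumes zero-mean synthetic 3-D data with known principal directions rather than image blocks, and the learning rate, epoch count, and the name `sanger_train` are illustrative choices.

```python
import math
import random

def sanger_train(samples, m, eta=0.0005, epochs=40, seed=0):
    """Train m linear output units with Sanger's rule; returns an m x n weight matrix."""
    rng = random.Random(seed)
    n = len(samples[0])
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(m)]
    for _ in range(epochs):
        for x in samples:
            y = [sum(wji * xi for wji, xi in zip(wj, x)) for wj in w]
            new_w = []
            for j in range(m):
                row = []
                for i in range(n):
                    # Weighted outputs of units 1..j (inclusive) are subtracted out
                    recon = sum(y[k] * w[k][i] for k in range(j + 1))
                    # Delta w_ji = eta * y_j * (x_i - sum_{k<=j} y_k w_ki)
                    row.append(w[j][i] + eta * y[j] * (x[i] - recon))
                new_w.append(row)
            w = new_w
    return w

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

# Synthetic data (an assumption): independent Gaussian coefficients with
# standard deviations 3, 1.5 and 0.5 along three orthonormal directions.
r = 1.0 / math.sqrt(2.0)
basis = [(r, r, 0.0), (-r, r, 0.0), (0.0, 0.0, 1.0)]
stds = (3.0, 1.5, 0.5)
random.seed(2)
samples = []
for _ in range(1000):
    coeffs = [random.gauss(0, s) for s in stds]
    samples.append(tuple(sum(c * u[i] for c, u in zip(coeffs, basis)) for i in range(3)))

W = sanger_train(samples, m=2)
norms = [math.sqrt(dot(row, row)) for row in W]
align_first = abs(dot(W[0], basis[0])) / norms[0]
align_second = abs(dot(W[1], basis[1])) / norms[1]

print("weight norms:", norms)
print("alignment with 1st and 2nd principal directions:", align_first, align_second)
print("dot product of the two weight vectors:", dot(W[0], W[1]))
```

The two weight vectors should end up close to unit length, nearly orthogonal, and aligned (up to sign) with the two highest-variance directions in order. Projecting each sample onto them gives the M-dimensional reduced representation.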
Auto-association
- Feedforward network with input and output layers the same size and a smaller hidden layer.
- Target is the same as the input pattern (perhaps a noisy version of it).
- After training,
- Hidden-layer patterns are compressed representations of the
inputs.
- The input-hidden layer subnetwork is a pattern encoder.
- The hidden-output layer subnetwork is a pattern decoder.
- With linear hidden units, auto-association implements PCA in the sense that the hidden units span the same subspace as the leading principal components, except that the resulting dimensions may not be orthogonal.
- Application to emotion recognition (Dailey et al.): filtered images of faces are converted to a compressed representation using auto-association (or PCA), and a simple classifier is trained to classify these representations into one of the six basic emotions.
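The link between linear auto-association and PCA can be illustrated with the smallest possible case. This is a minimal sketch, assuming a 2-1-2 linear network trained by stochastic gradient descent on squared reconstruction error. The synthetic data, learning rate, epoch count, and the name `train_linear_autoencoder` are illustrative assumptions. With a single hidden unit the orthogonality caveat does not arise, so the sketch only shows that the learned code direction matches the first principal component.

```python
import math
import random

def train_linear_autoencoder(samples, lr=0.001, epochs=30, seed=0):
    """SGD on squared reconstruction error for an n-1-n linear network."""
    rng = random.Random(seed)
    n = len(samples[0])
    enc = [rng.uniform(-0.1, 0.1) for _ in range(n)]  # input -> hidden weights (encoder)
    dec = [rng.uniform(-0.1, 0.1) for _ in range(n)]  # hidden -> output weights (decoder)
    for _ in range(epochs):
        for x in samples:
            h = sum(e * xi for e, xi in zip(enc, x))       # hidden activation (the code)
            err = [d * h - xi for d, xi in zip(dec, x)]    # output minus target (target = input)
            back = sum(d * e for d, e in zip(dec, err))    # error propagated to the hidden unit
            dec = [d - lr * e * h for d, e in zip(dec, err)]
            enc = [w - lr * back * xi for w, xi in zip(enc, x)]
    return enc, dec

def cosine_with_axis(v, axis):
    """Absolute cosine between vector v and a unit-length axis."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return abs(sum(vi * ai for vi, ai in zip(v, axis))) / norm

# Synthetic zero-mean data (an assumption): std 3 along a 30-degree axis, std 1 across it.
random.seed(3)
angle = math.radians(30)
axis = (math.cos(angle), math.sin(angle))
samples = []
for _ in range(1000):
    s, t = random.gauss(0, 3.0), random.gauss(0, 1.0)
    samples.append((s * axis[0] - t * axis[1], s * axis[1] + t * axis[0]))

enc, dec = train_linear_autoencoder(samples)

# Mean squared reconstruction error over the training set
mse = 0.0
for x in samples:
    h = sum(e * xi for e, xi in zip(enc, x))
    mse += sum((d * h - xi) ** 2 for d, xi in zip(dec, x))
mse /= len(samples)

print("encoder alignment with principal axis:", cosine_with_axis(enc, axis))
print("decoder alignment with principal axis:", cosine_with_axis(dec, axis))
print("mean squared reconstruction error:", mse)
```

After training, both the encoder and decoder weight vectors should point along the high-variance axis, and the remaining reconstruction error should be close to the variance along the discarded direction (about 1 for this data).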
Distributed representations and structure-sensitive operations (Chalmers)
- Ternary RAAM network trained to encode active and passive sentences.
- Feedforward network trained to associate distributed representations of active sentences with distributed representations of the corresponding passive sentences.
- Feedforward network tested on novel sentences.
- With RAAM trained on all of the sentences, the feedforward network exhibited 100% generalization to novel sentences.