Learning parameters and training
- Pattern update
- Does not implement true gradient descent
- With randomized order, makes the search through weight space stochastic
- Momentum
- How gradient descent is sensitive to step size
- What momentum is
`Delta w_(ji)(n + 1) = eta delta_j^p x_i^p + alpha Delta w_(ji)(n)`
- How momentum helps
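The momentum rule above can be sketched in a few lines. This is a minimal illustration, not a full training loop: `grad_term` stands in for the product `delta_j^p x_i^p`, and the values `eta = 0.1` and `alpha = 0.9` are assumptions chosen for the demonstration.

```python
def momentum_step(grad_term, prev_delta, eta=0.1, alpha=0.9):
    """New weight change: eta * grad_term + alpha * previous weight change."""
    return eta * grad_term + alpha * prev_delta

# Constant gradient term (a long, shallow slope): the step grows
# toward eta / (1 - alpha) times the gradient term.
delta = 0.0
for _ in range(200):
    delta = momentum_step(1.0, delta)

# Gradient term that flips sign every step (oscillation across a ravine):
# the step shrinks toward eta / (1 + alpha) in magnitude.
osc = 0.0
for n in range(200):
    osc = momentum_step(1.0 if n % 2 == 0 else -1.0, osc)
```

Under these assumptions the effective step is amplified where successive gradients agree and damped where they alternate, which is one way to read "how momentum helps".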
- Activation functions
- Sigmoid
- Derivative is maximum in the middle; approaches zero at extremes
- Extreme output values are only approached asymptotically and cannot be reached exactly
- Responds to all inputs above some value; cannot respond selectively to a particular range of inputs
- Gaussian: responds to inputs in a particular range (common in radial basis function networks)
- Error functions
- Usual quadratic error function
- Cross-entropy
- Based on the conditional probability of the targets, given the inputs
`E = -sum_p ln P(vec t^p|vec x^p)`
- Making certain assumptions about the distribution of the target elements, we get this error function
`E_(CE) = -sum_p sum_j [t_j ln y_j + (1 - t_j) ln(1-y_j)]`
- Eliminates the derivative of the activation function from the error term
for output units, so a saturated sigmoid no longer stalls learning
`(del E_(CE))/(del w_(ji)) = -x_i (t_j - y_j)`
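A small sketch can make the difference between the two error functions concrete for a single sigmoid output unit with net input `h` and output `y = sigmoid(h)`. The function names and the test values (`h = -10`, target 1) are assumptions for illustration. The quadratic gradient carries the factor `y (1 - y)`, while the cross-entropy gradient is `-x_i (t - y)`.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def grad_quadratic(x_i, t, h):
    """dE/dw_i for E = 0.5 * (t - y)^2 with y = sigmoid(h)."""
    y = sigmoid(h)
    return -(t - y) * y * (1.0 - y) * x_i

def grad_cross_entropy(x_i, t, h):
    """dE/dw_i for E = -[t ln y + (1 - t) ln(1 - y)] with y = sigmoid(h)."""
    y = sigmoid(h)
    return -(t - y) * x_i

# A saturated unit that is badly wrong: output near 0, target 1.
gq = grad_quadratic(1.0, 1.0, -10.0)
gc = grad_cross_entropy(1.0, 1.0, -10.0)
```

In this setting the quadratic gradient is almost zero although the error is large, whereas the cross-entropy gradient stays close to the full error `t - y`.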
- Adaptive learning rate
- Learning rate may be allowed to change with error
`Delta eta = {(+a,if Delta E < 0, text(consistently)),(-b eta,if Delta E > 0,), (0, text(otherwise),):}`
- Learning rate can also vary with patterns or with connections
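The error-driven rule for the learning rate might be sketched as follows. The notes do not define "consistently", so this sketch assumes it means that the last `k` error changes were all negative; `k`, `a`, and `b` are illustrative values, and `adapt_eta` is a hypothetical helper name.

```python
def adapt_eta(eta, recent_dE, a=0.01, b=0.5, k=3):
    """Return the adjusted learning rate.

    recent_dE holds the most recent changes in error, newest last.
    """
    if not recent_dE:
        return eta
    if recent_dE[-1] > 0:
        # Error went up: shrink eta in proportion to its current value.
        return eta - b * eta
    if len(recent_dE) >= k and all(d < 0 for d in recent_dE[-k:]):
        # Error has fallen k times in a row (assumed meaning of "consistently").
        return eta + a
    return eta

grown = adapt_eta(0.1, [-0.3, -0.2, -0.1])
shrunk = adapt_eta(0.1, [-0.3, 0.2])
unchanged = adapt_eta(0.1, [-0.3])
```

The increase is additive and the decrease multiplicative, so the rate grows cautiously and is cut back quickly once the error rises.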
- Dealing with local minima
- Initial weights
- Zero: weights never change
- Random, too large: sigmoids saturate early
- Best: random such that magnitude of input to units is close to 1
- Minima in which errors for different patterns compensate for each other; noise helps escape from such minima
- Randomized pattern update
- Noisy weights
- Noisy training inputs
- Overfitting: a network with too many weights may succeed on
the training set but fail to generalize
- Weight decay: optimizing an architecture by pruning unimportant
weights
- Penalty added to cost function
`E = E_0 + ½ epsilon sum_i sum_j w_(ji)^2`
- Weight update
`w_(ji)^(text(new)) = (1-epsilon) w_(ji)^(text(old))`
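The penalty term and the decay step can be sketched directly from the two formulas above. The weight matrix and `epsilon = 0.01` are made-up values, and the sketch applies only the decay step, with no error-driven update, to show what happens to weights that the error term never reinforces.

```python
def penalty(weights, epsilon=0.01):
    """Weight-decay penalty: 0.5 * epsilon * sum of squared weights."""
    return 0.5 * epsilon * sum(w * w for row in weights for w in row)

def decay(weights, epsilon=0.01):
    """Decay step: every weight is multiplied by (1 - epsilon)."""
    return [[(1.0 - epsilon) * w for w in row] for row in weights]

w0 = [[1.0, -2.0], [0.5, 0.0]]
initial_penalty = penalty(w0)

# Without any reinforcing gradient, weights shrink geometrically toward zero.
w = w0
for _ in range(100):
    w = decay(w)
```

In a full training loop the decay would be combined with the usual error-driven weight change, so only weights that keep reducing the error stay large.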
- Incremental learning; varying the training set
Multiplicative connections
- Motivation
- Standard units take linear combination of their inputs, but for some problems, general polynomial combinations may be appropriate.
- Sigma-pi units
- Architecture (pairwise products only)
- Input rule
`h_j = sum_i w_(ji) prod_k x_(i_k)`
- Problem: great increase in number of weights for large input spaces
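The input rule for a sigma-pi unit restricted to pairwise products can be sketched as below. The representation of each conjunct as a tuple of input indices, and the sample weights and inputs, are assumptions made for illustration.

```python
from itertools import combinations
from math import comb, prod

def sigma_pi_input(weights, conjuncts, x):
    """h_j = sum over conjuncts of weight * product of the inputs in that conjunct."""
    return sum(w * prod(x[k] for k in idx) for w, idx in zip(weights, conjuncts))

x = [1.0, 2.0, 3.0]
pairs = list(combinations(range(len(x)), 2))   # (0, 1), (0, 2), (1, 2)
h = sigma_pi_input([0.5, -1.0, 2.0], pairs, x)

# One weight per pair: the count grows quadratically with the number of inputs.
weights_for_100_inputs = comb(100, 2)
```

Even with only pairwise products, 100 inputs already require 4950 weights per unit, which illustrates the growth problem noted above.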
- Product units (Durbin & Rumelhart)
- Architecture
- Bias unit with constant activation of -1
- Activation rule for a product unit (there is no squashing function)
`y_j = prod_i x_i^(p_(ji))`
- Equivalent to
`y_j = e^(ln(prod_i x_i^(p_(ji)))) = e^(sum_i p_(ji) ln x_i)`
- For negative inputs, the logarithm is complex.
Authors argue that it is possible to ignore the imaginary component of the output of product units.
- Learning rule for weights into (hidden layer) product units
`I_i -= {(1,if x_i text(is negative)),(0, text(otherwise)):}`
`U_j -= sum_i p_(ji) ln |x_i|`
`V_j -= sum_i p_(ji) I_i`
`Delta p_(ji) = eta delta_j (ln |x_i| e^(U_j) cos pi V_j - I_i pi e^(U_j) sin pi V_j)`
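The bracketed factor in the learning rule is the derivative of the real part of the product unit's output, `e^U cos(pi V)`, with respect to one exponent. A minimal sketch for a single unit follows. It assumes all inputs are nonzero, the function names are hypothetical, and the sample inputs and exponents are arbitrary.

```python
import math

def _u_v(x, p):
    """U = sum_i p_i ln|x_i| and V = sum_i p_i I_i, with I_i = 1 for negative x_i."""
    u = sum(pw * math.log(abs(xi)) for pw, xi in zip(p, x))
    v = sum(pw for pw, xi in zip(p, x) if xi < 0)
    return u, v

def product_unit(x, p):
    """Real part of prod_i x_i^(p_i), i.e. e^U cos(pi V). Inputs must be nonzero."""
    u, v = _u_v(x, p)
    return math.exp(u) * math.cos(math.pi * v)

def product_unit_grad(x, p):
    """Derivative of product_unit with respect to each exponent p_i."""
    u, v = _u_v(x, p)
    e_u = math.exp(u)
    grads = []
    for xi in x:
        ind = 1.0 if xi < 0 else 0.0
        grads.append(math.log(abs(xi)) * e_u * math.cos(math.pi * v)
                     - ind * math.pi * e_u * math.sin(math.pi * v))
    return grads

x = [2.0, -3.0]
p = [1.5, 2.0]
y = product_unit(x, p)
g = product_unit_grad(x, p)
```

Each exponent change would then be `eta * delta_j * g[i]`, with `delta_j` supplied by backpropagation from the layer above.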
- Applications in which product units are superior to summing units
- Boolean (inputs -1 or +1)
- Parity: output unit turns on when even number of inputs on
- Symmetry: output unit turns on when one half of inputs mirrors other half
- Multiplexer: two input units encode which of other input units is transmitted to output.
- Informational capacity of a unit: number of random boolean patterns it can learn
- Real-valued
- Responding to a circular region in two-space
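The parity case from the Boolean applications can be checked with a single product unit whose exponents are all fixed at 1, in which case the unit's output is simply the product of its inputs. This is a hand-set illustration rather than a learned solution, and the choice of four inputs is an assumption. With an even number of inputs, "an even number of inputs on" and "an even number of inputs off" coincide.

```python
from itertools import product as cartesian
from math import prod

# Enumerate every pattern of four inputs, each -1 (off) or +1 (on).
outputs = {}
for bits in cartesian([-1, 1], repeat=4):
    # With all exponents equal to 1, the product unit computes prod_i x_i.
    outputs[bits] = prod(bits)

# The unit outputs +1 exactly when an even number of inputs are on.
parity_ok = all(
    y == (1 if sum(1 for b in bits if b == 1) % 2 == 0 else -1)
    for bits, y in outputs.items()
)
```

A single summing unit with a threshold cannot separate these patterns, which is one reason parity is listed as a case where product units are superior.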