- Unit activation
`x_j(t+1) = g(h_j(t)) = g[h(vec x(t), vec w_j(t))]`
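As a minimal sketch of this activation step, the code below assumes the net input `h_j` is the weighted sum `sum_k x_k w_{jk}` used in (6) and picks the sigmoid for `g`. The names `net_input`, `sigmoid`, and `activate` are illustrative, not part of the notes.

```python
import math

def net_input(x, w_j):
    # h_j: weighted sum of the input activations x_k with the weights w_jk
    return sum(x_k * w_jk for x_k, w_jk in zip(x, w_j))

def sigmoid(h):
    # One possible choice for the activation function g
    return 1.0 / (1.0 + math.exp(-h))

def activate(x, w_j, g=sigmoid):
    # x_j(t+1) = g(h_j(t))
    return g(net_input(x, w_j))
```

Any other differentiable function can be passed as `g`; for example, `activate(x, w_j, g=math.tanh)`.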
-
The only element in the sum of error terms that depends on the weight `w_{ji}` is
the one for the output unit `j` where that weight ends, so the derivative we need is
(1) `(del E) / (del w_{ji})`
-
Using the chain rule, we can decompose this derivative into two that are easier
to calculate:
(2) `(del E)/(del w_{ji}) = (del E)/(del h_j) (del h_j)/(del w_{ji})`
-
The first derivative can also be decomposed:
(3) `(del E)/(del h_j) = (del E)/(del x_j) (del x_j)/(del h_j)`
-
The first derivative in (3) is easy to compute; for the summed squared error
`E = 1/2 sum_k (t_k - x_k)^2` it is just
(4) `(del E)/(del x_j) = -(t_j - x_j)`
-
Since the activation of an output unit is the activation function `g`
applied to the input `h_j`, the second derivative in (3) is just
(5) `(del x_j)/(del h_j) = g prime(h_j)`
that is, the derivative of the activation function evaluated at the
current input to unit `j`.
-
The second derivative in (2) can be computed as follows:
(6) `(del h_j)/(del w_{ji}) = (del sum_k x_k w_{jk})/(del w_{ji}) = x_i`
because only the term with `k = i` contains `w_{ji}`; none of the other weights
or input activations depend on it.
-
Putting all of the parts together, we get
(7) `(del E)/(del w_{ji}) = -(t_j - x_j) g prime(h_j) x_i`
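One way to sanity-check (7) is to compare it with a finite-difference estimate of the derivative. The sketch below assumes the squared error `E = 1/2 sum_j (t_j - x_j)^2`, a single layer of sigmoid output units, and a weight matrix stored as a list of rows `W[j][i]`. All function names are illustrative.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

def forward(x, W):
    # x_j = g(h_j) with h_j = sum_i x_i * W[j][i]
    return [sigmoid(sum(w * x_i for w, x_i in zip(row, x))) for row in W]

def error(x, W, t):
    # Assumed error: E = 1/2 * sum_j (t_j - x_j)^2
    return 0.5 * sum((t_j - x_j) ** 2 for t_j, x_j in zip(t, forward(x, W)))

def analytic_grad(x, W, t, j, i):
    # Equation (7), using g'(h_j) = x_j * (1 - x_j) for the sigmoid
    x_j = forward(x, W)[j]
    return -(t[j] - x_j) * x_j * (1.0 - x_j) * x[i]

def numeric_grad(x, W, t, j, i, eps=1e-6):
    # Central difference: (E(w + eps) - E(w - eps)) / (2 * eps)
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[j][i] += eps
    W_minus[j][i] -= eps
    return (error(x, W_plus, t) - error(x, W_minus, t)) / (2.0 * eps)
```

If the derivation is right, the two functions should agree to within the accuracy of the finite-difference approximation for every weight `W[j][i]`.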
-
Remember that we want the weight change to be proportional to the negative
of the derivative of the error with respect to the weight. So with a learning
rate `eta` to control the step size for weight changes, we get the more general
delta (least mean squares) learning rule
(8) `Delta w_{ji} = eta (t_j - x_j) g prime(h_j) x_i`
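Rule (8) can be sketched as a single update over all weights of one layer of output units. The function below is an illustration under the same assumptions as the derivation (net input as a weighted sum, weights stored as rows `W[j][i]`). The names `delta_rule_step`, `identity`, and `identity_prime` are hypothetical.

```python
def identity(h):
    # Linear activation: g(h) = h
    return h

def identity_prime(h):
    # Derivative of the linear activation: g'(h) = 1
    return 1.0

def delta_rule_step(W, x, t, g=identity, g_prime=identity_prime, eta=0.1):
    """Return new weights after applying Delta w_ji = eta (t_j - x_j) g'(h_j) x_i."""
    new_W = []
    for j, row in enumerate(W):
        h_j = sum(w * x_i for w, x_i in zip(row, x))   # net input to unit j
        x_j = g(h_j)                                   # output activation
        delta_j = (t[j] - x_j) * g_prime(h_j)          # error times slope
        new_W.append([w + eta * delta_j * x_i for w, x_i in zip(row, x)])
    return new_W
```

Calling `delta_rule_step` repeatedly on the same pattern moves the output `x_j` toward the target `t_j`, provided `eta` is small enough.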
-
With a linear activation function, `g prime(h_j) = 1` and the weight change reduces to
`eta (t_j - x_j) x_i`. With the sigmoid, `g prime(h_j) = x_j(1 - x_j)` and it becomes
`eta (t_j - x_j) x_j(1 - x_j) x_i`; with tanh, `g prime(h_j) = 1 - x_j^2` and it becomes
`eta (t_j - x_j) (1 - x_j^2) x_i`.
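The three special cases can be sketched side by side. For each activation, the derivative is written in terms of the output `x_j`, so it can be computed without re-evaluating `g`. The table `ACTIVATIONS` and the function `weight_change` are illustrative names, not part of the notes.

```python
import math

# Each entry: (g, derivative of g expressed via the output x_j = g(h_j))
ACTIVATIONS = {
    "linear": (lambda h: h, lambda x_j: 1.0),
    "sigmoid": (lambda h: 1.0 / (1.0 + math.exp(-h)), lambda x_j: x_j * (1.0 - x_j)),
    "tanh": (math.tanh, lambda x_j: 1.0 - x_j ** 2),
}

def weight_change(name, t_j, h_j, x_i, eta=0.1):
    # Delta w_ji = eta (t_j - x_j) g'(h_j) x_i for the chosen activation
    g, slope_from_output = ACTIVATIONS[name]
    x_j = g(h_j)
    return eta * (t_j - x_j) * slope_from_output(x_j) * x_i
```

For the same error `t_j - x_j`, the sigmoid and tanh versions shrink the update as the unit saturates, because their slopes go to zero at the extremes of the output range.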