As with value iteration, we can use the delta rule or backpropagation to
update the weights.
The learning rule, applied to each weight `w_{ji}` at time step `t`, is:
`Delta w_{ji,t} = eta [ [r(vec s_t, a_t) + gamma max_{a_{t+1}} Q(vec s_{t+1}, a_{t+1})] - Q(vec s_t, a_t) ]
(del Q(vec s_t, a_t))/ (del w_{ji})`
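
To make the rule concrete, here is a minimal sketch of one update step in Python with NumPy. It assumes the simplest possible approximator, a linear unit per action with `Q(s, a) = w[a] . s`, so that the derivative of `Q(s, a)` with respect to the weights in row `a` is just `s` and the rule reduces to the delta rule. The function names `q_values` and `q_learning_update`, the weight layout (one row per action), and the default values of `eta` and `gamma` are illustrative assumptions, not part of the rule itself. With a multi-layer network the derivative would instead be obtained by backpropagation.

```python
import numpy as np


def q_values(w, s):
    """Q(s, a) for every action a, assuming a linear approximator Q(s, a) = w[a] . s."""
    return w @ s


def q_learning_update(w, s, a, r, s_next, eta=0.1, gamma=0.9):
    """Return a copy of w after one delta-rule step on the transition (s, a, r, s_next)."""
    # Target: r(s_t, a_t) + gamma * max over a_{t+1} of Q(s_{t+1}, a_{t+1})
    target = r + gamma * np.max(q_values(w, s_next))
    # Bracketed error term of the learning rule: target - Q(s_t, a_t)
    td_error = target - q_values(w, s)[a]
    w_new = w.copy()
    # For a linear unit, dQ(s, a)/dw[a] = s; the rows of other actions have zero gradient.
    w_new[a] += eta * td_error * s
    return w_new


# Tiny example: 2 actions, 3 state features (all values are made up for illustration).
w = np.zeros((2, 3))
s = np.array([1.0, 0.0, 0.0])
s_next = np.array([0.0, 1.0, 0.0])
w1 = q_learning_update(w, s, a=1, r=1.0, s_next=s_next)
```

Starting from zero weights, the target is `1.0`, the error is `1.0`, and only the weight in row `a = 1` that sees the active feature of `s` changes, by `eta * 1.0 = 0.1`.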