Reinforcement learning
Markov decision processes
- The agent and the environment (the world)
- Discrete time
- States
At each time step, the agent's sensory/perceptual system returns
a state, $x_t$, a representation of its current situation in the environment.
This representation may contain errors and normally omits many aspects of the "real"
situation.
- Actions
At each time step, the agent has the option of executing one of a finite
set of possible actions, $u_t$, each of which potentially puts it in a new state.
- Reinforcements: rewards and punishments
In response to the agent's action in a particular state, the world
provides a reinforcement.
- The reinforcement function in the world: $r(x, u)$
- The next-state function in the world: $s(x, u)$
- An example
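The reinforcement function r(x, u) and the next-state function s(x, u) listed above can be sketched for a tiny world. This is a minimal illustration, not the world the notes necessarily have in mind: the four-state corridor, the action names "l" and "r", and the reinforcement of 1 for reaching state 4 are all assumptions.

```python
# Hypothetical four-state corridor: states 1..4, actions "l" (left) and "r" (right).
# The layout and the placement of the reinforcement are assumptions for illustration.
ACTIONS = {1: ["r"], 2: ["l", "r"], 3: ["l", "r"], 4: ["l"]}
GOAL = 4


def s(x, u):
    """Next-state function: moving right or left changes the state by one."""
    if u not in ACTIONS[x]:
        raise ValueError(f"action {u!r} is not available in state {x}")
    return x + 1 if u == "r" else x - 1


def r(x, u):
    """Reinforcement function: 1 for the action that reaches the goal, else 0."""
    return 1 if s(x, u) == GOAL else 0


# One pass through the world in discrete time: the agent always moves right.
x = 1
history = []
while x != GOAL:
    u = "r"
    history.append((x, u, r(x, u)))  # (state, action, reinforcement)
    x = s(x, u)

print(history)
```

Each entry of `history` records one time step: the state the agent was in, the action it executed, and the reinforcement the world returned.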
Simple reinforcement learning
Q learning
- The real value of an action in a state (optimal Q) depends not only on immediate
reinforcement but also on reinforcements that can be received later as a
result of the next state the agent gets to.
- The value (estimate Q) that an agent stores for each state-action pair should reflect
how much reinforcement it will receive immediately and in the future if
it takes that action in that state.
- Policy: a way of using the stored Q values to select actions.
- More precisely, an optimal Q value for a given state and action is
the sum of all reinforcements received if that action is taken in that state
and the agent then follows the optimal policy specified by the other Q values.
A first definition:
$$Q_{\mathrm{opt}}(x_t, u_t) = r(x_t, u_t) + \max_{u_{t+1}} Q_{\mathrm{opt}}(x_{t+1}, u_{t+1})$$
- But this causes problems because there may be many, even an infinite number of,
future reinforcements, so the sum may grow without bound.
We need to weight the future by a discount rate ($\gamma$) between
0 and 1.
$$Q_{\mathrm{opt}}(x_t, u_t) = r(x_t, u_t) + \gamma \max_{u_{t+1}} Q_{\mathrm{opt}}(x_{t+1}, u_{t+1})$$
- To approach optimal Q values, the learner starts with 0 or random values
for each state-action pair, then updates the values gradually using
the reinforcement received and what it thinks is the best Q value for
the next state. Here $\eta$ is the learning rate.
$$Q_{\mathrm{new}}(x_t, u_t) = (1 - \eta)\, Q_{\mathrm{old}}(x_t, u_t) + \eta \left\{ r(x_t, u_t) + \gamma \max_{u_{t+1}} Q_{\mathrm{old}}(x_{t+1}, u_{t+1}) \right\}$$
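The update rule can be written as a single function over a lookup table. This is a minimal sketch: the function name `q_update`, the dictionary representation of the table, and the default values γ = 0.8 and η = 0.5 are illustrative assumptions.

```python
def q_update(q, x, u, reward, x_next, next_actions, gamma=0.8, eta=0.5):
    """Apply one Q-learning update to the table q, a dict keyed by (state, action).

    Pairs missing from the table are treated as having a Q value of 0.
    """
    # Best Q value the learner currently believes is available in the next state.
    best_next = max((q.get((x_next, v), 0.0) for v in next_actions), default=0.0)
    # Reinforcement received plus the discounted value of the next state.
    new = reward + gamma * best_next
    # Blend the old estimate with the new one using the learning rate eta.
    q[(x, u)] = (1 - eta) * q.get((x, u), 0.0) + eta * new
    return q[(x, u)]


q = {}
first = q_update(q, x=3, u="r", reward=1, x_next=4, next_actions=["l"])
second = q_update(q, x=3, u="r", reward=1, x_next=4, next_actions=["l"])
print(first, second)
```

With η below 1, a single experience moves the stored value only part of the way toward the new estimate, so repeated visits are needed before the value settles.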
An example
γ = 0.8, η = 0.5 and all Q values initialized at 0. In the chart, "new" means
the reinforcement received plus the discounted maximum value of the next state.
The "new" value is combined with the "old" using the learning rate to give the
updated Q value appearing in the next line of the chart.
(Note: in this example, in order to illustrate how the agent can learn to "look ahead", it is effectively picked up after it reaches the goal state and dropped back in state 1. There is no "natural" way of reaching state 1 from state 4.)
| x | Q(1,r) | Q(2,r) | Q(2,l) | Q(3,r) | Q(3,l) | Q(4,l) | new | u |
|---|--------|--------|--------|--------|--------|--------|-----|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | r |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | r |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | r |
| 4 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | l |
| 1 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0 | r |
| 2 | 0 | 0 | 0 | 0.5 | 0 | 0 | 0.4 | r |
| 3 | 0 | 0.2 | 0 | 0.5 | 0 | 0 | 1 | r |
| 4 | 0 | 0.2 | 0 | 0.75 | 0 | 0 | 0 | l |
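The chart above can be replayed in a few lines of code. This is a sketch under assumptions inferred from the chart and the note before it: a reinforcement of 1 for moving right from state 3, 0 otherwise, and any action in state 4 dropping the agent back in state 1. The names `step`, `PAIRS`, and `rows` are hypothetical.

```python
GAMMA, ETA = 0.8, 0.5
ACTIONS = {1: ["r"], 2: ["l", "r"], 3: ["l", "r"], 4: ["l"]}
# Column order of the Q values in the chart.
PAIRS = [(1, "r"), (2, "r"), (2, "l"), (3, "r"), (3, "l"), (4, "l")]


def step(x, u):
    """Return (reinforcement, next state) for the assumed world."""
    if x == 4:
        # Goal reached: the agent is picked up and dropped back in state 1.
        return 0.0, 1
    x_next = x + 1 if u == "r" else x - 1
    return (1.0 if x_next == 4 else 0.0), x_next


q = {pair: 0.0 for pair in PAIRS}
x = 1
rows = []
for u in ["r", "r", "r", "l", "r", "r", "r", "l"]:
    reward, x_next = step(x, u)
    # "new": reinforcement plus the discounted maximum value of the next state.
    new = reward + GAMMA * max(q[(x_next, v)] for v in ACTIONS[x_next])
    rows.append((x, [q[p] for p in PAIRS], new, u))
    # Combine "new" with "old" using the learning rate.
    q[(x, u)] = (1 - ETA) * q[(x, u)] + ETA * new
    x = x_next

for state, q_values, new, u in rows:
    print(state, [round(v, 2) for v in q_values], round(new, 2), u)
```

On the second pass, Q(2,r) becomes nonzero even though moving right from state 2 is never directly reinforced. This is the "look ahead" effect: value propagates backward from Q(3,r) through the discounted maximum.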
Making decisions
- How is the agent to pick an action?
One possibility is exploitation, to pick the
action that has the highest Q-value for the current state.
- But the agent can only learn about the value of actions that it tries.
Thus it should try a variety of actions.
This may involve ignoring what it thinks is best some of the time:
exploration.
- Exploration makes more sense early on in learning when the agent
doesn't know much.
- One possibility for selecting an action:
pick the "best" action with probability
$P = 1 - e^{-E a}$, where $a$ is the number
of training samples (the "age" of the agent).
[Figure: probability of selecting the "best" action as a function of age, E = 0.1]
[Figure: probability of selecting the "best" action as a function of age, E = 0.01]
- A smarter possibility would be to have the probability of picking an action
depend on how high its value is relative to the values of all of the other
possible actions.
Here is one way:
[Equation: image missing]
where the v's represent all of the possible actions in state
$x_t$.
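Both selection schemes can be sketched in code. Two points here are assumptions rather than anything stated in the notes: when the age-based rule does not pick the "best" action, this sketch picks uniformly at random among all actions; and for the value-dependent rule it uses Boltzmann (softmax) weighting with a hypothetical `temperature` parameter, which is one common choice and not necessarily the formula the notes intend.

```python
import math
import random


def p_best(age, e=0.1):
    """Probability of picking the best-valued action: P = 1 - exp(-E * age)."""
    return 1.0 - math.exp(-e * age)


def choose_by_age(q_values, age, e=0.1, rng=random):
    """Exploit with probability p_best(age); otherwise explore.

    Exploring by a uniform random choice is an assumption of this sketch.
    q_values maps each available action to its stored Q value.
    """
    actions = list(q_values)
    if rng.random() < p_best(age, e):
        return max(actions, key=lambda u: q_values[u])
    return rng.choice(actions)


def softmax_probs(q_values, temperature=1.0):
    """Assumed Boltzmann rule: exp(Q / T), normalized over all actions v."""
    top = max(q_values.values())  # subtract the max for numerical stability
    weights = {u: math.exp((q - top) / temperature) for u, q in q_values.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


def choose_softmax(q_values, temperature=1.0, rng=random):
    """Sample an action with probability given by softmax_probs."""
    probs = softmax_probs(q_values, temperature)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[u] for u in actions])[0]


q_values = {"l": 0.0, "r": 0.5}
for age in (1, 10, 100):
    print(age, round(p_best(age, 0.1), 3), round(p_best(age, 0.01), 3))
print(softmax_probs(q_values))
```

A larger E makes the agent settle into exploitation sooner. In the softmax sketch, a lower temperature concentrates probability on the highest-valued action and a higher one spreads it more evenly.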
Implementing Q learning
- A lookup table: a Q value for each state-action pair
- But in the real world the number of states may be very large, even
infinite.
Distributed representations of states permit
- More efficient coding
- Generalization to novel states
- A neural network
- Inputs are distributed representations of states.
- Outputs are Q values, one for each action (represented locally).
- Weights represent associations of state features with actions.
- Error-driven learning: for the selected action, the target
is the "new" Q value from the Q learning rule.
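The network version can be sketched with the simplest possible architecture. The notes do not specify one, so everything concrete here is an assumption: a single linear layer with one output unit per action, a delta-rule weight update, a three-feature state encoding, and the names `q_outputs` and `train_step`.

```python
def q_outputs(weights, features):
    """One linear output unit per action: Q(x, u) = sum_i w[u][i] * f_i(x)."""
    return {
        u: sum(w * f for w, f in zip(ws, features))
        for u, ws in weights.items()
    }


def train_step(weights, features, u, reward, next_features, gamma=0.8, rate=0.1):
    """Move the selected action's output toward the "new" Q value."""
    # Target from the Q-learning rule: r + gamma * max_v Q(x', v).
    target = reward + gamma * max(q_outputs(weights, next_features).values())
    error = target - q_outputs(weights, features)[u]
    # Delta rule: only the weights feeding the selected action's unit change.
    weights[u] = [w + rate * error * f for w, f in zip(weights[u], features)]
    return error


# Hypothetical distributed encoding: two states that share their third feature.
weights = {"l": [0.0, 0.0, 0.0], "r": [0.0, 0.0, 0.0]}
state = [1.0, 0.0, 1.0]
next_state = [0.0, 1.0, 1.0]

errors = [
    abs(train_step(weights, state, "r", 1.0, next_state))
    for _ in range(50)
]
print(errors[0], errors[-1])
print(q_outputs(weights, next_state))
```

Because the two states share a feature, training on `state` also raises the Q value of action "r" in `next_state`, which was never trained directly. That is the generalization to novel states that a lookup table cannot provide.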
Computer.Science@IU | Fall 2004 | B551 Elements of Artificial Intelligence