Dialog systems

Automatic speech recognition
Natural language understanding, voice search
Dialog manager
- System initiative and user initiative
- Choosing a dialog action based on a dialog state
- Learned and built-in strategies
- Reinforcement learning for dialog management
Natural language generation
Text-to-speech

Passonneau, Epstein, Ligorio, Gordon, and Bhutada

Dealing with noisy ASR (high word error rate) in dialog systems
Voice search
Wizard-of-Oz studies
Responses to misunderstandings and non-understandings
CheckItOut
Experiments
- ASR and query return
- Wizard task
- Modeling wizard actions
  - Features: display type of returned list, recent success, contiguous word match, number of candidates, confidence in ASR output
  - Behavior of most successful wizard
Implications for the design of dialog systems

Reinforcement learning: the basic idea

Reinforcement learning: the problem of how an agent figures out how it can act in an environment so as to maximize long-term reward
Markov decision processes
- States and actions
  - The "real" state of the agent and what it "thinks" its state is
  - Actions take agents from one state to another.
  - In each state there is a set of actions to choose from.
- Discrete time
- Reinforcements
  - Rewards and punishments
  - Reinforcements depend only on the current state and the action taken.
  - The reinforcement function is built into the environment and is not known to the agent; it may be non-deterministic.
- The next-state function
  - Function of the current state and the action.
  - The next-state function is built into the environment and is not known to the agent; it may be non-deterministic.
- Values
  - The actual value of a state and what the agent thinks the value is
  - The actual value of an action in a given state and what the agent thinks the value is (Q-values)
Policies
- A policy maps states into actions the agent ought to take in those states
- The goal of RL is to learn a policy that maximizes the sum of future reinforcements
- The problem of delayed reinforcement (the temporal credit assignment problem): how to choose an action in a state where no reinforcement follows immediately, although the action may affect reinforcements received later
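A policy can be pictured as a lookup table from states to actions. The sketch below assumes the agent already stores a value for each state-action pair (the Q-values mentioned above) and simply picks the highest-valued action in each state. The function name `greedy_policy` and the numbers in `q` are made up for illustration.

```python
def greedy_policy(q):
    """Map each state to the action with the highest stored value.

    q is a dict keyed by (state, action) pairs; this layout is an
    assumption made for this sketch.
    """
    policy = {}
    for (state, action), value in q.items():
        if state not in policy or value > q[(state, policy[state])]:
            policy[state] = action
    return policy


# Illustrative values only (not taken from any experiment).
q = {("3", "r"): 0.75, ("3", "l"): 0.0, ("2", "r"): 0.2, ("2", "l"): 0.0}
policy = greedy_policy(q)
print(policy)
```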
Learning
- Long-term memory: the agent's record of what it has learned about the environment
- Learning: updating the long-term memory on the basis of new information from the environment
- The importance of learning in small steps, controlled by the learning rate
How the agent uses its knowledge
- Exploitation: relying on what's been learned
- Exploration: behaving stochastically; trial-and-error

Q learning

Q learning tries to find a value for each pair of a state (s) and an action (a).
- Each Q-value is a function of a state and an action: `Q(s, a)`.
- The Q-value for a given s, a pair needs to be based on the reinforcement that the environment gives the agent when the agent performs a in s (that is, what the environment's reinforcement function returns).
- But sometimes there's no reinforcement; instead an action may make a future reinforcement possible.
- So the Q-value also needs to somehow take into account what happens in the next state (that is, what the environment's next-state function returns).
The trick
- The basic idea: take the Q-value to be the sum of all future reinforcements received by the agent when taking that action and then following its policy.
- The agent can't know this, so it has to estimate it, treating the Q-value of a state-action pair recursively. The value is (a function of)
  - the immediate reinforcement received and
  - (what the agent currently believes to be) the value of the best action in the new state that is reached.
- So at any given point during learning, the agent has an estimate of the values for all state-action pairs. On a given learning trial, the agent is in state `s_t`. It does the following.
  - Selects an action `a_t` and executes it.
  - Notes whatever reinforcement it receives as a result: `r(s_t, a_t)`
  - Notes what new state it is in: `s_{t+1}`
  - Finds what it believes to be the value of the best action in the new state: `Q^{best}(s_{t+1}, a)`
  - Combines the reinforcement it received and its estimate of the best next-state value to update its estimated value for the pair `s_t`, `a_t`
Discounting the next state estimate
- Just adding the reinforcement to the estimated next-state value causes problems if the sum of future reinforcements is unbounded (for example, when the agent keeps acting indefinitely).
- Usually it is the discounted best value of the next state that is added, that is, that estimate times a discount rate γ.
- So we could update the Q-value for a state-action pair using this update equation:
  `Q(s_t, a_t) = r(s_t, a_t) + gamma max_{a_{t+1}} Q(s_{t+1}, a_{t+1})`
  The expression beginning with "max" means the highest of the Q-values that the agent has stored in its long-term memory for the next state, taken over all of the actions `a_{t+1}` that are possible in that state.
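The right-hand side of this update equation can be sketched as a small function. This is a minimal illustration, assuming the Q-values are kept in a dict keyed by (state, action) and that `actions` lists the actions available in each state. The names `discounted_target`, `q`, and `actions` are hypothetical.

```python
GAMMA = 0.8  # discount rate, the value used in the example chart


def discounted_target(q, actions, reward, next_state, gamma=GAMMA):
    """Return r(s_t, a_t) + gamma * max over a_{t+1} of Q(s_{t+1}, a_{t+1})."""
    best_next = max(q[(next_state, a)] for a in actions[next_state])
    return reward + gamma * best_next


# Illustrative stored values for state "3" only.
actions = {"3": ["r", "l"]}
q = {("3", "r"): 0.5, ("3", "l"): 0.0}

# Taking an action with no reinforcement that leads to state "3":
value = discounted_target(q, actions, reward=0.0, next_state="3")
```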
Learning rate
- But the information we get on a given learning trial may be misleading, so replacing the old estimate outright, as this update equation does, moves too quickly.
- Instead we usually move relatively slowly in the direction indicated by the current evidence; that is, there is a learning rate (η), which controls the step size of the learning. With this, the update equation is:
  `Q^{t+1}(s_t, a_t) = (1-eta) Q^t(s_t, a_t) + eta [r(s_t, a_t) + gamma max_{a_{t+1}} Q^t(s_{t+1}, a_{t+1})]`
  This equation combines the old knowledge that the agent has with the new information coming from the current experience of receiving a reinforcement and ending up in a new state.
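The update with a learning rate can be written as a one-line function. The sketch below assumes the caller has already looked up the old estimate and the best next-state value. The name `q_update` and its argument names are chosen for illustration.

```python
def q_update(old, reward, best_next, eta=0.5, gamma=0.8):
    """Blend the old estimate with the new evidence.

    old       -- Q^t(s_t, a_t), the current estimate
    reward    -- r(s_t, a_t)
    best_next -- max over a_{t+1} of Q^t(s_{t+1}, a_{t+1})
    eta       -- learning rate (step size)
    gamma     -- discount rate
    """
    return (1 - eta) * old + eta * (reward + gamma * best_next)


# With eta = 0.5, one rewarded trial moves an estimate of 0 halfway to 1.
first = q_update(old=0.0, reward=1.0, best_next=0.0)
# A second identical trial moves it halfway again.
second = q_update(old=first, reward=1.0, best_next=0.0)
```

Setting `eta=1.0` recovers the earlier update equation, in which the new information replaces the old estimate entirely.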

An example of Q learning

γ = .8, η = .5 and all values initialized at 0. In the chart, "new" means the reinforcement received plus the discounted maximum value of the next state. The "new" value is combined with the "old" using the learning rate to give the updated Q-value appearing in the next line of the chart.

s	a	r	Q(1,r)	Q(1,s)	Q(2,r)	Q(2,l)	Q(3,r)	Q(3,l)	Q(G,s)	new
1	r	0	0	0	0	0	0	0	0	0
2	r	0	0	0	0	0	0	0	0	0
3	r	1	0	0	0	0	0	0	0	1
G	s	0	0	0	0	0	.5	0	0	0
1	r	0	0	0	0	0	.5	0	0	0
2	r	0	0	0	0	0	.5	0	0	.4
3	r	1	0	0	.2	0	.5	0	0	1
G	s	0	0	0	.2	0	.75	0	0	0
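The chart can be replayed in a few lines of code. The sketch below assumes an environment inferred from the chart rather than stated in it: action `r` moves the agent along 1, 2, 3, G; taking `r` in state 3 yields a reinforcement of 1; and `s` in G leaves the agent in G until the next pass starts again at state 1. All names (`ACTIONS`, `NEXT_STATE`, `REWARD`, `run`) are illustrative.

```python
GAMMA, ETA = 0.8, 0.5

# Assumed environment, reconstructed from the chart's columns and rows.
ACTIONS = {"1": ["r", "s"], "2": ["r", "l"], "3": ["r", "l"], "G": ["s"]}
NEXT_STATE = {("1", "r"): "2", ("2", "r"): "3", ("3", "r"): "G", ("G", "s"): "G"}
REWARD = {("3", "r"): 1.0}  # every other reinforcement is 0


def run(trials):
    """Apply the Q-learning update to a fixed sequence of (state, action) trials."""
    q = {(s, a): 0.0 for s, acts in ACTIONS.items() for a in acts}
    rows = []
    for s, a in trials:
        r = REWARD.get((s, a), 0.0)
        s_next = NEXT_STATE[(s, a)]
        # "new": reinforcement plus discounted best value of the next state
        new = r + GAMMA * max(q[(s_next, b)] for b in ACTIONS[s_next])
        rows.append((s, a, r, dict(q), new))  # Q-values before the update
        q[(s, a)] = (1 - ETA) * q[(s, a)] + ETA * new
    return q, rows


# Two passes through the states, as in the chart.
trials = [("1", "r"), ("2", "r"), ("3", "r"), ("G", "s")] * 2
q, rows = run(trials)
for s, a, r, before, new in rows:
    print(s, a, r, round(before[("2", "r")], 2), round(before[("3", "r")], 2), round(new, 2))
```

Under these assumptions the "new" column and the values of Q(2,r) and Q(3,r) match the chart: the reinforcement at state 3 reaches state 2 only on the second pass, which is the delayed-reinforcement effect in miniature.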

Reinforcement learning: making decisions

How is the agent to pick an action?
The exploration-exploitation tradeoff
- Exploitation: pick the action that has the highest Q-value for the current state
- But the agent can only learn about the value of actions that it tries; it should try a variety of actions, including actions other than the ones it thinks are best: exploration.
- Exploration makes more sense early on in learning when the agent doesn't know much; the probability of exploration should decrease with experience.
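One simple way to make exploration fade with experience is to explore with a probability that decays over trials and otherwise exploit. This is sometimes called a decaying epsilon-greedy rule; it is not the scheme described in these notes, and the decay schedule and its constants are assumptions chosen for illustration.

```python
import random


def exploration_probability(trial, start=1.0, decay=0.99, floor=0.05):
    """Probability of exploring on a given trial; shrinks geometrically to a floor."""
    return max(floor, start * decay ** trial)


def choose_action(q, state, actions, trial, rng=random):
    """Explore with the decayed probability, otherwise pick the highest Q-value."""
    if rng.random() < exploration_probability(trial):
        return rng.choice(actions)  # exploration: try any available action
    return max(actions, key=lambda a: q[(state, a)])  # exploitation


# Illustrative values only.
q = {("3", "r"): 0.75, ("3", "l"): 0.0}
early = exploration_probability(0)
late = exploration_probability(1000)
action = choose_action(q, "3", ["l", "r"], trial=1000)
```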
Basing the decision on the relative Q-values of actions
- The probability of picking an action depends on how high its value is relative to the values of all of the other possible actions: the Luce choice rule.
- A way to implement this: a separate exploitation probability for each action `a^i` (`E` is a parameter controlling the exploration-exploitation tradeoff; the sum in the denominator runs over all actions `a^j` available in `s_t`):
  `P_{text(exploit)}(a_t^i) = (e^{E * Q(s_t, a_t^i)}) / (sum_j e^{E * Q(s_t, a_t^j)})`
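The choice rule can be sketched as follows, assuming the Q-values for the current state are passed in as a list. The function name `choice_probabilities` is hypothetical. Subtracting the largest exponent before calling `exp` is a numerical-stability step that leaves the probabilities unchanged.

```python
import math
import random


def choice_probabilities(q_values, e_param):
    """Return one probability per action, proportional to exp(E * Q)."""
    scaled = [e_param * v for v in q_values]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]  # shift avoids overflow
    total = sum(weights)
    return [w / total for w in weights]


q_values = [0.0, 0.2, 0.75]  # illustrative Q-values for three actions
uniform = choice_probabilities(q_values, e_param=0.0)   # pure exploration
sharp = choice_probabilities(q_values, e_param=20.0)    # nearly pure exploitation

# Sample an action index according to the probabilities.
index = random.choices(range(len(q_values)), weights=sharp, k=1)[0]
```

With `E = 0` every action is equally likely; as `E` grows, the probability mass concentrates on the action with the highest Q-value, so increasing `E` over the course of learning shifts the agent from exploration toward exploitation.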

An example from Levin (2000)

Tutorial task: system must get the month and day from the user and then close the conversation
Possible actions:
- Ask for month
- Ask for day
- Ask for date (month and day)
- Close conversation
Reinforcement (cost)
- Number of interactions
- Number of slots left unfilled
- Number of recognition errors
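The three cost components can be combined into a single number for a finished dialog. The sketch below assumes a weighted sum, which these notes do not state explicitly; the weights and the function name `dialog_cost` are hypothetical placeholders rather than values from the paper.

```python
def dialog_cost(n_interactions, n_unfilled_slots, n_recognition_errors,
                w_interaction=1.0, w_unfilled=5.0, w_error=2.0):
    """Weighted sum of the three cost components (weights are illustrative)."""
    return (w_interaction * n_interactions
            + w_unfilled * n_unfilled_slots
            + w_error * n_recognition_errors)


# Ask for the date once, then close: two interactions, both slots filled, no errors.
short_dialog = dialog_cost(n_interactions=2, n_unfilled_slots=0, n_recognition_errors=0)

# Close immediately: one interaction, but month and day are both unfilled.
empty_dialog = dialog_cost(n_interactions=1, n_unfilled_slots=2, n_recognition_errors=0)
```

With weights like these, closing the conversation immediately is cheap in interactions but expensive in unfilled slots, so a learner minimizing the cost has a reason to ask before closing.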
Reinforcement learning for the task
