Background for example-based MT

Requires a database of parallel sentences (the system's KB)
Relies on analogy between input sentence `S_r` and examples in the database
`S_i,S_j,...` : `S_r` :: `T_i,T_j,...` : ?
(the source sides of the matching pairs are to the input sentence as the target sides of the matching pairs are to the target sentence)
Related to case-based reasoning in AI
Related to translation memory systems
May store parsed sentences or generalized examples
Relatively theory-neutral (in comparison to rule-based MT)

The basic steps

Matching: identify sentence(s) in database that match the input on the basis of fragments of the input.
Alignment/adaptation: identify corresponding fragments in target-language sentences.
Recombination: recombine target-language fragments to form output.

An example

I haven't read the book that you lent me.

Matching
- I haven't done the task that you assigned me.
  Je n'ai pas fait le tâche que tu m'as confié.
- I haven't opened the present that you sent me.
  Je n'ai pas ouvert le cadeau que tu m'as envoyé.
- You haven't read the book.
  Tu n'as pas lu le livre.
- He hasn't read the book.
  Il n'a pas lu le livre.
- Why don't you return the DVD that I lent you?
  Pourquoi tu ne me rends pas le DVD que je t'ai prêté?
- Where is the painting that Frank lent us?
  Où est la peinture que Frank nous a prêté?
Alignment/adaptation
- I haven't ... the ... that you ... me.
  Je n'ai pas ... le ... que tu m'as ...
- ... read the book
  ... lu le livre
- ... the ... that ... lent ...
  ... le/la ... que ... prêté
Recombination
- Je n'ai pas ... le ... que tu m'as ...
- ... lu le livre ...
- ... le/la ... que ... prêté
- Output: Je n'ai pas lu le livre que tu m'as prêté.

Matching

The problem: Given an input source language sentence `S_r`, find one or more translation pairs `S_i, T_i` such that `S_i` matches `S_r`.
Similarity measures; how do we measure the similarity of `S_i` to `S_r`?
Simple matching: look for sentences `S_i` containing one or more words matching words in `S_r` and in the same positions relative to one another.
Category matching: given generalizations over words, look for sentences `S_i` containing one or more words or categories that match the words or categories in `S_r` and in the same positions relative to one another.
Structure matching: given parsed translation pairs, look for sentences `S_i` containing structure (subtrees) that match structure in `S_r`.
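Simple matching can be sketched as follows. This is only an illustration, not the method of any particular EBMT system: it scores each example by the number of words it shares with the input in the same relative order (the longest common subsequence). The helper names `tokenize`, `lcs_length`, and `rank_examples` are hypothetical, and the tokenizer is deliberately crude.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def tokenize(sentence):
    # Assumed tokenizer: lowercase, drop final punctuation, split on spaces.
    return sentence.lower().rstrip(".?!").split()


def rank_examples(input_sentence, examples):
    """Sort example source sentences by in-order word overlap with the input."""
    query = tokenize(input_sentence)
    scored = [(lcs_length(query, tokenize(s)), s) for s in examples]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


examples = [
    "Where is the painting that Frank lent us?",
    "I haven't done the task that you assigned me.",
    "You haven't read the book.",
]
for score, sentence in rank_examples("I haven't read the book that you lent me.", examples):
    print(score, sentence)
```

A raw overlap count favors long examples and treats function words like content words; a real system would presumably normalize the score and weight the words.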

Matching: some problems

- They gave up the plan that was proposed.
  They gave up the plan.
  They gave the plan up.
  They gave it up.
- Varying gap lengths between matched elements
- They approved the recommendation.
  They rejected the proposal.
- Data sparsity and lexical categories

Learning generalizations from the database (Brown, 2001)

Some `S`-`T` pairs:
- The doctor treated the patient .
  El médico trató al paciente.
  ሐኪሙ ሕመምተኛውን አከመው ።
- The doctor examined the patient .
  El médico examinó al paciente.
  ሐኪሙ ሕመምተኛውን መረመረው ።
- The doctor anesthetized the patient .
  El médico anestesió al paciente.
  ሐኪሙ ሕመምተኛውን አስተኛው ።
- The doctor examined the child .
  El médico examinó al niño.
  ሐኪሙ ልጁን መረመረው ።
- The cardiologist treated the patient . ...
- The doctor examined the patient's liver . ...
- The cardiologist treated the patient's heart . ...
- The patient recovered in the hospital . ...
- The patient recovered at home . ...
- The patient recovered unexpectedly . ...
- The patient recovered in two days . ...

Learning generalizations from the database (Brown, 2001)

Given two or more source-language sentences in the database with shared material, find the different parts and the corresponding different parts in the target sentences, and treat these paired segments as a category, representing them in a rewritten database with category labels.
- The doctor treated the patient .
  El médico trató al paciente.
  ሐኪሙ ሕመምተኛውን አከመው ።
  The doctor examined the patient .
  El médico examinó al paciente.
  ሐኪሙ ሕመምተኛውን መረመረው ።
- CAT1 → treated/trató/አከመው; examined/examinó/መረመረው
- CAT2 → the doctor/el médico/ሐኪሙ; ...
- The CAT2 CAT1 the patient's liver . ...
  The CAT2 CAT1 the patient's heart . ...
- CAT3 → liver/hígado/ጉበት; heart/corazón/ልብ
- The patient recovered in the hospital . ...
  The patient recovered at home . ...
  The patient recovered unexpectedly . ...
  The patient recovered in two days . ...
- CAT4 → in the hospital/...; at home/...; unexpectedly/...; in two days/...
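The idea of pairing the differing parts of two similar translation pairs can be sketched with Python's standard `difflib`. This is a rough illustration under strong assumptions (whitespace tokenization, exactly one differing segment on each side), not Brown's (2001) actual algorithm; `differing_segments` and `learn_category` are hypothetical names.

```python
from difflib import SequenceMatcher


def differing_segments(sent_a, sent_b):
    """Return the (a_part, b_part) word spans where two sentences differ."""
    a, b = sent_a.split(), sent_b.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    segments = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            segments.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return segments


def learn_category(pair_a, pair_b):
    """Pair up the differing source and target segments of two examples.

    Returns a list of (source, target) members, or None when the two
    examples do not differ in exactly one segment on each side.
    """
    (src_a, tgt_a), (src_b, tgt_b) = pair_a, pair_b
    src_diff = differing_segments(src_a, src_b)
    tgt_diff = differing_segments(tgt_a, tgt_b)
    if len(src_diff) != 1 or len(tgt_diff) != 1:
        return None
    (s1, s2), (t1, t2) = src_diff[0], tgt_diff[0]
    return [(s1, t1), (s2, t2)]


category = learn_category(
    ("The doctor treated the patient .", "El médico trató al paciente."),
    ("The doctor examined the patient .", "El médico examinó al paciente."),
)
print(category)
```

For the two pairs above, the learned members correspond to the English and Spanish parts of CAT1.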

Alignment

With no other knowledge, we have to rely on what is shared between multiple pairs of sentences.
- I haven't done the task that you assigned me.
  Je n'ai pas fait le tâche que tu m'as confié.
- I haven't opened the present that you sent me.
  Je n'ai pas ouvert le cadeau que tu m'as envoyé.
An additional helpful resource is a bilingual dictionary, even if incomplete.
- task: tâche
- you: tu
- I: je
- that: que
- do (does, did, done): faire (fait, AVOIR fait, fait)
It is possible to learn a primitive bilingual dictionary from the corpus.

Recombination

Given the target sentence fragments from the matched translation pairs, assemble the fragments into the complete target sentence.
But how do we assemble the target fragments?
Overlapping words in different fragments can help to align those fragments. See "que" in this example.
The positions of the fragments in the target sentences of the translation pairs can give a clue about their position in the final target sentence, especially for long fragments, even when this position is different from the corresponding source language position.
`S_r`: The authorities can't explain where it went.
`S_1`: I don't know where it went.
`T_1`: የት እንደሄደ አላውቅም።
`S_2`: She didn't say where it went.
`T_2`: የት እንደሄደ አልነገረችም።
At the boundaries between fragments there may be errors due to distinctions made in the target language but not made in the source language (boundary friction); see the translation of "the" in this example, which in French varies with the gender of the noun.
To deal with boundary friction, we can use information about which words belong together in the target language, e.g., articles go with nouns in French.

Open-source resource for EBMT

CMU-EBMT (Brown, 2011)

Statistical Machine Translation: the noisy channel model

The noisy channel (source-channel) approach (familiar from speech recognition, used in the IBM models): target language sentence (T) as a hidden "source" for the source language input sentence (S)

`hat T` `= text{argmax}_T\ P(T|S)`

`= text{argmax}_T\ (P(T)\ P(S|T)) / (P(S))`

`= text{argmax}_T\ P(T)\ P(S|T)`
The language model handles `P(T)`, the translation model handles `P(S|T)`.
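A minimal sketch of the decision rule, assuming a small explicit candidate list and made-up probability tables (real systems search a very large space instead of enumerating candidates). The function name and all numbers are illustrative only.

```python
def noisy_channel_decode(source, candidates, lm_prob, tm_prob):
    """Return the candidate T that maximizes P(T) * P(S|T)."""
    return max(candidates, key=lambda t: lm_prob[t] * tm_prob[(source, t)])


# Made-up probabilities, for illustration only.
lm_prob = {"le livre": 0.02, "livre le": 0.0001}  # P(T)
tm_prob = {  # P(S|T): the translation model alone cannot separate the two
    ("the book", "le livre"): 0.3,
    ("the book", "livre le"): 0.3,
}

best = noisy_channel_decode("the book", ["livre le", "le livre"], lm_prob, tm_prob)
print(best)
```

Here the translation model scores both word orders equally, so the language model decides between them.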

SMT training: the target language model

The target language model represents the simple statistical properties of the target language, independent of the source language: `P(T)`
Because `P(T|S)` depends on `P(T)` as well as `P(S|T)`, the language model can correct for errors in the translation model.
n-gram language models
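A bigram model is the simplest useful n-gram language model. The sketch below assumes whitespace tokenization and add-one smoothing, chosen here only to keep the example short; the toy training sentences are invented.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    vocabulary = {"</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocabulary.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocabulary)


def sentence_probability(sentence, model):
    """P(T) as a product of add-one smoothed bigram probabilities."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        probability *= (bigrams[(previous, word)] + 1) / (contexts[previous] + vocab_size)
    return probability


model = train_bigram_model(["le livre", "le médecin", "un livre"])
print(sentence_probability("le livre", model))
print(sentence_probability("livre le", model))
```

The model assigns the attested word order a higher probability than the reversed one, which is the kind of preference the language model contributes to decoding.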

SMT: IBM Model 3 parameters

Translation probabilities
- For each target language word t_i, the probability that it corresponds to a source-language word s_j: `tau(s_j|t_i)`. (Note: translation probabilities are represented with t in Brown et al. (1993) and some other SMT papers.)
Dealing with differences in word order
- Distortion probabilities: probability that a source-language word will appear in position p_s given position p_t for the corresponding target-language word and source and target sentence lengths l and m: `d(p_s|p_t,l,m)`
Dealing with differences in number of words
- Each target-language word has a fertility, a probability associated with each number φ of source-language words it corresponds to: `n(phi|t_i)`
- NULL: a special empty target-language word, aligned with source-language words that have no counterpart in the target sentence

SMT training: alignment

Alignment: association of positions in target and source sentences
Given values for the parameters, we can find the best alignment for any two corresponding sentences in the corpus
Given probabilities for alignments for two sentences, we can estimate the translation, fertility, and distortion parameters
The probability of a source sentence, given a target sentence, is the sum of the probabilities of the different ways of aligning source (S) and target (T) sentences
`P(S|T) = sum_a P(a,S|T)`
A generative "story" for getting a source sentence given a target sentence
Starting with a target sentence T,
1. pick a length m for the source sentence
2. for each position j in the source sentence, pick an associated alignment position in the target sentence `a_j` (based on m, the source language words in the previous positions and the associated alignment positions in the target sentence for the previous positions)
3. for each position in the source sentence, pick a source language word (`s_j`) (based on m, the target sentence, the previous source language words, and the alignment positions for the source language positions up to and including j)
The equation behind the story (Brown et al., 1993, Equation (4))
`a_1^j` represents the positions in the target sentence aligned with positions 1 to j in the source sentence
(1) `P(a,S|T) = P(m|T) prod_{j=1}^{m} P(a_j|a_1^{j-1},s_1^{j-1},m,T) P(s_j|a_1^j,s_1^{j-1},m,T)`

SMT training: IBM Model 1

There are too many parameters to estimate on the right-hand side of Equation (1), so we need to simplify.
IBM Model 1 simplifications:
- First term in (1): `P(m|T)`. The probability of the source language length m is a constant `epsilon`. That is, all lengths have equal probability.
- Second term in (1): `P(a_j|a_1^{j-1},s_1^{j-1},m,T)`. The probability of a source position being associated with a target position depends only on the length of the target sentence l: `1/(l+1)`.
- Third term in (1): `P(s_j|a_1^j,s_1^{j-1},m,T)`. The probability of a source word in a given position depends only on that word and the target word in the corresponding position in the alignment, `t_{a_j}`. This is just the translation probability `tau(s_j|t_{a_j})`.
Model 1 avoids distortion and fertility probabilities. The probability of an alignment and a source sentence is simply a function of the translation probabilities of corresponding words (Equation (5) in Brown et al. (1993)).
(2) `P(a,S|T) = epsilon/(l+1)^m prod_{j=1}^{m} tau(s_j|t_{a_j})`
Summing over all of the possible alignments, we get an equation for the expression we want to maximize (Equations (6) and (15) in Brown et al. (1993)).
(3) `P(S|T) = epsilon/(l+1)^m sum_{a_1 = 0}^l ... sum_{a_m=0}^l prod_{j=1}^{m} tau(s_j|t_{a_j}) = epsilon/(l+1)^m prod_{j=1}^m sum_{i=0}^l tau(s_j|t_i)`
(The last expression represents a shortcut, which we will not be using in the examples on the following pages because the process is less intuitive with the shortcut.)
We are left with the problem of estimating the translation probabilities for all pairs of source and target words so that (3) is maximized.
The "story" behind the mathematics:
1. Given the target sentence T, pick a length for the source sentence S. All reasonable lengths are equally likely.
2. For each position j in S, decide how to connect it to T and what source word to place there. All connections to positions in T are equally likely. (The order of the words in S and T does not affect `P(S|T)`.)
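The equality of the two right-hand sides of Equation (3) can be checked numerically. The sketch below assumes that position 0 of the target list holds NULL and uses made-up `tau` values; the function names are hypothetical.

```python
from itertools import product
from math import isclose, prod


def likelihood_by_enumeration(source, target, tau, epsilon=1.0):
    """P(S|T) as an explicit sum over all (l+1)^m alignments."""
    l, m = len(target) - 1, len(source)  # target[0] is NULL
    total = 0.0
    for alignment in product(range(l + 1), repeat=m):
        total += prod(tau[(s, target[i])] for s, i in zip(source, alignment))
    return epsilon / (l + 1) ** m * total


def likelihood_by_shortcut(source, target, tau, epsilon=1.0):
    """P(S|T) as a product over source positions of sums over target positions."""
    l, m = len(target) - 1, len(source)
    return epsilon / (l + 1) ** m * prod(
        sum(tau[(s, t)] for t in target) for s in source
    )


source = ["the", "book"]
target = ["NULL", "le", "livre"]
tau = {  # made-up values, for illustration only
    ("the", "NULL"): 0.2, ("the", "le"): 0.7, ("the", "livre"): 0.1,
    ("book", "NULL"): 0.05, ("book", "le"): 0.15, ("book", "livre"): 0.6,
}
slow = likelihood_by_enumeration(source, target, tau)
fast = likelihood_by_shortcut(source, target, tau)
print(slow, fast, isclose(slow, fast))
```

The enumeration visits (l+1)^m alignments, while the shortcut needs only m sums of l+1 terms, which is why the factored form matters for training on real sentences.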

SMT: estimating translation probabilities

There is no analytical solution to the problem of maximizing the left-hand side of Equation (3) (see Brown et al. (1993), p. 271).
An alternative is an iterative solution, using expectation maximization (EM).
Using EM, we alternate updating the translation probabilities and the probabilities of each alignment until the translation probabilities converge.
The E-step: estimating the probabilities of alignments
- The probability of an alignment is the probability of the alignment and the source sentence normalized by the probabilities of all possible alignments
  `P(a_i|S,T) = (P(a_i,S,T)) / (P(S,T)) = (P(a_i,S|T) P(T)) / (P(S|T) P(T)) = (P(a_i,S|T)) / (sum_j P(a_j,S|T))`
- From Equation (2), we can calculate the expression on the right.
The M-step: re-estimating the translation probabilities
- We make use of the notion of fractional counts, the number of times we expect to see a given source word and target word together in the corpus.
  `c(s,t)`: the fractional count for words s and t
  `c(s,t;a)`: the number of times s and t are aligned in alignment a
  `c(s,t) = sum_sigma^Sigma sum_k P(a^k|S_sigma,T_sigma) * c(s,t;a^k)`
  That is, we sum over all of the alignments in which the two words occur, weighting by the alignment probability.
- Then to get the translation probabilities, we normalize the fractional counts by the total fractional counts over all source words for the target word.
  `tau(s|t) = (c(s,t)) / (sum_j c(s_j,t))`

Estimating Model 1 probabilities: an example

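Using expectation maximization to estimate the translation (τ) parameters; a very simple example (s: English, t: French)
The corpus consists of two sentence pairs: book / livre and the book / le livre
Simplifications: assume (1) all source words align to some target word (there is no NULL target word), with alignments restricted to one-to-one, and (2) alignment probabilities do not depend on sentence length
This leaves three alignments: alignment 1 (book-livre) for the first pair, and alignment 2 (the-le, book-livre) and alignment 3 (the-livre, book-le) for the second pair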
Initialization of translation probabilities
`tau(text(book)|text(livre)) = .5, tau(text(the)|text(livre)) = .5, tau(text(book)|text(le)) = .5, tau(text(the)|text(le)) = .5`
Iteration 1
- `P(1,text(book)|text(livre)) = .5, P(2,text(the book)|text(le livre)) = .5 * .5 = .25, P(3,text(the book)|text(le livre)) = .5 * .5 = .25`
- `P(1|text(book),text(livre)) = .5 / .5 = 1, P(2|text(the book),text(le livre)) = .25 / .5 = .5, P(3|text(the book),text(le livre)) = .25 / .5 = .5`
- `c(text(book),text(livre)) = 1 + .5 = 1.5, c(text(the),text(livre)) = .5, c(text(book),text(le)) = .5, c(text(the),text(le)) = .5`
  `tau(text(book)|text(livre)) = 1.5/2 = .75, tau(text(the)|text(livre)) = .5/2 = .25, tau(text(book)|text(le)) = .5 / 1 = .5, tau(text(the)|text(le)) = .5 / 1 = .5`
Iteration 2
- `P(1,text(book)|text(livre)) = .75, P(2,text(the book)|text(le livre)) = .5 * .75 = .375, P(3,text(the book)|text(le livre)) = .25 * .5 = .125`
- `P(1|text(book),text(livre)) = .75 / .75 = 1, P(2|text(the book),text(le livre)) = .375 / .5 = .75, P(3|text(the book),text(le livre)) = .125 / .5 = .25`
- `c(text(book),text(livre)) = 1 + .75 = 1.75, c(text(the),text(livre)) = .25, c(text(book),text(le)) = .25, c(text(the),text(le)) = .75`
  `tau(text(book)|text(livre)) = 1.75/2 = .875, tau(text(the)|text(livre)) = .25/2 = .125, tau(text(book)|text(le)) = .25 / 1 = .25, tau(text(the)|text(le)) = .75 / 1 = .75`
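The two iterations above can be reproduced with a short script. It is a sketch that bakes in the simplifications of this example (a two-pair corpus of book / livre and the book / le livre, one-to-one alignments only, no NULL word, no length term), so it is not a general Model 1 trainer; `em_iteration` and `run_em` are hypothetical names.

```python
from collections import defaultdict
from itertools import permutations
from math import prod


def em_iteration(corpus, tau):
    """One EM pass over the corpus; returns re-estimated tau(s|t)."""
    counts = defaultdict(float)  # fractional counts c(s, t)
    for source, target in corpus:
        # E-step: score every one-to-one alignment by the product of its
        # translation probabilities, then normalize over the alignments.
        alignments = list(permutations(range(len(target)), len(source)))
        joint = [
            prod(tau[(s, target[i])] for s, i in zip(source, alignment))
            for alignment in alignments
        ]
        total = sum(joint)
        for alignment, p in zip(alignments, joint):
            for s, i in zip(source, alignment):
                counts[(s, target[i])] += p / total
    # M-step: normalize the counts over all source words for each target word.
    totals = defaultdict(float)
    for (s, t), c in counts.items():
        totals[t] += c
    return {(s, t): c / totals[t] for (s, t), c in counts.items()}


def run_em(corpus, tau, iterations):
    """Return the list of tau tables after each iteration."""
    history = []
    for _ in range(iterations):
        tau = em_iteration(corpus, tau)
        history.append(tau)
    return history


corpus = [(["book"], ["livre"]), (["the", "book"], ["le", "livre"])]
initial = {(s, t): 0.5 for s in ("the", "book") for t in ("le", "livre")}
history = run_em(corpus, initial, 2)
for number, tau in enumerate(history, 1):
    print(number, sorted(tau.items()))
```

Running more iterations pushes `tau(book|livre)` and `tau(the|le)` further toward 1, because the one-word pair book / livre keeps pulling probability toward alignment 2.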

21st century SMT

Log-linear model usually replaces the noisy channel model
Phrases, rather than words, are the basic units
Syntax may be incorporated, either in the form of a parser for the source language or as a synchronous grammar that is learned from the corpus

Log-linear (maximum entropy) models

In the noisy channel approach, we decompose `P(T|S)` into a target language model and a (reverse) translation model, but this is inflexible because it doesn't allow us to include arbitrary other functions of the source and target, for example, a translation model in the other direction, `P(T|S)`.
Furthermore, the noisy channel approach does not permit us to weight the different contributions of the language model and the translation model.
An alternative: the maximum entropy or log-linear approach (Berger et al., 1996; Och & Ney, 2002)
We can model `P(T|S)` directly using a linear combination of arbitrary functions (feature functions) of `S` and `T`, as well as in some models other variables:
`sum_(m=1)^M lambda_m h_m(T,S)`
Since the feature functions can be almost anything we like, they may return any sorts of values, so the sum may also take on any value, positive or negative. An exponential function can make it positive and normalizing by the sum over all possible values of `T` constrains it to be between 0 and 1:
(4) `P(T|S) = {text(exp) [sum_(m=1)^M lambda_m h_m(T,S)]} / (sum_(T prime) text(exp) [sum_(m=1)^M lambda_m h_m(T prime,S)]) = P_{lambda_1^M}(T|S)`
Because the denominator of (4) does not depend on `T` and the exponential function is monotonic, we can ignore both in the decision rule:
(5) `hat T = text{argmax}_T P(T|S) = text{argmax}_T sum_(m=1)^M lambda_m h_m(T,S)`
Given a parallel corpus of sentence pairs `T_sigma, S_sigma`, we want model parameters `lambda_m` that satisfy
(6) `hat lambda_1^M = text{argmax}_(lambda_1^M) {sum_sigma^Sigma log P_{lambda_1^M}(T_sigma|S_sigma)}`
Model parameters must be estimated using numerical methods. For one algorithm, see Berger et al., 1996. For a discussion of and reference to another, see Och and Ney, 2002.
Feature functions can also be functions of other variables, for example, the alignment between `S` and `T`
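Equations (4) and (5) can be sketched in a few lines. The sketch assumes a small explicit candidate list in place of the sum over all possible `T`, and uses two invented feature functions (a log language model score and a log reverse translation model score) with made-up values. With both weights set to 1, the decision rule coincides with the noisy channel rule.

```python
from math import exp, log


def loglinear_scores(source, candidates, features, weights):
    """Weighted feature sum for each candidate T (the argument of exp in (4))."""
    return {
        t: sum(w * h(t, source) for h, w in zip(features, weights))
        for t in candidates
    }


def loglinear_distribution(source, candidates, features, weights):
    """P(T|S) as in (4), normalized over the given candidates only."""
    scores = loglinear_scores(source, candidates, features, weights)
    normalizer = sum(exp(score) for score in scores.values())
    return {t: exp(score) / normalizer for t, score in scores.items()}


# Made-up model scores, for illustration only.
LM = {"le livre": 0.02, "livre le": 0.0001}
TM = {("the book", "le livre"): 0.3, ("the book", "livre le"): 0.3}


def h_language_model(t, s):
    return log(LM[t])


def h_reverse_translation(t, s):
    return log(TM[(s, t)])


features = [h_language_model, h_reverse_translation]
weights = [1.0, 1.0]
candidates = ["livre le", "le livre"]
scores = loglinear_scores("the book", candidates, features, weights)
probabilities = loglinear_distribution("the book", candidates, features, weights)
print(max(scores, key=scores.get), probabilities)
```

Changing the weights, or appending further feature functions to the list, changes the model without changing the decision rule, which is the flexibility the log-linear formulation offers.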

Phrase-based and syntactic SMT

Advantages of dealing with phrases
- Sentences have internal structure
- In translation, phrasal units in one language correspond to phrasal units in the other language
- Languages have massive amounts of partial idiomaticity; the meaning (and translation) of a group of words may not be predictable from the component words and the structure of the phrase
- Working with frequent chunks can save processing time because we can skip the step involving translation of individual words
Approaches
- Learning phrasal units (e.g., Och and Ney)
- Learning synchronous grammar (e.g., Chiang)

Phrases

Alignment

Training word-word alignments in both directions

Combining the results of both alignments: union, intersection, smarter combination

[Figure: word-alignment grid pairing the English words "have not read the book that you lent" with French words including "pas", "livre", "que", "prêté"]

Extracting phrases
- Consistent phrase pair: all words aligned only with one another, not with any phrase-external words
- Estimating phrase translation probabilities
- "Phrases" will not necessarily correspond to linguistic phrases
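Combining two directional alignments and testing a phrase pair for consistency can be sketched as follows. Alignments are represented as sets of (source position, target position) links. The example links are invented, and the function names are hypothetical.

```python
def symmetrize(source_to_target, target_to_source):
    """Return (intersection, union) of two directional word alignments.

    source_to_target holds (source, target) position pairs;
    target_to_source holds (target, source) position pairs.
    """
    forward = set(source_to_target)
    backward = {(s, t) for (t, s) in target_to_source}
    return forward & backward, forward | backward


def is_consistent(alignment, source_span, target_span):
    """True if no link crosses the phrase-pair boundary and one lies inside.

    Spans are inclusive (start, end) position pairs.
    """
    (s_start, s_end), (t_start, t_end) = source_span, target_span
    has_internal_link = False
    for s, t in alignment:
        s_inside = s_start <= s <= s_end
        t_inside = t_start <= t <= t_end
        if s_inside != t_inside:
            return False  # a word in the phrase is aligned outside it
        has_internal_link = has_internal_link or s_inside
    return has_internal_link


# Invented links for: the book that you lent / le livre que tu as prêté
forward = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 5)}
backward = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 4)}
intersection, union = symmetrize(forward, backward)
print(sorted(intersection))
print(sorted(union))
print(is_consistent(union, (4, 4), (5, 5)), is_consistent(union, (4, 4), (4, 5)))
```

The intersection keeps only links found in both directions, while the union keeps every link. Under the union, "lent" is linked to both "as" and "prêté", so the pair lent / prêté is not consistent but lent / as prêté is.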

SMT: (phrase-based) decoding

We can't consider all possible target sentences (finding the optimal translation by exhaustive search is NP-complete), so some sort of heuristic is required.
Decoding: search for the best target sentence
A*: best-first search guided by a combination of the current cost and future cost.
For the target sentence consider only phrases which are possible translations of possible source-language phrases (have significant translation probabilities).
Search state representation: target words generated (in sequence), source words covered, cost accumulated
Cost: combination of current cost and estimated future cost
- For each sub-sentence hypothesis, we calculate a probability based on the learned translation, distortion, and fertility parameters and the language model; this is the current cost.
- From each hypothesis, we generate longer hypotheses, estimating the future cost for each: the probability of producing the target that maximizes the translation probabilities for the source words and the target language model (distortion is ignored).
Initial search state: no source words, no target words.
Expand a state (generate new hypotheses): select source-language phrases that could yield target-language phrases in the next output position.
Hypotheses that overlap are combined, which reduces the search space.
Beam search: hypotheses are pruned based on their cost (unlike in actual A*, where no hypotheses are discarded).
Multiple queues are maintained, one for each number of source words covered in the hypothesis; beam search within each queue

Decoding: a partial example

SMT: evaluation

Translation output candidate and translation references
Precision vs. recall measures
mWER: multi-reference word error rate: edit distance of the candidate from the most similar reference (percentage of words to be inserted, deleted or substituted in order to obtain reference)
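`WER = (S+D+I)/N`, where S, D, and I are the numbers of substitutions, deletions, and insertions, and N is the number of words in the reference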
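A sketch of mWER using word-level edit distance. It assumes whitespace tokenization and takes the lowest error rate over the references as the score for "the most similar reference"; the example sentences are invented.

```python
def edit_distance(candidate, reference):
    """Minimum number of word substitutions, deletions, and insertions."""
    previous = list(range(len(reference) + 1))
    for i, c in enumerate(candidate, 1):
        current = [i]
        for j, r in enumerate(reference, 1):
            current.append(min(
                previous[j] + 1,             # delete a candidate word
                current[j - 1] + 1,          # insert a reference word
                previous[j - 1] + (c != r),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def mwer(candidate, references):
    """Lowest word error rate of the candidate against any reference."""
    words = candidate.split()
    return min(
        edit_distance(words, reference.split()) / len(reference.split())
        for reference in references
    )


references = [
    "the doctor examined the patient",
    "the physician examined the patient",
]
print(mwer("the doctor examined patient", references))
```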
BLEU (Papineni et al., 2002)
- Modified precision
- For each n-gram in translation, clip the number of occurrences of the n-gram in the candidate by the maximum number of occurrences of the n-gram in the references, and divide this number by the total number of n-grams in the candidate.
- For each n, an n-gram score is calculated for the whole corpus, using the clipped n-gram counts for each sentence and the total number of n-grams in the corpus of candidates.
- N-gram scores are combined using a weighted average of the logarithms of the scores for each n: the geometric mean.
- Candidate sentences much shorter than the shortest reference sentence are penalized; the corpus score is multiplied by a brevity penalty factor.
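The steps above can be sketched as a small corpus-level scorer. It assumes pre-tokenized input, uniform weights over n = 1..4, no smoothing, and a brevity penalty based on the reference length closest to each candidate's length, which is one common convention. It is an illustration, not a replacement for a standard BLEU implementation.

```python
from collections import Counter
from math import exp, log


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidates, reference_sets, n):
    """Clipped n-gram matches over total candidate n-grams, corpus-wide."""
    clipped, total = 0, 0
    for candidate, references in zip(candidates, reference_sets):
        candidate_counts = ngrams(candidate, n)
        max_reference = Counter()
        for reference in references:
            for gram, count in ngrams(reference, n).items():
                max_reference[gram] = max(max_reference[gram], count)
        clipped += sum(min(count, max_reference[gram])
                       for gram, count in candidate_counts.items())
        total += sum(candidate_counts.values())
    return clipped / total if total else 0.0


def closest_length(candidate, references):
    # Ties are broken in favor of the shorter reference.
    return min((len(r) for r in references),
               key=lambda length: (abs(length - len(candidate)), length))


def bleu(candidates, reference_sets, max_n=4):
    precisions = [modified_precision(candidates, reference_sets, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any precision is zero
    geometric_mean = exp(sum(log(p) for p in precisions) / max_n)
    c = sum(len(candidate) for candidate in candidates)
    r = sum(closest_length(candidate, references)
            for candidate, references in zip(candidates, reference_sets))
    brevity_penalty = 1.0 if c > r else exp(1 - r / c)
    return brevity_penalty * geometric_mean


references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
candidate = "the the the the the the the".split()
print(modified_precision([candidate], [references], 1))
print(bleu([references[0]], [references]))
```

The first call shows clipping at work: "the" occurs seven times in the candidate but at most twice in any single reference, so only two occurrences count.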

Open-source software toolkits for SMT

Munteanu and Marcu (2005)
