The problem
- For most language pairs, there is inadequate parallel data to train SMT systems
- Comparable corpora: corpora in two languages that are likely to include sentences that are translations of each other
- Goal: use the comparable corpora to locate translated sentence pairs
- Train an SMT system on a small amount of parallel data
- For each L1 sentence in a large comparable corpus pair, translate the sentence to L2 using the MT system.
- Use the translated L2 sentences as IR queries for the L2 side of the comparable corpus pair, restricting the search to documents within a time window around the date of the L1 document.
- Filter the returned L2 sentence candidates using word error rate (WER) and translation edit rate (TER), comparing the translated L2 sentence with each candidate L2 sentence in the corpus.
- Trim the ends of L2 candidates using WER.
- Add the selected L2 sentences and their corresponding L1 sentences to the training set.
- BLEU score on the test set improves by almost 2.5 points.
- Increasing the size of the training set for the original SMT system does not affect performance.
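
The date-window and WER filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the method as implemented in the work summarized here: the function names (`wer`, `select_candidate`), the 3-day window, and the 0.3 WER threshold are all hypothetical, the IR retrieval step is replaced by a plain list of dated candidates, WER is normalized by the candidate's length (an assumption), and TER (which also allows block shifts) and the tail-trimming step are not covered.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # drop a reference word
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def select_candidate(
    translated: str,
    query_date: date,
    candidates: List[Tuple[date, str]],
    window_days: int = 3,   # hypothetical window size
    max_wer: float = 0.3,   # hypothetical threshold
) -> Optional[str]:
    """Return the in-window candidate with the lowest WER, if below max_wer."""
    window = timedelta(days=window_days)
    best_sentence, best_score = None, max_wer
    for doc_date, sentence in candidates:
        # Restrict the search to documents dated near the L1 document.
        if abs(doc_date - query_date) > window:
            continue
        score = wer(translated, sentence)
        if score <= best_score:
            best_sentence, best_score = sentence, score
    return best_sentence


if __name__ == "__main__":
    candidates = [
        (date(2009, 1, 1), "the dog sat on the mat"),
        (date(2009, 1, 20), "the cat sat on the mat"),   # outside the window
        (date(2009, 1, 2), "stocks fell sharply today"),  # unrelated sentence
    ]
    print(select_candidate("the cat sat on the mat", date(2009, 1, 2), candidates))
```

Note that in this toy example the exact match is rejected because its document falls outside the date window, so the near match from the previous day is selected instead. The retained sentence would then be paired with the original L1 sentence and added to the training set.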