B551
http://www.cs.indiana.edu/~gasser/B551_F09/
October 13, 2009
Sophisticated information retrieval
- What are the major isotopes of molybdenum?
- What's wrong the health reform bill being voted on today in Congress?
- What's the gist of Paul Krugman's article in the October 12th New York Times?
- Is there an internet cafe in the city of Aksu, China?
- This website is in Arabic; translate it for me.
Some properties of natural language
- A text to look at
- Language is often ambiguous.
- Related words may be separated by arbitrary numbers of other words.
- Who did you tell me you were going to recommend that she talk to?
- The mapping between form and meaning is very complex.
- Two sentences very similar in meaning can be superficially (at least) very different.
- Lois hugged Clark.
- Clark was embraced by Lois.
- Two sentences very different in meaning can be superficially very similar.
- Clark meets Lois.
- Clark respects Lois.
- The relationship between speech and text is very complex.
- Languages differ considerably in
- how they divide up reality
- what they lexicalize
- Malay: berseluar 'wear pants' (from seluar 'pants')
- Amharic: ተግደረደረ tǝgdǝrǝddǝre 'pretend to have had enough of something when offered more'
- how they allocate structural properties to words (morphology) or sentences (syntax)
- Amharic: የምትከባበሩልኝ yǝmmɨttɨkkǝbabbǝrullɨñ ‘[you(pl.)] who respect each other for me’ (kbr 'respect')
- A multilingual text; the text aligned
- Language has both rule-governed and statistical properties.
Dealing with structural complexity: grammars
- Grammars: representations of the possible structural relationships between words (or morphemes within words)
- Dependency grammars: relate words to each other with arcs with structural labels
- A dependency grammar analysis
- A dependency grammar
- A list of possible arc labels.
-
- The lexicon: repository of words, each with associated grammatical information
- ran
- Must have an outgoing sbj arc
- May have one or more outgoing pmod arcs
- The dependent on the sbj arc precedes self
- The dependent on any pmod arc follows self
- runs
- ...
- The dependent on the sbj arc must have the properties [person=3, number=singular]
- the
- Must have an incoming det arc
Parsing
- Finding an appropriate analysis for a sentence
- Parsing as search
- Dependency parsing as constraint satisfaction
- Create a variable for each pair of words; the domain is the set of arc labels (with
None as one possible value).
- Find the entries in the lexicon for each word in the sentence.
- Instantiate constraints based on what's in the lexical entries.
- Run constraint satisfaction, returning the best analysis or all analyses satisfying the constraints.
Dealing with statistical properties of language: language models
- What is the probability that a sequence of words will occur?
- Given a sequence of words, what is the probability that a particular word will occur next?
- The probability of a sequence of words (using the chain rule of probability)
`P(w_1^n) = P(w_1)P(w_2|w_1)P(w_3|w_1^2)...P(w_n|w_1^(n-1))`
`= prod_(k=1)^n P(w_k|w_1^(k-1))`
- We can approximate the product by considering only subsequences of a particular length N.
`P(w_n|w_1^(n-1)) ~~ P(w_n|w_(n-N+1)^(n-1))`
- For N=2, get get
`P(w_1^n) ~~ prod_(k=1)^(n) P(w_k|w_(k-1))`
- Estimating the probabilities from counts (C) in a corpus, for N=2
`P(w_n|w_(n-1)) = (C(w_(n-1)w_n)) / (C(w_(n-1)))`
- Using a language model in translation
- Nunca sabrá quien barrió sus hojas. →
Never he will know who raked his leaves.
He never will know who raked his leaves.
He will never know who raked his leaves.
He will know never who raked his leaves.
Dealing with ambiguity
- Is there a way to handle ambiguity without sophisticated linguistic information?
- Word sense disambiguation
- The judge threw out the suit against the doctor.
- He prefers to wear a suit made of wool.
- The next player has to play a card of the same suit.
- The new degree program won't suit all computer science students.
- Words can be disambiguated by their context.
But what features of the context are relevant?
- Some possible binary features
- The word occurs in the same window of N words with the word judge.
- The word occurs in the same window of N words with the word card.
- The word follows the word won't.
- The word precedes the word against.
- The word follows a noun.
- Given a set of features, each occurrence of the word is characterized by a feature vector.
- Given a set of examples of the word labeled for their senses, we can train a classifier
to assign instances of the word to senses.
-
The classifier learns to weight each of the features in terms of which sense(s) it favors.
- The classifier can then be applied to unlabeled test data.