Homework 6: N-Grams

Email me your solution (one file with your program and results) by Sunday, April 27, 11:59pm.

What you have to do

Make sure you have installed NLTK (version 0.9.1 or later).

Using the Brown corpus and trigrams, estimate the log probability of the sentence "it's the end of the world as we know it." Treat the period (full stop) at the end as a word, and add a beginning-of-sentence tag (<s>) at the start of the sentence. The probability of the sentence is then the probability of the beginning-of-sentence tag, times the bigram probability of the first word given the beginning-of-sentence tag, times all of the trigram probabilities:

P(<s>) P("it's"|<s>) P("the"|<s>,"it's") ... P("."|"know","it")
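Because the assignment asks for a log probability, the product above becomes a sum of logs. The following is a minimal sketch of that decomposition, not a solution: the names sentence_events, log_prob, and cond_prob are hypothetical, and cond_prob stands for whatever estimator you build from the corpus counts.

```python
import math

def sentence_events(tokens, bos="<s>"):
    """Return (word, context) pairs for a sentence.

    The context holds up to two preceding tokens, so the first pair is
    the bare beginning-of-sentence tag, the second is a bigram event,
    and every later pair is a trigram event.
    """
    padded = [bos] + list(tokens)
    events = []
    for i, word in enumerate(padded):
        context = tuple(padded[max(0, i - 2):i])
        events.append((word, context))
    return events

def log_prob(tokens, cond_prob):
    """Sum the log of cond_prob(word, context) over all events.

    cond_prob is a placeholder for your own probability estimate.
    """
    return sum(math.log(cond_prob(w, c)) for w, c in sentence_events(tokens))
```

Summing logs rather than multiplying probabilities also avoids floating-point underflow when the individual probabilities are small.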

Use a simple backoff approach to sparse data. If a trigram occurs fewer than 3 times in the corpus, instead use the bigram you get by dropping the first word of the trigram. If that bigram occurs fewer than 3 times, use the count for the last word alone. If the last word does not occur in the corpus at all, use a count of 1.
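One way to read the backoff rule is sketched below on a toy corpus. This is an illustration under stated assumptions, not the required implementation: ngram_counts, backoff, and MIN_COUNT are hypothetical names, and the sketch only selects which n-gram and count to use. Turning that count into a probability is left to you.

```python
from collections import Counter

MIN_COUNT = 3  # threshold from the assignment text

def ngram_counts(sentences, bos="<s>"):
    """Count all unigrams, bigrams, and trigrams, keyed by tuple."""
    counts = Counter()
    for sent in sentences:
        padded = [bos] + list(sent)
        for n in (1, 2, 3):
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
    return counts

def backoff(trigram, counts, min_count=MIN_COUNT):
    """Return (ngram_used, count) following the backoff rule."""
    trigram = tuple(trigram)
    if counts[trigram] >= min_count:
        return trigram, counts[trigram]
    bigram = trigram[1:]          # drop the first word
    if counts[bigram] >= min_count:
        return bigram, counts[bigram]
    unigram = trigram[2:]         # last word only
    return unigram, max(counts[unigram], 1)  # unseen word: use 1

# Toy data standing in for brown.sents()
toy = [["a", "b", "c"]] * 3 + [["x", "b", "d"]]
counts = ngram_counts(toy)
```

With the real corpus you would pass brown.sents() to ngram_counts instead of the toy list; counting everything in one pass is much faster than calling count() on the corpus view for each n-gram.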

To use the Brown corpus, start with

from nltk.corpus import brown

Then, to access all of the words, or all of the sentences (each a list of words), call brown.words() or brown.sents(). Note that these do not actually return lists; they return "views" on the corpus text. A view supports len(), count(), and indexing and slicing in the conventional way, but does not require the entire corpus to be stored in memory at once.

>>> WORDS = brown.words()
>>> len(WORDS)
1161192
>>> WORDS.count('liver')
16
>>> WORDS[1000:1005]
['race', ',', 'a', 'top', 'official']
