Email me your solution (one Python file, one text file describing your results) by Sunday, March 2, 11:59pm.
In unsupervised word sense disambiguation, we attempt to disambiguate words using information contained in dictionaries or thesauruses. The Lesk algorithm associates a "signature" with each sense of a word; that is, a bag of words which, in the simple version of the algorithm, comes from the sense's definition (and possibly also its examples) in its dictionary or thesaurus entry. Then the context of an instance of the word (its sentence or a window of words surrounding it) is compared with each of the signatures, and the sense whose signature overlaps most with the context is selected. Normally stop words are eliminated from both signatures and contexts.
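The simple version of the algorithm can be sketched roughly as follows. This is only an illustration under assumptions, not the expected solution: the stop word list is a tiny toy list, the sense labels are made up, the glosses are shortened illustrative text, and the names bag_of_words and simple_lesk are hypothetical.

```python
# Toy stop word list (an assumption); a real solution would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "for",
              "his", "he", "is", "was"}

def bag_of_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip('.,;:!?"()').lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}

def simple_lesk(context, signatures):
    """Return the sense whose signature shares the most words with the context.

    signatures maps a sense label to its definition/example text.
    On a tie, the sense listed first wins.
    """
    context_bag = bag_of_words(context)
    best_sense, best_overlap = None, -1
    for sense, text in signatures.items():
        overlap = len(context_bag & bag_of_words(text))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

# Hypothetical sense labels with shortened, illustrative glosses.
signatures = {
    "suit_clothing": 'a set of garments for outerwear; "they buried him in his best suit"',
    "suit_lawsuit": "a comprehensive term for any proceeding in a court of law",
}

chosen = simple_lesk("He wore his best suit and matching garments to the wedding.",
                     signatures)
print(chosen)
```

Note that in this sketch the target word itself ("suit") counts toward the overlap whenever it appears in a signature; whether to exclude it is a design choice left open here.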
An improved version of the algorithm, the corpus Lesk algorithm, relies on a number of labeled senses from a corpus. The signatures for each sense are augmented with the contexts (sentences or context windows) of that sense from the corpus. Then for an instance of an ambiguous word, the algorithm picks the sense with the greatest overlap between the instance's context and the sense's augmented signature. Instead of simply dropping stop words in the calculation of overlap, corpus Lesk weights each of the overlapping words by its inverse document frequency (IDF), that is, the log of the ratio of the number of documents to the number of documents containing the word. The IDF for words usually found on stop word lists should be low because these words tend to occur in many documents.
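To make the IDF weighting concrete, here is a minimal sketch on toy data. The three tiny "documents" and the two augmented signatures are invented for illustration, the function names are hypothetical, and giving a weight of 0 to words that never occur in the corpus (default_idf) is an assumption rather than part of the algorithm as described above.

```python
import math

def idf_table(documents):
    """Map each word to log(N / df), where N is the number of documents
    and df is the number of documents that contain the word."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

def weighted_overlap(context, signature, idf, default_idf=0.0):
    """Sum the IDF of every distinct word shared by context and signature."""
    shared = set(context) & set(signature)
    return sum(idf.get(word, default_idf) for word in shared)

def corpus_lesk(context, signatures, idf):
    """Pick the sense whose signature has the largest IDF-weighted overlap."""
    return max(signatures,
               key=lambda sense: weighted_overlap(context, signatures[sense], idf))

# Toy corpus: each document is a list of lowercased words.
documents = [
    ["the", "court", "dismissed", "the", "suit"],
    ["the", "tailor", "pressed", "the", "suit"],
    ["the", "judge", "left", "the", "court"],
]
idf = idf_table(documents)

# Hypothetical signatures, already augmented with words from labeled contexts.
signatures = {
    "suit_clothing": ["garments", "jacket", "trousers", "the", "tailor", "pressed"],
    "suit_lawsuit": ["proceeding", "court", "law", "the", "judge", "dismissed"],
}
context = ["the", "tailor", "altered", "the", "jacket"]
chosen = corpus_lesk(context, signatures, idf)
print(chosen)
```

Because "the" occurs in all three toy documents, its IDF is log(3/3) = 0, so it contributes nothing to either overlap even though it is never explicitly removed.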
Make sure you have installed NLTK.
We will be using the Brown corpus and WordNet modules in NLTK.
This file gives you the import statements you will need to include, as well as two sequences that will help you.
The Brown corpus contains a wide variety of written English, divided into 15 documents.
Once you have imported the corpus (from nltk.corpus import brown),
you can get the names of its documents with brown.items.
There are several ways to access the elements in the corpus; for our purposes, the following are the relevant ones.
Each can take a document name or a list of document names as its argument;
if none is given, it returns elements from the entire corpus.
brown.words(): returns a list of words
brown.sents(): returns a list of lists of words
brown.paras(): returns a list of lists of lists of words
Notice that punctuation marks are treated as words.
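The nesting of these three return values can be illustrated without the corpus itself. The sketch below uses a small hand-made list shaped like the result of brown.paras(); the data and the helper is_punctuation are assumptions for illustration, and treating a token as punctuation when all of its characters are ASCII punctuation is just one possible rule.

```python
import string

# Toy stand-in shaped like the result of brown.paras():
# a list of paragraphs, each a list of sentences, each a list of word tokens.
paras = [
    [["The", "jury", "met", "."],
     ["It", "adjourned", ",", "then", "returned", "."]],
    [["No", "verdict", "was", "reached", "."]],
]

# Flattening one level gives the shape of brown.sents() ...
sents = [sent for para in paras for sent in para]
# ... and flattening once more gives the shape of brown.words().
words = [word for sent in sents for word in sent]

def is_punctuation(token):
    """True if every character of the token is an ASCII punctuation mark."""
    return all(ch in string.punctuation for ch in token)

# Punctuation marks are tokens too, so filter them out before counting overlap.
content_words = [w for w in words if not is_punctuation(w)]
print(len(sents), len(words), len(content_words))
```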
WordNet is organized in terms of synsets, which represent relatively fine-grained word senses.
Once you have imported WordNet (), you can get the synsets for a noun such as suit like this (use V for verbs and for adjectives and adverbs instead of N): N['suit'].synsets().
Each synset (among many other things) has an associated definition and usually one or more example sentences.
You can get these with synset.gloss.
Note that what gets returned is just a long string, for example,
'a set of garments (usually including a jacket and trousers or skirt) for outerwear all of the same fabric and color; "they buried him in his best suit"'
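Since the gloss is a single string, you will probably want to separate the definition from the quoted examples before building a signature. The following is one possible sketch; the function name split_gloss is hypothetical, and it assumes that examples are always enclosed in double quotes and follow the definition, as in the string above.

```python
import re

def split_gloss(gloss):
    """Split a gloss string into (definition, list of example sentences).

    Assumes every example is wrapped in double quotes.
    """
    examples = re.findall(r'"([^"]*)"', gloss)
    # Remove the quoted examples, then trim leftover separators at the ends.
    definition = re.sub(r'"[^"]*"', "", gloss).strip(" ;")
    return definition, examples

gloss = ('a set of garments (usually including a jacket and trousers or skirt) '
         'for outerwear all of the same fabric and color; '
         '"they buried him in his best suit"')
definition, examples = split_gloss(gloss)
print(definition)
print(examples)
```

Semicolons inside the definition itself are left alone; only the separators at the start or end of the remaining text are trimmed.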