Homework 4: Word sense disambiguation

Email me your solution (one Python file, one text file describing your results) by Sunday, March 2, 11:59pm.

Lesk's algorithm

In unsupervised word sense disambiguation, we attempt to disambiguate words using information contained in dictionaries or thesauruses. The Lesk algorithm associates a "signature" with each sense of a word, that is, a bag of words which in the simple version of the algorithm comes from the sense's definition (and possibly also its examples) in a dictionary or thesaurus entry. The context of an instance of the word (its sentence or a window of words surrounding it) is then compared with each of the signatures, and the sense whose signature overlaps most with the context is selected. Normally stop words are eliminated from both signatures and contexts.
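As a rough illustration of the simple version, here is a minimal sketch in plain Python. The stop word list, the sense names, and the second definition are made up for the example; only the clothing definition is adapted from a real gloss. Ties are broken arbitrarily (the first sense wins), which is an assumption rather than part of the algorithm.

```python
import string

# Tiny illustrative stop word list (an assumption; a real list is much longer).
STOP_WORDS = {"a", "an", "the", "of", "in", "for", "and", "or", "to",
              "his", "he", "on", "is", "was", "all"}

def bag_of_words(text):
    """Lowercase, strip punctuation, and drop stop words; return a set."""
    table = str.maketrans("", "", string.punctuation)
    words = text.lower().translate(table).split()
    return {w for w in words if w not in STOP_WORDS}

def simple_lesk(context, signatures):
    """Return the sense whose signature shares the most words with the context."""
    context_bag = bag_of_words(context)
    return max(signatures, key=lambda sense: len(signatures[sense] & context_bag))

# Hypothetical sense names and a made-up second definition.
signatures = {
    "suit_clothing": bag_of_words(
        "a set of garments for outerwear all of the same fabric and color"),
    "suit_lawsuit": bag_of_words(
        "a legal proceeding brought in a court of law"),
}

print(simple_lesk("He wore a suit of dark fabric.", signatures))
print(simple_lesk("The court dismissed the suit.", signatures))
```

With signatures this small, a single shared word decides the outcome, which is exactly the weakness the corpus version tries to address.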

An improved version of the algorithm, the corpus Lesk algorithm, relies on a number of labeled senses from a corpus. The signatures for each sense are augmented with the contexts (sentences or context windows) of that sense from the corpus. Then for an instance of an ambiguous word, the algorithm picks the sense with the greatest overlap between the instance's context and the sense's augmented signature. Instead of simply dropping stop words in the calculation of overlap, corpus Lesk weights each of the overlapping words by its inverse document frequency (IDF), that is, the log of the ratio of the number of documents to the number of documents containing the word. The IDF for words usually found on stop word lists should be low because these words tend to occur in many documents.
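The IDF weighting can be sketched as follows. The documents here are toy token lists invented for the example, and the function names are hypothetical. The description above does not fix the base of the logarithm; base 2 is assumed here. Words that never appear in any document are given a weight of 0, which is also an assumption.

```python
import math

def idf_table(documents):
    """Map each word to log2(N / df), where df is the number of
    documents (out of N) that contain the word at least once."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):          # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w: math.log2(n_docs / df) for w, df in doc_freq.items()}

def weighted_overlap(context, signature, idf):
    """Sum the IDF of every word shared by the context and the signature.
    Words missing from the IDF table contribute 0 (an assumption)."""
    return sum(idf.get(w, 0.0) for w in set(context) & set(signature))

# Toy "documents" made up for illustration.
documents = [
    ["the", "judge", "heard", "the", "suit"],
    ["the", "tailor", "cut", "the", "suit"],
    ["the", "judge", "wore", "a", "robe"],
    ["the", "court", "was", "full"],
]
idf = idf_table(documents)

context = ["the", "judge", "dismissed", "the", "suit"]
signature = ["the", "judge", "court", "suit"]
print(weighted_overlap(context, signature, idf))
```

Because "the" occurs in every toy document, its IDF is 0 and it adds nothing to the score, so there is no need to remove it explicitly.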

What you have to do

Make sure you have installed NLTK. We will be using the Brown corpus and WordNet modules in NLTK. This file gives you the import statements you will need to include, as well as two sequences that will help you.

The Brown corpus contains a wide variety of written English, divided into 15 documents. Once you have imported the corpus (from nltk.corpus import brown), you can get the names of its documents with brown.items. There are several ways to access the elements in the corpus; for our purposes, the following are the relevant ones. Each can take a document name or a list of document names as its argument; if none is given, it returns elements from the entire corpus.

Notice that punctuation marks are treated as words.

WordNet is organized in terms of synsets, which represent relatively fine-grained word senses. Once you have imported WordNet (), you can get the synsets for a noun such as suit like this (use V for verbs and for adjectives and adverbs instead of N): N['suit'].synsets(). Each synset (among many other things) has an associated definition and usually one or more example sentences. You can get both with synset.gloss. Note that what gets returned is just one long string, for example, 'a set of garments (usually including a jacket and trousers or skirt) for outerwear all of the same fabric and color; "they buried him in his best suit"'.
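Since the gloss is a single string, you will probably want to pull it apart before building a signature. The sketch below works on the example string above with the standard re module only; it does not call NLTK. It assumes that example sentences are the parts in double quotes and that everything else is the definition. The function names and the stop word list are made up for the example.

```python
import re

# Tiny illustrative stop word list (an assumption; use a fuller one in practice).
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "or", "all",
              "in", "his", "him", "they"}

def split_gloss(gloss):
    """Split a gloss string into (definition, list of example sentences).
    Assumes examples are exactly the double-quoted parts of the string."""
    examples = re.findall(r'"([^"]*)"', gloss)
    definition = re.sub(r'"[^"]*"', "", gloss).strip(" ;")
    return definition, examples

def make_signature(text):
    """Lowercase, keep alphabetic runs only (this drops punctuation),
    and remove stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS}

gloss = ('a set of garments (usually including a jacket and trousers or skirt) '
         'for outerwear all of the same fabric and color; '
         '"they buried him in his best suit"')

definition, examples = split_gloss(gloss)
signature = make_signature(definition + " " + " ".join(examples))
print(definition)
print(examples)
print(sorted(signature))
```

Whether to include the example sentences in the signature is up to you; leaving them out only requires passing the definition alone to make_signature.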

  1. Select two or three common senses (synsets) of an ambiguous English noun like suit or chair. Make signatures for these senses from their WordNet glosses, deleting punctuation and stop words.
  2. Find at least 20 sentences in which the noun occurs in one of the selected senses in the Brown corpus. Hand-label the examples with their senses.
  3. For 10 of the sentences, use the simple version of the Lesk algorithm to disambiguate the nouns.
  4. Now use the other sentences from the corpus to augment the signatures for the senses, deleting punctuation, but not stop words.
  5. For the first 10 sentences, use the corpus Lesk algorithm with the augmented signatures to disambiguate the nouns, weighting the words in the overlapping set by their inverse document frequency (IDF). As a crude way to calculate the IDF for a word, you can use the 15 documents within the Brown corpus as documents. For example, the word brain appears in 12 of the 15 documents, so its IDF is log2(15.0/12.0) = 0.3219.
  6. Compare the two algorithms, and discuss what might improve the performance further.
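One possible way to organize the bookkeeping for steps 2 through 6 is sketched below. The labeled sentences, sense names, and predictions are all invented, and the toy data is split at 2 rather than 10 to keep it short. The helper names are hypothetical, and this is only a scaffold, not a solution.

```python
def augment(signatures, labeled_sentences):
    """Return new signatures in which the words of each labeled sentence
    have been added to the signature of its sense. The input is not modified."""
    augmented = {sense: set(words) for sense, words in signatures.items()}
    for words, sense in labeled_sentences:
        augmented[sense].update(words)
    return augmented

def accuracy(predictions, gold):
    """Fraction of predictions that match the hand labels."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy gloss-based signatures and hand-labeled sentences (all made up).
signatures = {"clothing": {"garments", "fabric"}, "lawsuit": {"court", "law"}}
labeled = [
    (["he", "wore", "a", "gray", "suit"], "clothing"),
    (["the", "suit", "was", "dismissed"], "lawsuit"),
    (["his", "suit", "was", "pressed"], "clothing"),
    (["she", "filed", "a", "suit"], "lawsuit"),
]

# Hold out the first sentences for testing; use the rest for augmentation.
held_out, extra = labeled[:2], labeled[2:]
augmented = augment(signatures, extra)

gold = [sense for _, sense in held_out]
predictions = ["clothing", "clothing"]   # placeholder output of a disambiguator
print(accuracy(predictions, gold))
```

Keeping the held-out sentences out of the augmented signatures matters: if a test sentence is also used for augmentation, it will overlap perfectly with its own sense and the comparison will be meaningless.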
