Email me your solution (one file with your grammar and discussion of results) by Sunday, March 23, 11:59pm.
Using regular expressions and a corpus with part-of-speech tags, it is possible to perform some shallow parsing, which may be suitable for some purposes. For example, we could find many simple noun phrases in English sentences by looking for ("chunking on") sequences of zero or one determiner (the, etc.), followed by zero or more adjectives (small, etc.), and a noun. We could find more complex noun phrases by chunking on prepositional phrases (in the street, etc.) and relative clauses (that we ate, etc.) and then chunking on the noun phrases which have these patterns after the noun.
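To make the first pattern above concrete, here is a minimal sketch in plain Python that finds simple noun phrases by matching a regular expression against the sequence of tags. The function name chunk_simple_nps and the example sentence are made up for illustration, and the tags are the ones used in this assignment (DT, JJ, NN, VB).

```python
import re

def chunk_simple_nps(tagged):
    """Return simple noun phrases (as lists of words) from (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN><VB><NN>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Zero or one determiner, zero or more adjectives, then a noun.
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string.count("<", 0, match.start())
        end = tag_string.count("<", 0, match.end())
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

# Hand-tagged English example (illustrative only).
sentence = [("the", "DT"), ("small", "JJ"), ("dog", "NN"),
            ("saw", "VB"), ("cats", "NN")]
print(chunk_simple_nps(sentence))
```

This is only meant to show the idea of matching over tags rather than words; NLTK's chunking grammars do the same kind of thing with less bookkeeping.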
NLTK makes it easy to create chunking grammars. Read 7.1, 7.2, 7.3, and 7.6 from the chapter on chunking in the NLTK book. These sections contain more information about chunking, as well as about how to use NLTK to create a chunking grammar.
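As a rough illustration of what such a grammar looks like, the sketch below chunks simple English noun phrases with nltk.RegexpParser. The sentence is hand-tagged and invented for this example, and the code assumes a recent NLTK release in which trees have a label() method (older releases used a different attribute).

```python
import nltk

# One chunk rule: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

# A hand-tagged English sentence (illustrative, not from the homework files).
tagged = [("the", "DT"), ("small", "JJ"), ("dog", "NN"),
          ("saw", "VB"), ("cats", "NN")]

# parse() returns a tree whose NP subtrees are the chunks that were found.
tree = chunker.parse(tagged)
print(tree)

# Collect the words of each NP chunk from the resulting tree.
np_chunks = [[word for word, tag in subtree.leaves()]
             for subtree in tree.subtrees()
             if subtree.label() == "NP"]
print(np_chunks)
```

A grammar can contain several rules, one per line, so chunks built by earlier rules can be used as building blocks in later ones.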
Make sure you have installed NLTK (version 0.9.1 or later). Download these two files, which include some tagged Swahili sentences and a little code to help you.
The tags in sentences are NN (noun), JJ (adjective), DT (determiner), VB (verb, non-relative), and VBR (relative verb).
The file hw5.py includes a function read(), which reads in sentences from a file like the included file sentences and returns a list of pairs (tuples), one for each sentence.
The first element of each pair is a list of word/tag tuples, one for each word in the sentence; the second element is the English gloss of the sentence.
Another function, sents(), takes a list like that returned by read() and returns a list of lists of word/tag tuples, suitable for handing to a chunking grammar.
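The shapes described for read() and sents() can be pictured with the sketch below. It does not reproduce the code in hw5.py: the function to_tagged_sents is a hypothetical stand-in for sents(), and the Swahili sentence and its tags are invented for illustration rather than taken from the file sentences.

```python
# Hypothetical data in the shape read() is described as returning:
# one (tagged_words, gloss) pair per sentence.
pairs = [
    ([("mtoto", "NN"), ("mdogo", "JJ"), ("alisoma", "VB"), ("kitabu", "NN")],
     "the small child read a book"),
]

def to_tagged_sents(pairs):
    """Drop the glosses, keeping only the lists of (word, tag) tuples."""
    return [tagged_words for tagged_words, _gloss in pairs]

tagged_sents = to_tagged_sents(pairs)
print(tagged_sents[0])
```

Each inner list has the form that a chunk parser's parse() method expects for a single sentence.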
Write a chunking grammar for the noun phrases in these sentences, using nltk.RegexpParser().
You do not need to include a rule for entire sentences.
Then use the function parse() to parse the sentences in the file sentences, handling as many as possible.
(My grammar correctly chunked the noun phrases in all but the last sentence, which has a relative clause embedded in another relative clause.)