`P(T) = prod_{i=1}^n P(R_i|L_i)`
where T is the parse tree, i indexes the rules used to expand each of the n non-terminal nodes in the tree, and R_i and L_i are the right- and left-hand sides of rule i.
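As a concrete illustration of the product above, here is a minimal sketch; the rule probabilities are made-up values, not taken from any real grammar:

```python
from math import prod

# Hypothetical probabilities P(R_i | L_i) for the rules used in one parse
# tree, e.g. S -> NP VP, NP -> DT NN, DT -> 'the', NN -> 'dog' (invented).
rule_probs = [0.8, 0.6, 0.5, 0.25]

# P(T) is simply the product of the probabilities of the rules used in T.
p_tree = prod(rule_probs)
```

Because the probabilities multiply, longer derivations get smaller scores, which is why parsers typically work with log probabilities in practice.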
If we have a treebank of parsed sentences, we can learn the probabilities for each rule in a PCFG by counting the times a rule L → R is used and dividing this by the total of the counts for all rules with L as left-hand side.
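The counting scheme can be sketched in a few lines of Python; the toy trees below are invented for illustration, whereas in the assignment the counts come from the treebank:

```python
from collections import Counter

# Each toy "tree" is flattened to the list of (lhs, rhs) rule applications
# it uses. These rules are made up for illustration.
treebank = [
    [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("VP", ("VB",))],
    [("S", ("NP", "VP")), ("NP", ("NN",)), ("VP", ("VB", "NP")),
     ("NP", ("DT", "NN"))],
]

rule_counts = Counter()   # count(L -> R)
lhs_counts = Counter()    # total count of rules with L as left-hand side

for tree in treebank:
    for lhs, rhs in tree:
        rule_counts[(lhs, rhs)] += 1
        lhs_counts[lhs] += 1

# P(L -> R) = count(L -> R) / total count of rules with L as left-hand side
rule_prob = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

# NP is expanded 3 times above, twice as DT NN, so P(NP -> DT NN) = 2/3.
```

Note that by construction the probabilities of all rules sharing a left-hand side sum to 1, as a PCFG requires.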
If we don't have a treebank but do have a non-probabilistic parser, we can use expectation maximization (EM), in the form of the inside-outside algorithm, to estimate the rule probabilities. Note that this application of EM corresponds very closely to its use in estimating the translation model parameters for statistical machine translation.
Download these modules and text files. You will be using various NLTK classes and functions related to context-free grammars and parsing; see Chapter 8, especially 8.3, 8.4, and 8.8, in the NLTK Book for information on how CFGs and parsing are handled.
You will work with a short corpus consisting of the first 324 sentences of the story Sophie's World that have fewer than 16 words, drawn from the SMULTRON treebank. The XML file smultron_en_sophie.xml contains the original treebank data (including the sentences with 16 or more words as well).
The file short.cfg
contains the rules (productions) for a non-probabilistic context-free grammar extracted from the SMULTRON treebank.
Note that, as with many treebank grammars, there is a very large number of part-of-speech categories.
For this assignment, it's not necessary that you be familiar with these.
Note also that "sentences" in this corpus may actually be fragments of sentences, so the top-level category is TOP rather than S.
The tagged sentences are in the file short_tagged_sentences; the unweighted grammar is in short.cfg.
The parser to use is nltk.parse.pchart.InsideChartParser.
Since there is no lexicon, you will treat POS tags rather than words as terminals when you parse. InsideChartParser allows you to set the beam size for the search and the number of parses returned; try a value like 800 for the beam size and a value like 30 for the number of parses returned.
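Usage of InsideChartParser might look roughly like this; the tiny PCFG and sentence below are invented for illustration, and unlike the assignment's grammar they use words rather than POS tags as terminals:

```python
from itertools import islice

import nltk
from nltk.parse.pchart import InsideChartParser

# An invented toy PCFG. In the assignment, short.cfg has no probabilities
# yet, and the terminals are POS tags rather than words.
grammar = nltk.PCFG.fromstring("""
    S  -> NP VP   [1.0]
    NP -> DT NN   [0.6]
    NP -> NN      [0.4]
    VP -> VB NP   [0.7]
    VP -> VB      [0.3]
    DT -> 'the'   [1.0]
    NN -> 'dog'   [0.5]
    NN -> 'cat'   [0.5]
    VB -> 'sees'  [1.0]
""")

parser = InsideChartParser(grammar, beam_size=800)
tokens = "the dog sees the cat".split()

# parse() yields trees in order of decreasing probability; keep at most 30.
parses = list(islice(parser.parse(tokens), 30))
for tree in parses:
    print(tree.prob(), tree)
```

Keep in mind that the beam prunes the search, so with too small a beam size some sentences may receive no parse at all.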
You will need various functions and classes from the NLTK grammar module, as well as the parser nltk.parse.viterbi.ViterbiParser.
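ViterbiParser returns just the single most probable parse. A sketch, again with an invented toy grammar standing in for the assignment's tag-level PCFG:

```python
import nltk
from nltk.parse.viterbi import ViterbiParser

# Invented toy PCFG; the assignment's grammars use POS tags as terminals.
grammar = nltk.PCFG.fromstring("""
    S  -> NP VP  [1.0]
    NP -> DT NN  [1.0]
    VP -> VB NP  [1.0]
    DT -> 'the'  [1.0]
    NN -> 'dog'  [0.5]
    NN -> 'cat'  [0.5]
    VB -> 'sees' [1.0]
""")

parser = ViterbiParser(grammar)
# parse() yields the highest-probability tree for the sentence.
best = next(parser.parse("the dog sees the cat".split()))
print(best.prob(), best)
```

Since it uses dynamic programming over the whole chart, ViterbiParser needs no beam size, but it also cannot return n-best lists the way InsideChartParser can.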
Compare your results with the corresponding trees from the treebank, which are in the file short_parsed_sentences, using the code in parseval.py to do the tree comparison.
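The comparison is PARSEVAL-style labeled-bracket scoring. Here is a rough self-contained sketch of the idea; the actual parseval.py supplied with the assignment may differ in its details:

```python
def labeled_brackets(tree, start=0):
    """Collect (label, start, end) brackets from a tree written as nested
    tuples (label, child, ...) with strings as leaves.
    Returns (set of brackets, span length)."""
    if isinstance(tree, str):        # a leaf covers one terminal position
        return set(), 1
    label, *children = tree
    brackets, pos = set(), start
    for child in children:
        child_brackets, length = labeled_brackets(child, pos)
        brackets |= child_brackets
        pos += length
    brackets.add((label, start, pos))
    return brackets, pos - start

def parseval(gold_tree, test_tree):
    """Labeled-bracket precision and recall of test_tree against gold_tree."""
    gold, _ = labeled_brackets(gold_tree)
    test, _ = labeled_brackets(test_tree)
    matched = len(gold & test)
    return matched / len(test), matched / len(gold)

# Invented example: the test parse gets the S bracket right but
# misattaches "dog", so only 1 of 3 brackets matches either way.
gold = ("S", ("NP", "the", "dog"), ("VP", "barks"))
test = ("S", ("NP", "the"), ("VP", "dog", "barks"))
precision, recall = parseval(gold, test)
```

Precision and recall are often combined into an F1 score; either way, the key point is that a parse is scored bracket by bracket rather than all-or-nothing.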
Also compare your results to the output of the parser using the grammar with probabilities learned directly from the treebank; that grammar is in the file observed_from_treebank.pcfg.