Email me your solution (three files, one for each FST) by Monday, February 4, 11:59pm.
Assume (not quite accurately) that Swahili verbs consist of a sequence of morphemes in the following order and in the following classes ("TAM" = "tense-aspect-modality"):
SUBJECT + (TAM) + (OBJECT) + ROOT + (DERIVATION) + TERMINATION
SUBJECT consists of six prefixes representing subjects when the verb is affirmative and six other prefixes representing subjects when the verb is negative.
TAM consists of (for our purposes) three prefixes representing tenses in the affirmative (or negative in one case) and one additional prefix representing a tense in the negative. In the present negative there is no TAM prefix.
OBJECT consists of five prefixes representing objects; note that one is ambiguous.
ROOT represents the root of a particular verb. Here are the six we will be working with for this assignment: kop 'borrow', oan 'get married', pig 'hit', sem 'speak', on 'see', pend 'like'.
DERIVATION consists of suffixes that modify the meaning of the verb root. We will only consider one, the causative suffix, which has two forms: ish when the last vowel of the root is a, i, or u, and esh when the last vowel of the root is e or o.
TERMINATION is a final suffix; we will consider only three possibilities.
Here are some possible Swahili verbs ("+"s have been added between the morphemes to make them more readable).
Here are some forms that are impossible (the characters in bold are wrong).
Make sure you have installed NLTK.
We will be using the feature structure module from NLTK for this assignment.
In addition, download these additional modules and examples (do
not use the fst.py module that comes with NLTK).
You need a separate file for each finite state transducer you create.
Look at en_words.fst and en_pl_es.fst to
see examples.
A state is declared to be the input state (there can only be one) with a line like this:
A state is declared to be a final state (there can be any number) with a line like this:
A transition is created with a line like this:
input_str must be a single character (output_str
need not be).
If either input_str or output_str is missing,
that side is interpreted as epsilon (no character consumed).
Both sides may be epsilon ("[:]").
If the colon is missing, the single character that appears represents
both the input and the output string.
"?" represents the "unknown" character class, that is, any character not in
the "sigma alphabet" (for our purposes, the characters that appear explicitly
on transitions).
"[?]" or "[?:?]" mean the same unknown character in the input and output strings.
The weight may be left out, in which case the transition is given the
default weight for the relevant semiring.
You can create a simple sequence of states and transitions joining two other states with a line like this:
In this case a separate state is created for each character in input_str, and these states are joined to each other and to the source and destination states by transitions.
If ":" appears between the "<" and the ">", whatever is to the right of it (possible nothing) appears on the output side of the last (rightmost) transition.
If there is no ":", each transition's output string is the same as its input string.
The weight is optional; if it appears, it is associated with the last (rightmost) transition.
You can create sets of characters like this:
You can then use the character set name in place of ordinary characters in input and output strings. This results in a separate transition for every character in the set.
Note that at the lexical level, the FST must generate at least one character.
In en_words.fst, this is the character "&" signaling the end of the
word.
To create an FST, use the static FST method load(), which takes a filename and an optional "weighting", that is, an instance of the Semiring class.
To transduce a string, use the FST method transduce(), which takes the string and an optional initial weight.
Create a feature structure weight with the Semiring method parse().
To run an FST in the other direction, you first need to invert it.
To compose two FSTs, use the static method compose().
FSTs know how to pretty-print themselves.
See fst.py for other useful methods.
Note that the features given here are not the only possibility.