Homework 3: morphology and weighted FSTs

Email me your solution (three files, one for each FST) by Monday, February 4, 11:59pm.

Background: Swahili verb morphotactics

Assume (not quite accurately) that Swahili verbs consist of a sequence of morphemes in the following order and in the following classes ("TAM" = "tense-aspect-modality"):

SUBJECT + (TAM) + (OBJECT) + ROOT + (DERIVATION) + TERMINATION

SUBJECT consists of six prefixes representing subjects when the verb is affirmative and six other prefixes representing subjects when the verb is negative.

TAM consists of (for our purposes) three prefixes representing tenses in the affirmative (or negative in one case) and one additional prefix representing a tense in the negative. In the present negative there is no TAM prefix.

OBJECT consists of five prefixes representing objects; note that one is ambiguous.

ROOT represents the root of a particular verb. Here are the six we will be working with for this assignment: kop 'borrow', oan 'get married', pig 'hit', sem 'speak', on 'see', pend 'like'.

DERIVATION consists of suffixes that modify the meaning of the verb root. We will only consider one, the causative suffix, which has two forms: ish when the last vowel of the root is a, i, or u, and esh when the last vowel of the root is e or o.
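The choice between the two causative forms can be sketched in plain Python before it is encoded as an FST. This is only an illustration of the rule as stated above; the function name causative_suffix is hypothetical and is not part of fst.py.

```python
def causative_suffix(root):
    """Return 'ish' or 'esh' depending on the last vowel of the root."""
    # Scan from the end of the root so the first vowel found is the last one.
    for ch in reversed(root):
        if ch in "aiu":
            return "ish"
        if ch in "eo":
            return "esh"
    raise ValueError("root has no vowel: %r" % root)


# The six roots used in this assignment.
ROOTS = ["kop", "oan", "pig", "sem", "on", "pend"]

for root in ROOTS:
    print(root + causative_suffix(root))
```

An FST cannot look back over the whole root in this way, so the transducer version has to remember the most recent vowel in its state instead.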

TERMINATION is a final suffix; we will consider only three possibilities.
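The ordering of the morpheme classes can be checked with a small sketch like the following. It tests only the order and optionality of the classes given in the template above, not the co-occurrence restrictions between particular prefixes and suffixes. The one-letter codes and the function name is_valid_order are hypothetical.

```python
import re

# One letter per morpheme class (hypothetical abbreviations).
CLASS_CODES = {
    "SUBJECT": "S",
    "TAM": "T",
    "OBJECT": "O",
    "ROOT": "R",
    "DERIVATION": "D",
    "TERMINATION": "E",
}

# SUBJECT + (TAM) + (OBJECT) + ROOT + (DERIVATION) + TERMINATION
TEMPLATE = re.compile(r"ST?O?RD?E")


def is_valid_order(classes):
    """Return True if the sequence of class names fits the template."""
    codes = "".join(CLASS_CODES[name] for name in classes)
    return TEMPLATE.fullmatch(codes) is not None


print(is_valid_order(["SUBJECT", "TAM", "ROOT", "TERMINATION"]))
print(is_valid_order(["TAM", "SUBJECT", "ROOT", "TERMINATION"]))
```

In the morphotactics FST the same ordering falls out of the state graph: each class corresponds to a group of transitions, and optional classes can be skipped by epsilon transitions.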

Here are some possible Swahili verbs ("+"s have been added between the morphemes to make them more readable).

Here are some forms that are impossible (the characters in bold are wrong).

The program

Make sure you have installed NLTK. We will be using the feature structure module from NLTK for this assignment. In addition, download these additional modules and examples (do not use the fst.py module that comes with NLTK).

FST files

You need a separate file for each finite state transducer you create. Look at en_words.fst and en_pl_es.fst to see examples.

A state is declared to be the input state (there can only be one) with a line like this:

-> state_name

A state is declared to be a final state (there can be any number) with a line like this:

state_name ->

A transition is created with a line like this:

src_state_name -> dest_state_name [input_str:output_str] weight

input_str must be a single character (output_str need not be). If either input_str or output_str is missing, that side is interpreted as epsilon (no character consumed or produced). Both sides may be epsilon ("[:]"). If the colon is missing, the single character that appears represents both the input and the output string. "?" represents the "unknown" character class, that is, any character not in the "sigma alphabet" (for our purposes, the characters that appear explicitly on transitions). Both "[?]" and "[?:?]" mean that the same unknown character appears in the input and output strings. The weight may be left out, in which case the transition is given the default weight for the relevant semiring.
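To make the label conventions concrete, here is a minimal sketch of how a transition line could be split into its parts. It is not the parser in fst.py: the function name parse_transition and the dictionary layout are hypothetical, epsilon is represented as the empty string, the weight is kept as an uninterpreted string, and character set names are not expanded.

```python
import re

# src -> dest [label] optional-weight
TRANSITION = re.compile(r"^\s*(\S+)\s*->\s*(\S+)\s*\[([^\]]*)\]\s*(.*?)\s*$")


def parse_transition(line):
    """Split a transition line into source, destination, label, and weight."""
    match = TRANSITION.match(line)
    if match is None:
        raise ValueError("not a transition line: %r" % line)
    src, dest, label, weight = match.groups()
    if ":" in label:
        # A missing side is epsilon, represented here as ''.
        in_str, out_str = label.split(":", 1)
    else:
        # No colon: the same string is both input and output.
        in_str = out_str = label
    return {
        "src": src,
        "dest": dest,
        "in": in_str,
        "out": out_str,
        "weight": weight or None,  # None means "use the semiring default"
    }


print(parse_transition("a -> b [x:yz] 0.5"))
print(parse_transition("start -> start [?]"))
```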

You can create a simple sequence of states and transitions joining two other states with a line like this:

src_state_name -> dest_state_name <input_str:output_str> weight

In this case a separate state is created for each character in input_str, and these states are joined to each other and to the source and destination states by transitions. If ":" appears between the "<" and the ">", whatever is to the right of it (possibly nothing) appears on the output side of the last (rightmost) transition. If there is no ":", each transition's output string is the same as its input string. The weight is optional; if it appears, it is associated with the last (rightmost) transition.

You can create sets of characters like this:

charset_name = {char1, char2, ...}

You can then use the character set name in place of ordinary characters in input and output strings. This results in a separate transition for every character in the set.

Note that at the lexical level, the FST must generate at least one character. In en_words.fst, this is the character "&" signaling the end of the word.

Functions

To create an FST, use the static FST method load(), which takes a filename and an optional "weighting", that is, an instance of the Semiring class.

>>> lex = FST.load('en_nouns.fst', UNIFICATION_SR)

To transduce a string, use the FST method transduce(), which takes the string and an optional initial weight. Create a feature structure weight with the Semiring method parse().

>>> lex.transduce('cats')
[('&', set([[num='plur', -poss, stem='cat']]))]
>>> lex.transduce('cats', UNIFICATION_SR.parse('[num=sing]'))
[]

To run an FST in the other direction, you first need to invert it.

>>> lex_inv = lex.inverted()
>>> lex_inv.transduce('&', UNIFICATION_SR.parse("[num='sing', +poss, stem='dog']"))
[("dog's", set([[num='sing', +poss, stem='dog']]))]

To compose two FSTs, use the static method compose().

>>> es = FST.load('en_pl_es.fst', UNIFICATION_SR)
>>> es_lex = FST.compose(es, lex)
...
<__main__.FST object at 0x15a5ff0>
>>> es_lex.transduce('boxes')
[('&', set([[num='plur', -poss, stem='box']]))]

FSTs know how to pretty-print themselves.

>>> print es
FST en_pl_es.fst
end -> # Final state
sib -> # Final state
-> start # Initial state
start -> # Final state
start -> start [?:?]
...

See fst.py for other useful methods.

What you have to do

  1. Write two FSTs: one that handles the variation in the forms of 'him/her' and the subject form of 'you (pl.)', and one that handles the variation of the vowel in the causative suffix. These rules apply to other cases besides these, so your FSTs should not be specific to particular morphological contexts. The best way to do this is to assume a special character at the lexical level that gets realized at the surface in two different ways, depending on what precedes (for the causative suffix) or what follows (for 'him/her' and 'you (pl.)').
  2. Write an FST that represents the morphotactics of Swahili verbs and includes the six stems listed above.
  3. Compose the three transducers into one and test it.
    >>> mw = FST.load('sw_mw.fst', UNIFICATION_SR)
    >>> ei = FST.load('sw_ei.fst', UNIFICATION_SR)
    >>> lex = FST.load('sw_verbs.fst', UNIFICATION_SR)
    >>> sw = FST.compose(FST.compose(mw, ei), lex)
    ...
    >>> sw.transduce('nilisema')
    [('&', set([[-caus, obj='none', pol='aff', sbj=[num='sing', prs=1], stem='sem', tns='past']]))]
    >>> sw.transduce('hawampigi')
    [('&', set([[-caus, obj=[num='sing', prs=3], pol='neg', sbj=[num='plur', prs=3], stem='pig', tns='pres']]))]
    >>> sw.transduce('tutawakopesheni')
    [('&', set([[+caus, obj=[num='plur', prs=2], pol='aff', sbj=[num='plur', prs=1], stem='kop', tns='fut']]))]
    >>> sw.transduce('nilisemi')
    []
    >>> sw.transduce('hutamona')
    []
    >>> sw_inv = sw.inverted()
    >>> sw_inv.transduce('&', UNIFICATION_SR.parse("[-caus, obj=[num='sing', prs=3], pol='neg', sbj=[num='plur', prs=1], stem='on', tns='pres']"))
    [('hatumwoni', set([[-caus, obj=[num='sing', prs=3], pol='neg', sbj=[num='plur', prs=1], stem='on', tns='pres']]))]
    >>> sw_inv.transduce('&', UNIFICATION_SR.parse("[+caus, obj=[num='plur', prs=2], pol='aff', sbj=[num='plur', prs=3], stem='sem', tns='pres']"))
    [('wanawasemesheni', set([[+caus, obj=[num='plur', prs=2], pol='aff', sbj=[num='plur', prs=3], stem='sem', tns='pres']])), ('wanawasemesha', set([[+caus, obj=[num='plur', prs=2], pol='aff', sbj=[num='plur', prs=3], stem='sem', tns='pres']]))]

    Note that the features given here are not the only possibility.
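The "special character" idea from step 1 can be illustrated outside the FST framework. The sketch below assumes, as the examples hawampigi and hatumwoni suggest, that 'him/her' surfaces as mw before a vowel and as m elsewhere. The placeholder "M" and the function name realize are hypothetical, and the function is a plain string rewrite, not a transducer.

```python
VOWELS = set("aeiou")


def realize(lexical, placeholder="M"):
    """Rewrite each placeholder as 'mw' before a vowel and 'm' otherwise."""
    surface = []
    for i, ch in enumerate(lexical):
        if ch != placeholder:
            surface.append(ch)
            continue
        # Look one character ahead; at the end of the string there is none.
        following = lexical[i + 1] if i + 1 < len(lexical) else ""
        surface.append("mw" if following in VOWELS else "m")
    return "".join(surface)


print(realize("hawaMpigi"))
print(realize("hatuMoni"))
```

In an FST the one-character lookahead has to be expressed through states: after reading the placeholder, the machine moves to a state from which only the transitions consistent with the chosen surface form are available.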
