Speakers: Steven Bird - Ron Kaplan - Aravind Joshi - Lauri Karttunen - Richard Sproat
Investigating Natural Language with Finite-State Techniques
Steven Bird, University of Pennsylvania
Over the past decade, the size and complexity of digital linguistic resources have grown dramatically. From the social sciences to information technology, scientists and technologists have been creating rich records of natural communicative interaction, structuring them into databases, annotating them, and constructing sophisticated theories, models and products. Over the longer term, it is clear that these databases, theories, models and products will have far-reaching social, economic, industrial, and strategic impacts on the multilingual, human-centered global information society of the future.

Ten years ago, this work could only be conducted in well-funded university and industrial research laboratories. A case in point is the Linguistic Data Consortium, which has published 200 annotated linguistic databases in the last decade, the largest being 500Gb in size, and has an annual operating budget running into several millions. Today, this kind of work is within the reach of small research groups and private individuals, thanks to the declining cost of technologies for digital signal capture and mass storage, the increasing bandwidth of the web for low-cost dissemination of large databases, and the rise of XML and Unicode for the interchange of structured data using standard character encodings. Interesting examples of this proliferation are to be found in the areas of cultural heritage informatics and endangered language preservation.

When we examine the kinds of annotations that people have created for recorded linguistic signals, we observe an extremely diverse range of structures and formats. There are phonetic transcriptions and intonational codes arranged into complicated prosodic hierarchies. There are markup schemes for discourse functions and turn-taking in multi-party conversation, handling speaker overlap and backchannel cues. There are schemes for coding speech disfluencies -- hesitations, false starts and repairs.
Many long-standing research questions concerning the interplay between these levels can now be studied on significant data samples. In recent research with Mark Liberman, the author has shown that the diverse expressive requirements of these different structures and formats can be encompassed by a simple model called Annotation Graphs. These are labelled directed graphs in which the direction of the arrows respects the flow of time in an associated set of digital signals. Nodes of the graphs may be anchored to time-offsets in the signals, representing the temporal locus of the annotations stored in arc labels. Once annotation graphs are construed as finite-state automata, it becomes possible to apply well-understood computational techniques to complex multilayer annotations.

This presentation will begin with an introduction to linguistic databases, focussing on examples which are interesting for the complexity of their annotations or the importance of their role in technology development or language documentation. The annotation graph model will be briefly described, along with the mapping to finite-state machines. Finally, various empirical questions will be outlined along with their mapping into finite-state machines which can select and/or transform the annotations. This will demonstrate the practical value of the mapping from expressive linguistic representations to tractable computational representations, and the possibilities for language scientists and technologists to develop and test their models and hypotheses on large and complex databases.
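The core idea can be sketched in a few lines of Python. This is a hypothetical toy API, not the authors' actual implementation: a labelled directed graph whose nodes may carry time offsets into a signal, and whose arcs can be read off like the transitions of a finite-state acceptor, one annotation layer at a time.

```python
# Toy sketch of an annotation graph (illustrative names, not the real model's API).
class AnnotationGraph:
    def __init__(self):
        self.arcs = {}    # node -> list of (layer, label, target)
        self.times = {}   # node -> optional time offset (seconds) into the signal

    def add_node(self, node, time=None):
        self.arcs.setdefault(node, [])
        if time is not None:
            self.times[node] = time

    def add_arc(self, src, dst, layer, label):
        self.arcs.setdefault(src, []).append((layer, label, dst))

    def paths(self, start, end, layer):
        """All label sequences from start to end following only arcs of the
        given layer -- i.e. reading the graph as an acceptor for that layer."""
        results = []
        def walk(node, path):
            if node == end:
                results.append(path)
                return
            for lyr, lab, nxt in self.arcs.get(node, []):
                if lyr == layer:
                    walk(nxt, path + [lab])
        walk(start, [])
        return results

# Two annotation layers over the same stretch of signal:
g = AnnotationGraph()
for n, t in [(0, 0.0), (1, 0.32), (2, 0.61)]:
    g.add_node(n, t)
g.add_arc(0, 1, "word", "hello")
g.add_arc(1, 2, "word", "world")
g.add_arc(0, 2, "phrase", "greeting")

print(g.paths(0, 2, "word"))    # [['hello', 'world']]
print(g.paths(0, 2, "phrase"))  # [['greeting']]
```

The word-level and phrase-level annotations share the same time-anchored nodes, which is what allows a single automaton-style traversal to select annotations from any layer.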
Introduction to Finite-State Methods in Computational Linguistics
Aravind Joshi, University of Pennsylvania
Regular relations and morphological analysis
Ron Kaplan, Xerox Palo Alto Research Center
Words in natural languages are often made up of parts with individual meanings, and the meanings of words as a whole are composed from the meanings of their parts. For example, the word "walked" is made up of the stem "walk" and the past-tense suffix "ed", and the word can be used to describe a walking event that occurred in the past. Morphology is the study of how word-parts can be combined together to make up complete words. This example is deceptively simple in suggesting that a word is formed simply by concatenating the parts that make it up. It is more common to find that the parts change their forms when they are put together. For example, combining the parts "stop" and "ed" results in "stopped", with an extra "p", rather than "stoped"; "fake"+"ing" produces "faking" (the "e" disappears); and "tie"+"ing" gives "tying". These examples are still simple compared to what can happen in other languages. In many languages attaching a suffix or prefix to a stem can cause the stem itself to change its spelling and/or pronunciation. Attaching certain suffixes in German can cause vowels earlier in the word to change, for instance. A vestige of this can be seen in some irregular words in English, such as "swim"+"ed" = "swam", but such phenomena are systematic in German and other languages.

Linguists have developed succinct ways of describing how stems, prefixes, and suffixes change when they are put together to make up words. The rule "e -> 0 / _ing" specifies that "e" disappears (is zeroed out) before "ing" and thus accounts for the spelling of "faking". If that rule is applied to "tie"+"ing", then the form "tying" is produced by then submitting the result "tiing" to the rule "i -> y / _ing". Rules of this sort are very effective in characterizing the kinds of changes that occur in a wide variety of languages. But they have complicated formal properties and it is very difficult to build computational systems that apply them directly.
However, the pairs of input-output strings for such a rule form a regular relation, and such rules can therefore be systematically converted into equivalent finite-state transducers. These devices are very easy to implement and can be applied with great efficiency.
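As a minimal sketch (not the actual Xerox tools), each spelling rule above can be modelled as a regular substitution, and a cascade of such rules corresponds to a composition of finite-state transducers applied to stem+suffix strings:

```python
import re

def e_deletion(s):
    # e -> 0 / _ ing : "e" is zeroed out before "ing"
    return re.sub(r"e(?=ing)", "", s)

def i_to_y(s):
    # i -> y / _ ing : "i" becomes "y" before "ing"
    return re.sub(r"i(?=ing)", "y", s)

def apply_cascade(stem, suffix):
    """Concatenate the parts, then run the rules in order (order matters:
    e-deletion must feed i-to-y, as in the "tie"+"ing" example)."""
    s = stem + suffix
    for rule in (e_deletion, i_to_y):
        s = rule(s)
    return s

print(apply_cascade("fake", "ing"))  # faking
print(apply_cascade("tie", "ing"))   # tying  (via intermediate "tiing")
print(apply_cascade("walk", "ed"))   # walked (no rule applies)
```

In a real system the individual rule transducers would be composed into a single transducer ahead of time, so that analysis and generation require only one pass over the input.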
Linguistically-Motivated Extensions to Finite-State Methods
Lauri Karttunen, Xerox Palo Alto Research Center
Finite-state morphology has proved very successful in implementing large-scale, robust and efficient morphological analyzer-generators for a wide variety of languages, including the commercially important European languages and non-Indo-European examples like Finnish, Turkish, Hungarian, and Japanese. The two central problems of morphology in these applications are:
Word formation. Words are typically composed of smaller units of meaning, called morphemes. The morphemes that make up a word must be combined in the right way: piti-less-ness is a word of English but *piti-ness-less is not.
Morphological alternations. The shape of a morpheme often depends on the environment: pity is realized as piti in the context of less, die as dy in dying.
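The word-formation side of the problem is classically handled by a finite-state acceptor over morphemes. The following toy sketch (illustrative state names, not a real lexicon) encodes why piti-less-ness is accepted while *piti-ness-less is rejected: each state records which suffixes may still follow.

```python
# transitions: state -> {morpheme: next_state}
TRANSITIONS = {
    "start": {"piti": "stem"},
    "stem":  {"less": "adj"},   # noun stem + -less -> adjective
    "adj":   {"ness": "noun"},  # adjective + -ness -> noun
}
# Only complete words are accepting; the bare bound stem "piti" is not.
ACCEPTING = {"adj", "noun"}

def accepts(morphemes):
    """Run the morpheme sequence through the acceptor."""
    state = "start"
    for m in morphemes:
        state = TRANSITIONS.get(state, {}).get(m)
        if state is None:
            return False  # no transition: ill-formed combination
    return state in ACCEPTING

print(accepts(["piti", "less", "ness"]))  # True
print(accepts(["piti", "ness", "less"]))  # False
```

Real systems use the same idea at scale, with tens of thousands of stems and continuation classes compiled into a single network.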
While morphological alternations pose no problem whatsoever to current finite-state approaches, it has long been recognized that the finite-state methods have serious limitations in handling word formation processes in certain languages. Most natural languages are similar to English in that they construct words by concatenating morphemes together in strict orders. But some languages have additional word-formation processes that are not concatenative. For example, in Malay plural formation involves reduplication of the stem morpheme: bagi 'suitcase' --> bagi-bagi 'suitcases'. In Arabic, words such as kutib 'was written' are traditionally analyzed as consisting of a root ktb, a syllabic pattern CVCVC, and a vocalism ui. The component morphemes are combined by a process called interdigitation. Interdigitation, infixation, and reduplication have long been recognized as fundamental challenges to finite-state morphology.
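To see why reduplication is a challenge, note that full-stem reduplication requires checking that two substrings are identical copies, and the copy language {w-w} is not regular in general. The sketch below (my illustration, not from the talk) shows the copy check directly, alongside the standard finite-state workaround: since any actual lexicon is finite, the reduplicated forms can simply be enumerated and compiled into the network.

```python
def is_reduplicated(word, sep="-"):
    """True if the word is exactly two identical non-empty copies of a stem.
    This equality test is what a plain finite-state machine cannot do for
    unboundedly many stems."""
    parts = word.split(sep)
    return len(parts) == 2 and parts[0] == parts[1] and parts[0] != ""

# The finite-lexicon workaround: enumerate reduplicated plurals in advance,
# making the result a finite (hence regular) set.
LEXICON = {"bagi"}  # tiny illustrative stem lexicon
PLURALS = {stem + "-" + stem for stem in LEXICON}

print(is_reduplicated("bagi-bagi"))  # True
print("bagi-bagi" in PLURALS)        # True
```

Interdigitation and infixation resist a concatenative encoding in an analogous way, which is what motivates the linguistically-motivated extensions of the title.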
The Linguistic Significance of Finite-State Techniques
Richard Sproat, AT&T Labs
Regular languages are the simplest class of formal language, and they were also the first to be rejected as forming an adequate basis for the description of natural languages. In a famous (though in this case fallacious) argument, Chomsky (1957) held that center-embedded constructions in English cannot be handled by finite-state devices, and that therefore English itself cannot be a regular language. It seems paradoxical, then, that more than four decades after Chomsky's and other similar arguments, we should see a resurgence of finite-state techniques applied to natural language. What, if anything, does this signify for theories of language? One thing that it signifies is that, now more than ever, one must understand the difference between a linguistic model and its computational implementation. Arguments against finite-state descriptions of particular phenomena come down to one or both of the following issues:
1. The phenomenon in question cannot be described using finite-state techniques, if you assume that there is no theoretical justification for a bound on the length of such constructions. Without this assumption, such arguments evaporate.
2. Even if you accept a bound on the length, then there are still constructions that cannot be *elegantly* described by finite-state techniques.
But a *theoretically* unbounded construction can certainly be approximated to some predetermined depth by a finite-state device --- not an unreasonable move given that human performance also breaks down with certain classes of construction if they become too complex. And many tools now exist that allow one to compile a high level linguistic description into a finite-state machine, so that the elegance argument is also compromised. I will illustrate this point with examples drawn from morphology and from a formally less well-studied domain: the analysis of writing systems. The phenomena in question are either arguably non-regular, or at least not elegantly so described, yet it will be argued that one can use higher level descriptions that can be as elegant as one wants, and that methods exist for compiling the description down into a (possibly approximate) finite-state machine. The slides may be found at this site.
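The bounded-depth approximation is easy to make concrete. The toy recognizer below (my illustration) handles the schematic center-embedding language a^n b^n, which is not regular; but once n is capped at a predetermined bound, the language is finite and hence regular. The counter only ever ranges over a fixed set of values, so the function is equivalent to a DFA with a handful of states.

```python
DEPTH = 3  # predetermined embedding bound

def accepts_bounded(s):
    """Recognize a^n b^n for n <= DEPTH; reject anything deeper or unbalanced."""
    depth = 0
    i = 0
    while i < len(s) and s[i] == "a":  # count opening "a"s
        depth += 1
        if depth > DEPTH:
            return False               # sink state: too deeply embedded
        i += 1
    while i < len(s) and s[i] == "b":  # match each "a" with one "b"
        depth -= 1
        if depth < 0:
            return False
        i += 1
    return i == len(s) and depth == 0

print(accepts_bounded("aabb"))      # True
print(accepts_bounded("aaabbb"))    # True
print(accepts_bounded("aaaabbbb"))  # False: exceeds the bound
print(accepts_bounded("aab"))       # False: unbalanced
```

Human listeners likewise fail on deeply center-embedded sentences, which is why the cutoff is not an unreasonable modelling move.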