A resource-light approach to learning verb valencies

Alex Rudnick
School of Informatics and Computing, Indiana University
Bloomington, Indiana, USA
alexr@cs.indiana.edu

Abstract

Here we describe a work-in-progress approach for learning valencies of verbs in a morphologically rich language using only a morphological analyzer and an unannotated corpus. We will compare the results from applying this approach to an unannotated Arabic corpus with those achieved by processing the same text in treebank form. The approach will then be applied to an unannotated corpus from Quechua, a morphologically rich but resource-scarce language.

1  Introduction and approach

When constructing NLP systems for a new language, we often want to know the valence of its verbs, which is to say how many and which types of arguments each verb may combine with. This information is especially helpful in constructing stochastic parsers [7]. Some dictionaries may provide such information, but even assuming that a broad-coverage digital dictionary exists for a given language, that dictionary may not say whether arguments are optional for a given verb, or how often they occur.
An empirical approach based on a corpus or treebank allows us to learn the relative frequency with which a given verb takes specific types of arguments. As a simple example from English, we would like to learn that while "eat" usually has a direct object, "put" nearly always has one. In order to automatically learn this information for resource-scarce, morphologically rich languages, we are currently implementing a system that requires only an unannotated corpus and a morphological analyzer; other recent approaches have required more syntactic knowledge, in the form of treebanks, parsers, or chunkers.
Our approach starts by processing each sentence in the corpus with the morphological analyzer, and finding all of the verbs. For sentences with only one verb, we then count the occurrences of nouns that seem to be, because of inflection, the arguments of the verb, and also words that are plausible candidates to be the verb's arguments, where "plausibility" will be determined by a small number of language-specific heuristics. For example, a noun inflected with the accusative case in a sentence with a verb and a clear subject will likely be the object of that verb. This approach throws away the information provided by more complex sentences (those with multiple verbs and embedded clauses), but it does not require syntactic analysis, either by a human or a parser, and will hopefully approximate the frequencies that would be learned from a deeper syntactic look. Noisy observations will be filtered out using an approach similar to the one described by Przepiórkowski [7]. For consistency with other work, we will adopt the valency theory used by Bielický and Smrž in their 2008 work, which records whether a given verb usage contains an explicit Actor, Addressee, Patient, Effect, and Origin.
We would like to apply the technique to Quechua because of our medium-term goal of developing an MT system for it; Quechua is spoken by roughly 10 million people in the Andean region of South America, and is thus the largest indigenous language of the Americas [4]. Quechua encodes rather a lot of information into its verbs, including optional evidentiality. In many cases the verb's arguments are included in a suffix, although notably not when the objects are in the third person [5]. For the Quechua morphological analyzer, we will use Michael Gasser's AntiMorfo system [3], which can analyze Quechua verbs, nouns, and adjectives. Also, we have been graciously provided with the Quechua corpus collected by CMU's AVENUE project, described in [4]. However, to evaluate our work, we would like to use a treebank, wherein the objects of each of the verbs in a sentence may be easily found and the occurrences of objects counted. As far as we know, there is not yet a large treebank of Quechua, although Rios et al. have constructed a small one [6]. As the work progresses, we will make note of the differences in the distributions of verb usages between sentences with only one verb, which the system will be able to handle without use of a treebank, and sentences with multiple verbs and embedded clauses, which we will not try to handle without a deep parser.

2  Evaluation

In order to determine the efficacy of our approach, we will apply it to Arabic, another morphologically rich language, which has more available resources. We will analyze the morphology of Arabic verbs using Pierrick Brihaye's Aramorph, a port of the Buckwalter morphological analyzer that natively supports Unicode text [2]. For the Arabic text and treebank, we will use the newswire data in the Arabic Penn Treebank, Part 1, Version 3, which has both Arabic text in SGML format and as parsed trees.
This will allow us to compare the valencies learned from the unannotated corpus with those that are more directly observable from the treebank, since each verb's arguments will be easier to find with syntactic information. If the valencies that we discover with the unannotated approach are close to those learned from the treebank, and we get a broad coverage over the verbs observed in the corpus, then this would provide an argument that the technique works fairly well for Arabic, and we could continue using it as we acquire more textual data for more under-resourced languages. We'll additionally report on the distribution of verbs in our Quechua data, and how many of them occur in one-clause sentences as opposed to sentences with multiple verbs.

References

[1]
Viktor Bielický and Otakar Smrž. Building the Valency Lexicon of Arabic Verbs. LREC (2008)
[2]
Pierrick Brihaye. AraMorph morphological analyzer for Arabic. http://www.nongnu.org/aramorph/
[3]
Michael Gasser. Antimorfo morphological analyzer for Quechua. http://www.cs.indiana.edu/~gasser/software.html
[4]
Christian Monson, Ariadna Font Llitjos, Roberto Aranovich, Lori Levin, Ralf Brown, Eric Peterson, Jaime Carbonell, and Alon Lavie. Building NLP Systems For Two Resource-Scarce Indigenous Languages: Mapudungun and Quechua. In LREC 2006: Fifth International Conference on Language Resources and Evaluation. (2006)
[5]
Serafin M. Coronel-Molina. Quechua Phrasebook. Lonely Planet, Victoria, Australia. (2002)
[6]
Annette Rios, Anne Göhring and Martin Volk. 2009. A Quechua-Spanish parallel treebank. In: 7th Conference on Treebanks and Linguistic Theories, Groningen. (2009)
[7]
Adam Przepiórkowski. Towards the Automatic Acquisition of a Valence Dictionary for Polish. In: Małgorzata Marciniak and Agnieszka Mykowiecka, eds., Aspects of Natural Language Processing: Essays Dedicated to Leonard Bolc on the Occasion of His 75th Birthday, Springer Verlag, LNCS series 5070, pp. 191-210. (2009)



File translated from TEX by TTH, version 3.85.
On 8 Dec 2010, 00:15.