A resource-light approach to learning verb valencies
Alex Rudnick
School of Informatics and Computing, Indiana University
Bloomington, Indiana, USA
alexr@cs.indiana.edu
Abstract
Here we describe a work-in-progress approach for learning valencies of
verbs in a morphologically rich language using only a morphological analyzer
and an unannotated corpus. We will compare the results from applying this
approach to an unannotated Arabic corpus with those achieved by processing the
same text in treebank form. The approach will then be applied to an unannotated
corpus from Quechua, a morphologically rich but resource-scarce language.
1 Introduction and approach
When constructing NLP systems for a new language, we are likely to want to know
the valence of its verbs, which is to say how many and which types of arguments
each verb may combine with. This information is especially helpful in
constructing stochastic parsers [7]. Some dictionaries may
provide this information. But assuming a broad-coverage digital dictionary
does exist for a given language, that dictionary may not say whether arguments
are optional for a given verb, and if they are optional, how often they occur.
An empirical approach based on a corpus or treebank would allow us to learn the
frequency with which a given verb has a certain number and type of objects. To
take a simple example from English, we would like to be able to learn that
while "eat" usually has a direct object, "put" nearly always has one. Given
an unannotated corpus, one could look at each sentence and count how many verbs
occur in it. For sentences with only one verb, one would then update the
relevant counts for that verb when it is seen with nouns that can only be the
verb's objects, for instance, because they are inflected in the accusative
case. This approach throws away the information provided by more complex
sentences, but it does not require syntactic analysis, either by a human or a
parser, and will hopefully approximate the frequencies that would be learned
from a deeper look.
We are currently developing a system that implements this approach for
morphologically rich but under-resourced languages. Particularly we would like
to apply the technique to Quechua because of our goal of developing an MT
system for it; Quechua is spoken by roughly 10 million people in the Andean
region of South America, and is thus the largest indigenous language of the
Americas [3]. Quechua encodes rather a lot of information into
its verbs, including optional evidentiality. In many cases the verb's arguments
are included in a suffix, although notably not when the objects are in the
third person [5].
The approach will only require a morphological analyzer and
an unannotated corpus for the language in question. For the morphological
analyzer, we will use Michael Gasser's AntiMorfo system
[2], which can analyze Quechua verbs, nouns, and
adjectives. Also, we have been graciously provided with the Quechua corpus
collected by CMU's AVENUE project, described in [3]. However,
to evaluate our work, we would like to use a treebank, wherein the objects of
each of the verbs in a sentence may be easily found and the occurrences of
objects counted. As far as we know, there is not yet a large treebank of
Quechua, although Rios et al. have constructed a small one [6].
2 Evaluation
In order to determine the efficacy of our approach, we will apply it to Arabic,
another morphologically rich language, which has more available resources. We
will analyze the morphology of Arabic verbs using Pierrick Brihaye's
Aramorph, a port of the Buckwalter morphological analyzer that natively
supports Unicode text [1]. For the Arabic text and treebank, we will
use the newswire data in the Arabic Penn Treebank, Part 1, Version 3, which has
both Arabic text in SGML format and as parsed trees.
This will allow us to compare the valencies learned from the unannotated corpus
with those that are more directly observable from the treebank, since objects
will be easier to find with syntactic information. If the valencies that we
discover with the unannotated approach are close to those learned from the
treebank - and we get a broad coverage over of all of the verbs observed in
the corpus - that would provide an argument that our approach works fairly
well, and we could continue using it as we acquire more textual data for the
under-resourced languages.
References
- [1]
-
Pierrick Brihaye. AraMorph morphological analyzer for Arabic.
http://www.nongnu.org/aramorph/
- [2]
-
Michael Gasser. Antimorfo morphological analyzer for Quechua.
http://www.cs.indiana.edu/~gasser/software.html
- [3]
-
Christian Monson, Ariadna Font Llitjos, Roberto Aranovich, Lori Levin, Ralf
Brown, Eric Peterson, Jaime Carbonell, and Alon Lavie. 2006. Building NLP
Systems For Two Resource-Scarce Indigenous Languages: Mapudungun and Quechua.
In LREC 2006: Fifth International Conference on Language Resources and
Evaluation.
- [6]
-
Rios, A; Göhring, A; Volk, M. 2009. A Quechua-Spanish parallel treebank. In:
7th Conference on Treebanks and Linguistic Theories, Groningen, 2009 - 2009.
- [5]
-
Serafin M. Coronel-Molina. 2002. Quechua Phrasebook. Lonely Planet, Victoria,
Australia.
- [6]
-
Annette Rios, Anne Göhring and Martin Volk. 2009. A Quechua-Spanish parallel
treebank. In: 7th Conference on Treebanks and Linguistic Theories, Groningen,
2009 - 2009.
- [7]
-
Adam Przepiórkowski. 2009. Towards the Automatic Acquisition of a Valence
Dictionary for Polish. In: Małgorzata Marciniak and Agnieszka Mykowiecka,
eds., Aspects of Natural Language Processing: Essays Dedicated to Leonard Bolc
on the Occasion of His 75th Birthday, Springer Verlag, LNCS series 5070, pp.
191-210.
File translated from
TEX
by
TTH,
version 3.85.
On 22 Oct 2010, 23:00.