Research overview


For the first 20 years of my academic career, 1985-2005, my research focused on computational models of human language acquisition and language behavior. My motivations were theoretical; like most other cognitive scientists, I wanted to understand how human cognition works. For an overview of this work, go here.

Now my motivations are social/political. (I believe that all work, including research, has social and political implications and that everyone should be aware of the implications of their work. What is perhaps unusual in my case is that these implications are the starting point for the work itself. In addition to being aware of the consequences of our work, I believe that we all need an explicit vision to guide it, whatever our motivations.) The long-term vision behind my current work and that of some of my like-minded colleagues appears in this manifesto for what we call the "interface to the new information world". In brief, I want to use what I have learned from cognitive science, artificial intelligence, and computational linguistics to build tools that will enable users of the new digital technology to be better informed and, in the process, to further the democratization of information and knowledge.

If we are committed to a more democratic future, a future in which people everywhere have more power over the political and economic decisions that affect their lives and the health of the biosphere, then people everywhere must have the means to inform themselves and to inform others about their own experiences. Information is power. Informing and being informed require not only that people have access to the necessary technology but also that they can understand each other. Further, as it becomes easier and easier to make information public, people need to be able to evaluate the information they receive. I am currently focusing on the first of these two basic needs, mutual understanding, but am interested in attracting students who would like to address the second, the evaluation of information.

Within the current explosion in the quantity of information and in the means to access it, much of the world's population has been left behind because the information is not in a language that they understand. Most of my current work addresses this problem, the Linguistic Digital Divide (Paolillo, 2005). The L3 Project ("Learning Lots of Languages") has the long-term goal of developing a system to translate to and from as many as 100 under-represented languages of the Global South.

The nature of the task, in particular the dearth of corpora for training the system, makes this a special case of machine translation. Within machine translation, as within the larger field of computational linguistics, there has been a recent move toward statistical methods, in which a system is trained to perform tasks on the basis of patterns of occurrence that it gleans from massive amounts of data. For machine translation, the relevant data are bilingual texts, as well as monolingual texts in each of the two languages. The problem for L3 is that bilingual data are largely unavailable for the languages of interest; in fact, even monolingual data may be scarce for these languages. This means that we must rely initially on knowledge-based methods, that is, methods based on explicit grammars of the languages and, where available, bilingual dictionaries. While these methods will not be adequate in the long run, they should serve to get the translation system off the ground for these languages.

To go beyond the rudimentary translations possible with such a system, two sorts of problems need to be solved. First, we need a way to gather more data for training the system with statistical methods. For this purpose, we plan to make use of wiki software that would present volunteer bilingual users either with texts to be translated from one language to the other or with texts and their system-generated translations to be corrected. Second, we need to integrate the two kinds of knowledge: the initial symbolic knowledge embodied in the grammars and the dictionaries, and the subsequent statistical knowledge gained from training on new data. The problem of integrating these two sorts of knowledge into a single system is one of the fundamental problems in computational linguistics today.
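As a rough illustration of what such an integration might look like at the level of single words, and not a description of the L3 system itself, the following Python sketch lets a bilingual dictionary propose candidate translations and lets counts from volunteer corrections re-rank them. The function name `choose_translation`, the `prior` weight, and the toy Spanish-English entries are all assumptions made for the example.

```python
from collections import Counter

def choose_translation(word, dictionary, correction_counts, prior=1.0):
    """Pick a translation by combining dictionary entries (symbolic knowledge)
    with counts of volunteer-approved translations (statistical knowledge).

    Every dictionary candidate starts with the score `prior`; each observed
    correction adds 1 to the score of the translation the volunteer chose.
    """
    # Symbolic knowledge: every dictionary entry starts with the same prior score.
    scores = {candidate: prior for candidate in dictionary.get(word, [])}
    # Statistical knowledge: add the observed counts, including translations
    # that the dictionary did not list.
    for candidate, count in correction_counts.get(word, Counter()).items():
        scores[candidate] = scores.get(candidate, 0.0) + count
    if not scores:
        return None  # neither source knows this word
    # On a tie, the candidate listed first (dictionary order) wins.
    return max(scores, key=scores.get)

# Hypothetical toy data for illustration only.
dictionary = {"banco": ["bank", "bench"]}
corrections = {"banco": Counter({"bench": 3})}
```

With no corrections, `choose_translation("banco", dictionary, {})` falls back on the first dictionary entry; once volunteers have repeatedly chosen "bench", that translation is preferred. A real system would of course have to score whole sentences in context rather than isolated words.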

Finally, given so many languages, it is not feasible to build explicit knowledge about every possible translation pair. Rather we expect to rely on known (or inferred) relationships between languages to generalize from one pair to another. For example, given the close relationship between Spanish and Portuguese, the system could use knowledge about how to translate from Spanish to Indonesian to guide itself in translating from Portuguese to Indonesian.
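The idea of generalizing across related languages can be sketched in a few lines of Python. This is only a word-level toy under stated assumptions, not the project's method: the function `translate_via_related` and both small lexicons are hypothetical, and the sketch assumes that a table of correspondences between the two related languages is already available.

```python
# Hypothetical toy lexicons; the entries are illustrative only.
# Existing knowledge: Spanish -> Indonesian.
ES_TO_ID = {"agua": "air", "casa": "rumah", "libro": "buku"}
# Known correspondences between two closely related languages:
# Portuguese -> Spanish.
PT_TO_ES = {"água": "agua", "casa": "casa", "livro": "libro"}

def translate_via_related(word, related_map, known_lexicon):
    """Translate `word` by mapping it to a related language for which
    a lexicon into the target language already exists."""
    related_word = related_map.get(word)
    if related_word is None:
        return None  # no known correspondence in the related language
    # Returns None if the related word is missing from the known lexicon.
    return known_lexicon.get(related_word)

# Portuguese "livro" -> Spanish "libro" -> Indonesian "buku"
result = translate_via_related("livro", PT_TO_ES, ES_TO_ID)
```

The same fallback structure would apply to grammatical knowledge, where a rule known for one language pair serves as a starting guess for a related pair until direct evidence is available.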

Here is a recent presentation on the motivation for the L3 Project: Keynote, PowerPoint, PDF.