Projects

L3

In the current explosion in the quantity of information and in the means to access it, much of the world's population has been left behind because the information is not in a language they understand. The L3 project ("Learning Lots of Languages") has the long-term goal of developing a system to translate to and from many under-represented languages of the Global South and (less ambitiously) of creating tools to be used in information retrieval and computer-assisted language learning with these languages.

The nature of this translation task, in particular the dearth of data for training the system, makes this a special case of machine translation. Within machine translation, as within the larger field of human language technology, there has been a recent move toward statistical methods by which a system is trained to perform tasks on the basis of patterns of occurrence that it gleans from massive amounts of data. For machine translation, the relevant data are bilingual texts, as well as texts within one or the other of the two languages. The problem for L3 is that these data are largely unavailable for the languages of interest; in fact, even monolingual data may be scarce for these languages. This means that we must rely initially on knowledge-based methods, that is, methods based on explicit grammars of the languages and, where available, bilingual dictionaries. While these methods will not be adequate in the long run, they should serve to get the translation system off the ground for these languages.

To go beyond the rudimentary translations possible with such a system, two sorts of problems need to be solved. First, we need a way to gather more data to be used in training the system using statistical methods. For this purpose, we plan to make use of Wiki software that would present volunteer bilingual users with texts to be translated from one language to the other or with texts and translations generated by the system to be corrected. Second, we need to integrate the two kinds of knowledge, the initial symbolic knowledge embodied in the grammar and the dictionaries, and the subsequent statistical knowledge gained from training on new data. The challenge of integrating these two sorts of knowledge into a single system is one of the fundamental problems in human language technology today.

Finally, given so many languages, it is not feasible to build explicit knowledge about every possible translation pair. Rather, we expect to rely on known (or inferred) relationships between languages to generalize from one pair to another. For example, given the close relationship between Spanish and Portuguese, the system could use knowledge about how to translate from Spanish to Indonesian to guide itself in translating from Portuguese to Indonesian.

Information Customs

It is clearly not enough to have access to comprehensible information. In languages like English, there may be far too much information on a given topic. Information retrieval techniques can succeed in finding what is relevant to a query but cannot help filter documents by the quality of their content. Language is used not only to inform, but also to persuade and to deceive. Techniques for persuading and deceiving are probably as old as language itself, and nothing has changed in this regard with the Digital Revolution. If anything, it is now possible to deceive people on a more massive scale than ever before; the idea of "information warfare" is now taken seriously.

How can information consumers read critically? One way is to recognize that every document is written from a particular perspective or ideology and to evaluate the content accordingly. One key way in which a writer's ideology is reflected in writing on an issue is through the framing of the issue, a notion developed at length by the cognitive scientist George Lakoff over the last 25 years. Writers rarely say openly how they are framing an issue, let alone explain all of the presuppositions and implications of the frame. In a sense, writers "smuggle" this background information into their argument, and without an understanding of how an issue is being framed, a reader can easily be manipulated. Recognizing the perspective of the writer and the way framing is used in writing is a fundamental aspect of critical reading, which in turn is necessary for informed decisions. We believe that critical reading can benefit from statistical analyses that are impossible for the readers themselves to perform. The Information Customs project seeks to develop tools that can help to expose how information is "smuggled" into texts by writers.

One way in which writers on a given topic differ is in the words they choose to make their case. Given categories (for example, "liberal" and "conservative") for a set of documents on a topic, it is a straightforward matter to determine which words are the most strongly associated with one or another category. For example, we are now focusing on documents that address the recent debate on immigration in the United States. In these documents, conservative writers are more likely to use the word "illegal", while liberal writers are more likely to use the word "undocumented". More interestingly, we can start with uncategorized documents on a topic, find the words that distinguish the documents from one another, and then cluster the documents on the basis of how these words co-occur within them. These clusters could give readers a rough idea of how particular writers on a topic group together, whether or not the grouping corresponds to one of the familiar political categories.
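The first step described above, finding the words most strongly associated with one category or the other, can be sketched as follows. This is only a minimal illustration, not the project's actual method: it assumes a smoothed log-odds ratio as the association measure, and the function names (tokenize, log_odds) and the four toy documents are invented for the example.

```python
import math
from collections import Counter

PUNCTUATION = '.,;:!?"()'

def tokenize(text):
    """Lowercase the text and strip surrounding punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]

def log_odds(docs_a, docs_b, smoothing=1.0):
    """Score each word by the log of its smoothed relative frequency in
    category A divided by its smoothed relative frequency in category B.
    Positive scores lean toward A, negative scores toward B."""
    counts_a = Counter(w for d in docs_a for w in tokenize(d))
    counts_b = Counter(w for d in docs_b for w in tokenize(d))
    vocab = set(counts_a) | set(counts_b)
    # Add-k smoothing keeps words unseen in one category from giving log(0).
    total_a = sum(counts_a.values()) + smoothing * len(vocab)
    total_b = sum(counts_b.values()) + smoothing * len(vocab)
    scores = {}
    for w in vocab:
        p_a = (counts_a[w] + smoothing) / total_a
        p_b = (counts_b[w] + smoothing) / total_b
        scores[w] = math.log(p_a / p_b)
    return scores

# Toy documents (hypothetical, for illustration only).
conservative_docs = ["Illegal immigrants cross the border.",
                     "The illegal workers."]
liberal_docs = ["Undocumented immigrants cross the border.",
                "The undocumented workers."]

scores = log_odds(conservative_docs, liberal_docs)
# Words sorted from most "conservative" to most "liberal" in this toy data.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0], ranked[-1])
```

Words shared equally by both groups score near zero, so only the extremes of the ranking are informative. With real data, a more careful measure and a much larger document set would be needed.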

But writers approaching a topic from different perspectives and with different frames of course use many of the same words. The question we would like to ask is whether these words have the same meanings for these writers. Does the word "democracy" mean the same thing when it is used by former US President George Bush as when it is used by the linguist and political writer Noam Chomsky, and if not, how might we measure the differences in meaning? We start with the notion of word sense from linguistics: what distinguishes "suit" when it refers to clothing from "suit" when it refers to a legal process. Much research in human language technology has been devoted to figuring out how many senses words have and to detecting the sense in which a word is used in a particular sentence. One approach to this problem, that of Hinrich Schütze, is to define senses in terms of co-occurrences with other words: "suit" in the one sense tends to co-occur with words like "wear" and "clothing"; "suit" in the other sense tends to co-occur with words such as "legal" and "court". Using a variant of Schütze's approach, we are investigating whether the notion of word sense can be extended to the ways in which a word like "democracy" is used by writers coming from different political perspectives. For a given topic, for example, immigration, we will again be looking for "polarizing" words that distinguish writers from one another. But in this case, we are looking not for words that are used by one group and not the other, but rather for words that are used by both groups but with different senses, as measured by the linguistic contexts of the words.
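The idea of characterizing a sense by the words that co-occur with it can be illustrated with a small sketch. This is a deliberately simplified stand-in, not Schütze's full method or the variant used in the project: it builds a plain co-occurrence count vector for a target word within a fixed window and compares two such vectors with cosine similarity. The function names, the stopword list, the window size, and the toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter

# A tiny stopword list, assumed here only to keep the toy example clean.
STOPWORDS = {"a", "the", "to", "is", "was", "he", "she", "will"}

def context_vector(sentences, target, window=2):
    """Count the content words appearing within `window` positions of
    each occurrence of `target` across the given sentences."""
    vec = Counter()
    for sentence in sentences:
        tokens = [t for t in sentence.lower().split() if t not in STOPWORDS]
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vec[tokens[j]] += 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items() if word in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy sentences (hypothetical) for the two senses of "suit".
clothing_a = ["he will wear a suit to work", "the suit is new clothing"]
clothing_b = ["she chose to wear a new suit"]
legal = ["the court dismissed the suit today", "a legal suit was filed"]

same_sense = cosine(context_vector(clothing_a, "suit"),
                    context_vector(clothing_b, "suit"))
diff_sense = cosine(context_vector(clothing_a, "suit"),
                    context_vector(legal, "suit"))
print(same_sense, diff_sense)
```

The same comparison could in principle be run on a word like "democracy", with one context vector built from each group of writers: a low similarity would suggest that the two groups use the word in different contexts, and so perhaps with different senses.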

Ultimately, we would like to have a tool that users could bring up in a browser that would highlight particular polarizing words in the document they are reading. Clicking on these words would give them more information about whether these words are used by particular writers on the topic and not others, or whether these words are used by most writers but with different senses. Readers interested in further information could see examples of these words from other writers and the names of well-known writers who use the words in the same way as the current document.