Next: The WordSieve Algorithm for Up: WordSieve: A Method for Extraction Previous: Introduction

Content or Context?

One of the issues addressed by context theory is disambiguating sentence meanings with respect to a specific context (such as disambiguating the sentence "Now I am here") [15]. Similar issues arise in determining the relevant content of documents in context. For example, the explicit content of a document about food preparation, written in the Middle Ages by a Benedictine monk, is constant in any context, but its relevant content changes depending on the context in which it is used. A person interested in the Benedictine order would use the document in a different context from a person interested in cooking, and in each context there would be a different answer to the question "What is this document about?"

Information retrieval agents generally retrieve based on document content [3,6,7,14,16], and commonly treat a document as an independent entity, taking into account its relationship to the entire corpus but not to the immediate group of recently consulted documents within which it was accessed. For example, in TFIDF-based indexing, an index vector is created for each document; the magnitude of each element indicates how well the corresponding term represents the contents of the document, based on the term's frequency within the document and the inverse of the frequency of documents in the corpus that contain it. Thus, if a term occurs very frequently in some document but rarely in the corpus as a whole, it will be heavily weighted in that document's term vector. In practice this approach provides robust results for many collections. However, it does not reflect potentially useful information about access context: it considers only the relationship between documents and a corpus, not the context in which those documents are used.
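The TFIDF weighting described above can be sketched in a few lines of Python. This is a minimal illustration, not the indexing code of any system cited here: the function name `tfidf_vectors`, the toy corpus, and the particular weighting variant (term count normalized by document length, multiplied by the logarithm of the inverse document frequency) are all assumptions, and many other variants exist.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists. The weighting variant used here
    (length-normalized term frequency times log inverse document
    frequency) is an assumption chosen for illustration.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Hypothetical three-document corpus.
corpus = [
    "genetic algorithms evolve artificial life".split(),
    "monks prepare food in the abbey".split(),
    "the abbey kept records of food".split(),
]
vectors = tfidf_vectors(corpus)
```

In this toy corpus, "monks" appears in only one document while "food" appears in two, so "monks" receives the higher weight in the second document's vector. Note that the weights depend only on the document and the corpus, not on the order or grouping in which documents were accessed.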

Our research on WordSieve is guided by the hypothesis that the relevant features of a document depend not only on what makes it different from every other document (which is captured by TFIDF), but also on what makes it similar to the documents with which it was accessed. In other words, the context of a document is reflected not only in its content, but also in the other documents with which it was accessed.

Figure 1: Percentage occurrence of the term 'genetic' during a user's search session.

As an illustration, consider Figure 1. The graph shows the occurrence of the word "genetic" in the web pages accessed during the course of a web search. The access pattern shows four clear partitions, distinguishable by the occurrence pattern of the word. When "genetic" occurred frequently, the user was searching for pages about "the use of genetic algorithms in artificial life software." Our hypothesis is that terms with this type of access pattern are useful indicators of the context of a user's browsing task.
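A series like the one plotted in Figure 1 could be computed along the following lines. This is only a sketch: it assumes that "percentage occurrence" means the share of a page's tokens equal to the term, which the text does not state, and the function name `term_percentages` and the four-page session are invented for illustration.

```python
def term_percentages(pages, term):
    """Percentage of tokens equal to `term` in each page, in access order.

    Assumes whitespace tokenization and case-insensitive matching;
    both are simplifications made for this sketch.
    """
    result = []
    for text in pages:
        tokens = text.lower().split()
        share = 100.0 * tokens.count(term) / len(tokens) if tokens else 0.0
        result.append(share)
    return result

# Hypothetical browsing session, listed in the order the pages were accessed.
session = [
    "genetic algorithms for artificial life",
    "genetic programming and genetic operators",
    "hotel booking and flight deals",
    "weather forecast for the weekend",
]
series = term_percentages(session, "genetic")
```

Here the series is high for the first two pages and zero for the last two, which is the kind of partitioned pattern the figure shows on a larger scale.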

TFIDF works by exhaustively analyzing the corpus, extracting and comparing word occurrences. This can be undesirable for two reasons. First, it is time-consuming, whereas an information agent needs to respond and learn in real time. Second, it is not desirable for an information agent to store a full-text library of every document a user has accessed. WordSieve overcomes both weaknesses by learning such terms in real time and building up information incrementally, rather than requiring the entire corpus to be analyzed at once.
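The contrast between batch and incremental processing can be illustrated with a small sketch. This is not the WordSieve algorithm, which this section does not describe; it only shows, under assumed names (`RunningTermStats`, `observe`, `fraction_containing`), how per-term statistics can be updated one document at a time so that the full text never needs to be stored.

```python
from collections import Counter

class RunningTermStats:
    """Hypothetical helper that keeps per-term document counts, not full text."""

    def __init__(self):
        self.docs_seen = 0
        self.doc_freq = Counter()

    def observe(self, text):
        # Update the counts from one document; the text can then be discarded.
        self.docs_seen += 1
        self.doc_freq.update(set(text.lower().split()))

    def fraction_containing(self, term):
        # Share of the documents seen so far that contain the term.
        return self.doc_freq[term] / self.docs_seen if self.docs_seen else 0.0

stats = RunningTermStats()
stats.observe("genetic algorithms for artificial life")
stats.observe("hotel booking and flight deals")
```

Each call to `observe` costs time proportional to the length of one document, and memory grows with the vocabulary rather than with the total amount of text accessed.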

Travis Bauer