Thursday, April 27, 2006

Source/jar downloads for the project are available at:

http://www.cs.indiana.edu/~kamwoods/NewsAnalyzer.tar.gz

A postscript version of the summary paper is also available:

http://www.cs.indiana.edu/~kamwoods/directednews.ps

Wednesday, April 19, 2006

Have been looking at additional ways to gather stats on term usage during the process of a search. One thing I've been thinking about is performing some simple clustering on the terms, i.e. retaining usage information about terms that are selected from the collocation list and generating a measure of distance-to-original-term.

LingPipe provides a nice mechanism for locating clusters given a basic proximity matrix for a set of items. One issue with actually implementing this is that, over the course of a search, the matrix could grow extremely large. Additionally, while the state of the learned models is retained during the operation of the interactive app, we don't have any general mechanism for generating a use history. I could throw together an interface to a postgres database in Python pretty quickly, but I'm not sure I'll get a chance to integrate that kind of functionality into the app at this point.
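To make the clustering idea concrete, here is a minimal sketch in plain Python rather than the LingPipe API. It assumes single-link clustering with a distance cutoff, which is only one of several possible schemes, and the terms and distances in the usage note are made up. Storing only the pairs that have actually been observed (a dictionary instead of a full matrix) is one possible way to keep the matrix from growing too large.

```python
from itertools import combinations


def single_link_clusters(terms, dist, max_dist):
    """Group terms that are linked by a chain of distances <= max_dist.

    dist maps frozenset({a, b}) to a distance. Pairs that are missing
    from dist are treated as infinitely far apart, so only observed
    pairs need to be stored.
    """
    # Union-find: every term starts in its own cluster.
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    # Merge the clusters of any two terms that are close enough.
    for a, b in combinations(terms, 2):
        if dist.get(frozenset((a, b)), float("inf")) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for t in terms:
        clusters.setdefault(find(t), []).append(t)
    return sorted(sorted(c) for c in clusters.values())
```

With the hypothetical terms "NASA", "shuttle", "Discovery", and "Supreme Court", and small distances recorded only for NASA/shuttle and shuttle/Discovery, a cutoff of 0.5 puts the first three terms in one cluster and leaves "Supreme Court" on its own.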

Still, it would be nice to have this kind of information. Being able to point to specific term selections at key points in the search would give a concrete reference for exactly why these kinds of terms can be useful.
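A use history of this kind could be quite small. The sketch below uses Python's built-in sqlite3 module as a stand-in for postgres so that it runs without a server; the table layout and the function names are hypothetical, not anything that exists in the app.

```python
import sqlite3


def open_history(path=":memory:"):
    """Open (or create) a selection-history database. Schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS selections ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " search_id TEXT NOT NULL,"
        " step INTEGER NOT NULL,"
        " query TEXT NOT NULL,"
        " selected_term TEXT NOT NULL)"
    )
    return conn


def record_selection(conn, search_id, step, query, term):
    """Record that the user picked `term` from the collocation list."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO selections (search_id, step, query, selected_term)"
            " VALUES (?, ?, ?, ?)",
            (search_id, step, query, term),
        )


def history_for(conn, search_id):
    """Return the (step, query, selected_term) rows of one search, in order."""
    rows = conn.execute(
        "SELECT step, query, selected_term FROM selections"
        " WHERE search_id = ? ORDER BY step",
        (search_id,),
    )
    return rows.fetchall()
```

Replaying `history_for` for a given search would show which term was chosen at each step, which is the kind of reference described above.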

Monday, April 17, 2006

The interface is all but finished. Kam provided a nice layout of the interface, and I modified it a little bit and coded it up using Swing in Java. It is intentionally simple, to minimize the amount of background knowledge someone needs before actually using the system. As of now it is simply a search engine into the index of the cached news articles. Lucene has some amazingly powerful query abilities, including both boolean operators and wildcards.



The hooks are there for Kam's interesting phrases. For now, the list of files on the left is the result of the query to the index. The pane in the lower right is where the article is displayed. An attempt is made to pull out as much of the article by itself as possible, mostly because displaying the entire page in a JEditorPane is incredibly ugly given all the fancy things the news sites do. There is still considerable noise, generally at the top and bottom of the articles (sometimes extra links or advertisements), but the point is not to make it perfect but instead to direct the user to the article. It would be possible to pull out just the text, but that would require a lot more work in determining how each news site sets up its HTML code.

Because the article pane displays HTML, it is easy to add more tags to highlight different words. I wrote a function that highlights a passed-in phrase with a passed-in color. It is used to highlight the query terms, and it will also highlight the interesting phrases themselves if they appear in an article as it is browsed.
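The highlighting idea can be sketched in a few lines. This is a Python illustration, not the Swing code from the app, and the markup it emits (a span with a background color) is an assumption; the real pane may need different tags. The sketch only touches text between tags, so a phrase that happens to appear inside an attribute is left alone.

```python
import re


def highlight(page_html, phrase, color):
    """Wrap each case-insensitive occurrence of phrase in a colored span.

    Only text outside of tags is changed. A phrase that is split across
    tags is not matched by this simple approach.
    """
    if not phrase:
        return page_html
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)

    def wrap(match):
        return '<span style="background-color: {}">{}</span>'.format(
            color, match.group(0)
        )

    # Splitting on a capturing group keeps the tags in the result:
    # even positions hold text, odd positions hold tags.
    parts = re.split(r"(<[^>]*>)", page_html)
    return "".join(
        part if i % 2 else pattern.sub(wrap, part)
        for i, part in enumerate(parts)
    )
```

Calling it once per query term and once per interesting phrase, each with its own color, would give the layered highlighting described above.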

Thursday, April 13, 2006

Finished a new rewrite of the Named Entity / high frequency phrase detector tonight. Essentially, it now allows for more flexibility in locating collocations of different lengths when called by an external class; one simply passes in the maximum N-gram length that's desired, and only those of that length or shorter are reported. Some quick testing on only the CNN portion of the crawl indicates that noun collocations of size 3 or less are ideal; reasonable enough, since most person names, places, and organizations have relatively short names.
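As a rough illustration of the maximum-length parameter, the following Python sketch counts every word sequence from 2 up to max_n tokens. It is not the detector itself (which relies on LingPipe); the tokenizer and the function name are assumptions made for the example.

```python
import re
from collections import Counter


def count_ngrams(text, max_n):
    """Count all word n-grams of length 2 through max_n in text."""
    # Very simple tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts
```

With max_n set to 3, a phrase like "Joint Chiefs" is counted as a bigram and "Joint Chiefs of" as a trigram, but nothing longer is ever reported.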

The new module also allows the caller to pass in a string array of full path names to (respectively) build the testing and training models. This was a sticking point before, since we build a new test model for *each* search, and passing directories only would have required a complicated scheme of creating symlinks to temporary directories on the fly - an ugly hack that we didn't want to deal with.

As noted above, the models can be tuned by adjusting the size of the desired N-grams and the distribution of articles between testing and training. LingPipe provides a method for scoring likely collocations via a chi-squared independence statistic. While these confidence values will not be reported to the user (they would be lacking in utility unless we maintained a 1-day move percentage a la Yahoo Buzzwords), they are initially retained. Training/testing conducted previously yielded results of the following nature (explanation continued below):

Training Phase
--------------
Score: 1436180.0000000002 with :Internal Revenue
Score: 1436180.0000000002 with :Creators Syndicate
Score: 1436180.0 with :Joint Chiefs
Score: 1436180.0 with :Los Angeles
Score: 1436180.0 with :Argeree Cooks
Score: 1436179.9999999998 with :Molly Ivins
Score: 1419858.7840310165 with :Free Trial
Score: 1419858.7840310165 with :Story Tools
Score: 1403904.3369602123 with :Trial Issues
Score: 1356391.2777665984 with :Sex Crimes
Score: 1350775.1663139553 with :Member Center
Score: 1342510.477924995 with :Most Popular
Score: 1340433.733324235 with :Saddam Hussein
Score: 1320737.4690084415 with :Top Stories
Score: 1292561.0999943605 with :Ronnie Earle
Score: 1259800.876979201 with :United States
Score: 1256656.624995735 with :Angry Left
Score: 1222961.250849913 with :Supreme Court
Score: 1196815.8333304322 with :Martin Luther
...
Score: 923256.2142750212 with :Walter Cronkite
Score: 920624.2307390401 with :Independent Counsel
Score: 915304.3659807172 with :International Edition
Score: 897610.6249934724 with :Peter Pace
Score: 897608.7499738891 with :Kenneth Glenn
Score: 883799.9999828606 with :Bryce Harlow


Sample Testing Phase
--------------------
Score: 3.3844128723063843E43 with :Air Transportation
Score: 2.5216418034188786E36 with :Southern Living
Score: 2.2694776230769906E36 with :Fringe Benefits
Score: 2.1973375572443162E36 with :Appropriations Committee
Score: 1.512985082051327E36 with :Nuclear Warfare
Score: 6.967912121325137E35 with :House Appropriations
Score: 1.115855837325814E32 with :Susan Collins
Score: 7.695557498798717E31 with :Russ Feingold
Score: 2.693445124579551E31 with :Haley Barbour
Score: 7.515250206948199E28 with :Charlie Rose
Score: 1.3153473794366464E28 with :Christopher Dodd
Score: 5.2842006171654996E26 with :Nancy Grace
Score: 2.9356670095363887E26 with :Andrew Card
Score: 1.167753386768111E25 with :Learning Activities
Score: 1.167753386768111E25 with :Lottery Commission
Score: 9.308623994755084E24 with :Situation Room
Score: 2.1204851745780715E24 with :Situation Report
Score: 2.0735522149105082E24 with :The Situation
Score: 5.456812025630777E23 with :Health Treatment

Note the extremely high scores in this case, indicating the presence of terms appearing with much greater frequency than we would expect given the background corpus. In the final tests, these differences will be mitigated by testing/training on a more reasonable corpus distribution.
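For reference, a chi-squared independence score for a two-word collocation can be computed from a 2x2 table of counts. The sketch below is the textbook formula written in Python; whether LingPipe computes its scores in exactly this way is an assumption, and the parameter names are mine. Under this formula, a pair whose words never occur apart scores exactly the total number of bigrams, which may be why several training scores share the same ceiling, though that is a guess.

```python
def chi_squared(pair_count, first_count, second_count, total):
    """Chi-squared independence score for the bigram (first, second).

    pair_count   -- times the two words occur together, in order
    first_count  -- bigrams that start with the first word
    second_count -- bigrams that end with the second word
    total        -- total number of bigrams in the corpus
    """
    o11 = pair_count                        # first and second together
    o12 = first_count - pair_count          # first without second
    o21 = second_count - pair_count         # second without first
    o22 = total - first_count - second_count + pair_count  # neither
    denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    if denom == 0:
        return 0.0  # degenerate table, no evidence either way
    return total * (o11 * o22 - o12 * o21) ** 2 / denom
```

A score near zero means the pair occurs about as often as chance would predict; larger scores mean the words co-occur more (or less) than independence would suggest.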

Friday, April 07, 2006

Finished the caching portion of the project. We ended up with the following cache of news articles:

CNN - 3562 articles (out of 109,974 URLs seen) in 2 days 21 hours 24 minutes

New York Times - 397 articles (out of 93,580 URLs seen) in 1 day 22 hours 45 minutes

Yahoo! News - 4121 articles (out of 69,367 URLs seen) in 14 hours 25 minutes 17 seconds

Notice the difference between statistics. Possible reasons for this include the way each site holds files (splitting them up over multiple pages, access to archives, etc.) and how they provide access to those files (hidden behind a login, cached on a different 'archive' site, etc.).
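The difference is easier to see as a yield, that is, the share of URLs seen that turned out to be articles. A small Python calculation, assuming the Yahoo! News figure of 69,367 also counts URLs seen:

```python
# (articles captured, URLs seen) for each crawl, from the figures above.
CRAWLS = {
    "CNN": (3562, 109974),
    "New York Times": (397, 93580),
    "Yahoo! News": (4121, 69367),
}


def yield_percent(articles, urls_seen):
    """Percentage of URLs seen that were captured as articles."""
    return 100.0 * articles / urls_seen


def crawl_yields(crawls):
    """Map each site to its article yield, rounded to two decimals."""
    return {
        site: round(yield_percent(articles, seen), 2)
        for site, (articles, seen) in crawls.items()
    }
```

This works out to roughly 3.2% for CNN, 0.4% for the New York Times, and 5.9% for Yahoo! News.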

Two interesting side notes about the Yahoo! News site. First, the entire site was crawled, in contrast to the other two sites. This indicates that Yahoo! tends to use different sites (instead of news.yahoo.com) to split up their different services. Second, many of the articles it pulled were in fact really quick blurbs that may have accompanied a picture, so their number of articles may be slightly inflated.

Tuesday, March 14, 2006

The final (or close to it) version of the crawler should be complete.

The goal was to create a post-processor for Heritrix that would identify pages on a given website as articles and capture and index those articles. Because the websites and their article templates were so varied, there were two obvious ways that this could have been accomplished. First, we could write a processor that was specific to each site. In effect, we'd have a processor that works for cnn.com, one for nytimes.com, one for foxnews.com, etc. While this would certainly work and allow maximum flexibility, the overhead of coding and then compiling each specific solution, as well as the obvious non-scalability of this approach, forced us to look for another solution.

The second way, and the way we chose, was to push much of the post-processing from Heritrix into the actual news analyzer itself and thus create a more generic post-processor for Heritrix. This means we don't have to recompile Heritrix every time we want to set it up for a new site. It also gives us a large cache of documents to experiment with to determine the best way to analyze them, given that the HTML itself may be useful.

As of now, the crawler works as follows. Heritrix by itself allows the filtering of URLs, so that each site is examined and the URLs that are guaranteed not to have articles, or be indices to other articles, are filtered out. This usually takes the form of video directories, file endings (.gif, .mov, etc.), and basic host names (filtering out things like ads.cnn.com). This is entirely separate from our post-processor and is done to speed up the crawl. Once this is done the crawl acts just like any other crawl. There is a nice diagram on the Heritrix site that shows the chain of processors that each URL (and associated HTML) goes through. The majority of these are provided with the base Heritrix implementation (link extraction, frontier addition, etc.). Our processor occurs near the end, in the post-processing section.
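The pre-filtering step amounts to a few cheap checks on each URL. Heritrix handles this through its own filter settings, so the Python below is only a stand-in to show the logic. The .gif and .mov endings and the ads.cnn.com host come from the description above; the /video/ path prefix is a hypothetical example of a video directory.

```python
from urllib.parse import urlparse

BLOCKED_EXTENSIONS = (".gif", ".mov")
BLOCKED_HOSTS = {"ads.cnn.com"}
BLOCKED_PATH_PREFIXES = ("/video/",)  # hypothetical directory name


def should_crawl(url):
    """Return False for URLs that cannot be articles or article indices."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()
    if host in BLOCKED_HOSTS:
        return False
    if path.endswith(BLOCKED_EXTENSIONS):
        return False
    if path.startswith(BLOCKED_PATH_PREFIXES):
        return False
    return True
```

Anything that passes these checks continues through the normal processor chain.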

The NewsProcessor module takes the page it is given and first checks whether it is an article, using a regular expression that can be defined as an input parameter from the UI of Heritrix. For example, in the case of CNN.com each article occurs between the tags and , so we can check to see if those exist on the page using the regular expression ".+". Note that in the past we actually took the opportunity to capture only the text that was between those tags. Now, however, if we find those tags we go ahead and capture the entire page's HTML.

Using this method we should be able to match any site's articles. They will not necessarily be marked in that format or something similar, but because the check is nice and general, we can easily match things like meta tags as well. Basically we have switched from a complicated question like "Determine if this page has an article and capture just that article if it does." to a simple yes/no question of "Does this page contain an article?"

If the page is determined to contain an article, it is captured verbatim into a local file (location defined by Heritrix UI input parameters). The next step of the processor is to index the page. Apache's Lucene is used to accomplish this. The contents of the page are cleaned of HTML tags and then indexed (but not stored) by Lucene.
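The yes/no check and the tag cleaning can be sketched as follows. This is Python rather than the Heritrix processor itself, and the marker comments in ARTICLE_PATTERN are hypothetical placeholders, not the markers any real site uses. The function names are also mine.

```python
import re
from html.parser import HTMLParser

# Hypothetical article markers; in practice this comes from the Heritrix UI.
ARTICLE_PATTERN = r"<!--startarticle-->.+<!--endarticle-->"


def contains_article(page_html, article_pattern):
    """Answer the simple question: does this page contain an article?"""
    # DOTALL lets ".+" span the line breaks inside the article body.
    return re.search(article_pattern, page_html, re.DOTALL) is not None


class _TextExtractor(HTMLParser):
    """Collect the text between tags (script and style text is not filtered)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(page_html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(page_html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())
```

A page that passes `contains_article` would be written to disk unchanged, and the output of `strip_tags` would be what gets handed to the indexer.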

This index can then be used to find these pages later on, either by URL or by file path. For example, if we wanted to find all of the articles associated with NASA, we would just run a query with the word NASA in it and get back a list of the URLs (or file paths, or both) of the articles that contain that word.
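To show what "indexed but not stored" buys us, here is a toy inverted index in Python. It is not Lucene and has none of its query features; the class and method names are made up for the example. The index remembers which articles contain each word, along with their URL and file path, but not the article text.

```python
import re
from collections import defaultdict


class ArticleIndex:
    """Toy inverted index: word -> set of (url, file_path) pairs."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url, file_path, text):
        """Index the words of an article without keeping the text itself."""
        for term in set(re.findall(r"\w+", text.lower())):
            self._postings[term].add((url, file_path))

    def search(self, word):
        """Return the sorted (url, file_path) pairs of articles containing word."""
        return sorted(self._postings.get(word.lower(), set()))
```

A search for "NASA" then returns the locations of matching articles, and the cached file can be opened from its path for display.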

The final test of the crawler on CNN.com is currently running. When it is finished and checks out, the other crawls of whatever news sites we choose can occur in parallel. Thus we should be set up nicely for the next phase of the project, the actual news analyzer.

Thursday, March 09, 2006

Several of the early tools we looked at have been discarded. In particular, having looked closely at the code, ANNIE no longer seems appropriate for the named-entity task. While it does a good job of recognizing and categorizing terms, the complexity and breadth of the code in the GATE project make it perhaps inappropriate for what we're trying to accomplish.

As an alternative, I have been examining the modules provided by the LingPipe project (http://www.alias-i.com/lingpipe). It's a tradeoff. On the one hand, LingPipe's method of entity location doesn't have nearly the sophistication of that provided by GATE. On the other hand, the API is much more straightforward, and it is *fast*. One of the things we've been trying to keep in mind is how to ensure that this will ultimately be a usable tool, rather than a run-once analysis task. It's encouraging that LingPipe notes as one of its features 'on-line' training - i.e., train and tag as data becomes available.

As a further note, I've been working on alternative methods for additional inferencing. The initial plan, to fully annotate the text with part-of-speech and limited semantic tags, is going to prove difficult for an on-line tool due to the sheer amount of time required for these tasks. Again, a compromise: how to add useful analytical components without a) requiring a massive data store, b) requiring reprocessing of identical data search after search, c) frustrating the user.