Source/jar downloads for the project are available at:
http://www.cs.indiana.edu/~kamwoods/NewsAnalyzer.tar.gz
A postscript version of the summary paper is also available:
http://www.cs.indiana.edu/~kamwoods/directednews.ps
Have been looking at additional ways to gather statistics on term usage during a search. One thing I've been thinking about is performing some simple clustering on the terms, i.e. retaining usage information about terms that are selected from the collocation list and generating a distance-to-original-term measure for each of them.
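One way this could be tracked, as a rough sketch only: treat each selection from a collocation list as a hop away from the term the search started with, and record the smallest hop count seen for each term along with how often it was picked. The class name TermTrail and its methods are hypothetical and not part of the project code, and "distance" here simply means hop count, which is only one possible reading of the measure described above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: tracks terms picked from collocation lists during one search.
class TermTrail {
    // Fewest hops seen so far from the original term to each known term.
    private final Map<String, Integer> distance = new HashMap<>();
    // How many times each term was picked from a collocation list.
    private final Map<String, Integer> selections = new HashMap<>();

    TermTrail(String originalTerm) {
        distance.put(originalTerm, 0);
    }

    // Records that `chosen` was picked from the collocation list of `source`.
    void select(String source, String chosen) {
        Integer base = distance.get(source);
        if (base == null) {
            throw new IllegalArgumentException("unknown source term: " + source);
        }
        // Keep the shortest path if the term is reached more than once.
        // Simplification: terms already reached through `chosen` are not updated.
        distance.merge(chosen, base + 1, Math::min);
        selections.merge(chosen, 1, Integer::sum);
    }

    // Returns the hop count from the original term, or -1 if the term was never seen.
    int distanceOf(String term) {
        Integer d = distance.get(term);
        return d == null ? -1 : d;
    }

    int timesSelected(String term) {
        return selections.getOrDefault(term, 0);
    }
}
```

Terms with a small distance and a high selection count would be natural candidates for the same cluster as the original term, though how to weight the two is an open question.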
The interface is all but finished. Kam provided a nice layout of the interface and I modified it a little bit and coded it up using Swing in Java. It is intentionally simple, trying to minimize the amount of background knowledge someone needs before actually using the system. As of now it is simply a search engine into the index of the cached news articles. Lucene has some amazingly powerful query abilities, including both boolean operators and wildcards.

Finished a new rewrite of the Named Entity / high frequency phrase detector tonight. Essentially, it now allows for more flexibility in locating collocations of different lengths when called by an external class; one simply passes in the maximum N-gram length that's desired, and only those of that length (or shorter) are reported. Some quick testing on only the CNN portion of the crawl indicates that noun collocations of size 3 or less are ideal; reasonable enough, since most people, places, and organizations have relatively short names.
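To illustrate the maximum-length parameter, here is a minimal sketch of counting N-grams up to a caller-supplied length. It is not the detector itself: the class and method names (NGramCounter, countNGrams, frequent) are made up for this example, the input is assumed to be already tokenized, and there is no part-of-speech filtering, so it counts all word sequences rather than only noun collocations.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: counts contiguous word sequences up to a maximum length.
class NGramCounter {

    // Counts every contiguous n-gram of length 1..maxN in the token list.
    static Map<String, Integer> countNGrams(List<String> tokens, int maxN) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            // Grow the n-gram one token at a time, stopping at maxN or end of input.
            for (int len = 1; len <= maxN && start + len <= tokens.size(); len++) {
                if (len > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + len - 1));
                counts.merge(gram.toString(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the n-grams that occur at least minCount times, in no particular order.
    static List<String> frequent(Map<String, Integer> counts, int minCount) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= minCount) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Calling countNGrams(tokens, 3) would report unigrams, bigrams, and trigrams only, which matches the size-3 limit that worked well on the CNN portion of the crawl.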
Finished the caching portion of the project. We ended up with the following cache of news articles:
The final (or close to it) version of the crawler should be complete.
Several of the early tools we looked at have been discarded. In particular, having looked closely at the code, ANNIE no longer seems appropriate for the named-entity task. While it does a good job of recognizing and categorizing terms, the complexity and breadth of the code in the GATE project make it perhaps inappropriate for what we're trying to accomplish.