Tuesday, March 14, 2006

The final (or close to it) version of the crawler should be complete.

The goal was to create a post-processor for Heritrix that would identify which pages on a given website are articles, then capture and index those articles. Because the websites and their article templates vary so much, there were two obvious ways to accomplish this. First, we could write a processor specific to each site. In effect, we'd have one processor that works for cnn.com, one for nytimes.com, one for foxnews.com, etc. While this would certainly work and would allow maximum flexibility, the overhead of coding and compiling each site-specific solution, along with the obvious poor scalability of the approach, forced us to look for another solution.

The second way, and the one we chose, was to push much of the post-processing out of Heritrix and into the news analyzer itself, leaving a more generic post-processor in Heritrix. This means we no longer have to recompile Heritrix every time we set it up for a new site. It also gives us a large cache of complete documents to experiment with while we work out the best way to analyze them, which matters because the HTML itself may turn out to be useful.

As of now, the crawler works as follows. Heritrix by itself supports URL filtering, so we examine each site and filter out the URLs that are guaranteed not to be articles or indices to other articles. These filters usually cover video directories, file extensions (.gif, .mov, etc.), and basic host names (filtering out things like ads.cnn.com). This filtering is entirely separate from our post-processor and is done only to speed up the crawl. Once the filters are in place, the crawl acts just like any other crawl. There is a nice diagram on the Heritrix site that shows the chain of processors that each URL (and its associated HTML) goes through. The majority of these are provided with the base Heritrix implementation (link extraction, frontier addition, etc.). Our processor runs near the end of the chain, in the post-processing section.
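To make the filtering step concrete, here is a rough sketch in Python of the kind of rules described above. It is only an illustration, not the Heritrix configuration itself: the function name, the rule lists, and the "/video/" prefix are all made up for the example, and the real rules are set per site in Heritrix.

```python
from urllib.parse import urlparse

# Hypothetical filter rules for illustration only; the real ones are
# configured per site inside Heritrix, not in code like this.
BLOCKED_EXTENSIONS = (".gif", ".mov")
BLOCKED_PATH_PREFIXES = ("/video/",)
BLOCKED_HOSTS = {"ads.cnn.com"}


def should_crawl(url: str) -> bool:
    """Return False for URLs that cannot be an article or an index of articles."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    path = parts.path.lower()
    if host in BLOCKED_HOSTS:
        return False  # ad servers and similar hosts
    if path.endswith(BLOCKED_EXTENSIONS):
        return False  # images, movies, and other non-HTML files
    if path.startswith(BLOCKED_PATH_PREFIXES):
        return False  # whole directories that never hold articles
    return True
```

Anything that passes a check like this simply continues down the normal processor chain.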

The NewsProcessor module takes the page it is given and first checks whether it is an article, using a regular expression that is supplied as an input parameter through the Heritrix UI. For example, in the case of CNN.com each article sits between a fixed pair of tags, so we can check whether those tags exist on the page using a regular expression of the form: opening tag, then ".+", then closing tag. In the past we used this match to capture only the text between those tags. Now, if we find the tags, we capture the entire page's HTML. Using this method we should be able to match any site's articles. Other sites will not necessarily mark their articles with a pair of tags like this, but because a regular expression is nice and general, we can just as easily match things like meta tags. Basically we have switched from a complicated question like "Determine if this page has an article and capture just that article if it does." to a simple yes/no question of "Does this page contain an article?"

If the page is determined to contain an article, it is captured verbatim into a local file (the location is defined by Heritrix UI input parameters). The next step of the processor is to index the page, which is done with Apache's Lucene. The contents of the page are cleaned of HTML tags and then indexed (but not stored) by Lucene.
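The detect-then-capture logic can be sketched roughly as follows. This is Python written only to illustrate the idea, not the NewsProcessor code: the function names, the file naming scheme, and the marker pattern used in the usage note are all assumptions. It also assumes the pattern should be allowed to span line breaks, which is why the search uses `re.DOTALL`.

```python
import hashlib
import re
from html.parser import HTMLParser
from pathlib import Path
from typing import Optional, Tuple


class _TextExtractor(HTMLParser):
    """Collects the text nodes of a page and ignores the tags and comments."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html: str) -> str:
    """Return the page text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def is_article(html: str, pattern: str) -> bool:
    """Answer the yes/no question: does this page contain an article?"""
    # DOTALL is an assumption, so that ".+" can cross line breaks.
    return re.search(pattern, html, re.DOTALL) is not None


def process_page(url: str, html: str, pattern: str,
                 out_dir: Path) -> Optional[Tuple[Path, str]]:
    """Save an article page verbatim and return (file path, text to index)."""
    if not is_article(html, pattern):
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, named by a hash of the URL.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = out_dir / name
    path.write_text(html, encoding="utf-8")
    return path, strip_tags(html)
```

A site configured with a made-up pattern such as `"<!-- article-start -->.+<!-- article-end -->"` would have every matching page written to disk in full, and only the tag-free text would be handed on to the indexer.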

This index can then be used to find these pages later on, either by URL or by file path. For example, if we wanted to find all of the articles associated with NASA, we would just run a query containing the word NASA and get back a list of the URLs (or file paths, or both) of the articles that contain that word.
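To show what "indexed but not stored" buys us, here is a toy inverted index in Python. It is a stand-in for the idea only and does not use or imitate Lucene's API; the class and method names are invented for the example. The index remembers which article each word came from (URL and file path) but keeps no copy of the text itself.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each term to the articles that contain it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, url: str, file_path: str, text: str) -> None:
        """Index the text of one article; the text itself is not kept."""
        for term in set(re.findall(r"[a-z0-9]+", text.lower())):
            self._postings[term].add((url, file_path))

    def search(self, query: str):
        """Return sorted (url, file_path) pairs containing every query term."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in terms))
        return sorted(hits)
```

Running `search("NASA")` against an index like this returns the URL and file path of each matching article, and the full page can then be read back from the captured file.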

The final test of the crawler on CNN.com is currently running. When it is finished and checks out, the other crawls of whatever news sites we choose can occur in parallel. Thus we should be set up nicely for the next phase of the project, the actual news analyzer.
