Thursday, March 02, 2006

As stated previously, Heritrix is a large and complex system. However, it is designed specifically for modularity and is relatively easy to use. This combination cuts both ways. On one hand, it was easy to set up the system, add a post-processor and run a crawl. On the other hand, it is clear that integrating it into a full system will not be trivial. In the end it might turn out best to keep the crawling portion of our system almost entirely separate from the rest of it, using Heritrix to crawl news articles, index them with Lucene and cache them somewhere locally accessible for use with the inferencing system.

Mundane Heritrix Details

Working with Heritrix, while straightforward, started off slowly because it has two build modes. This isn't necessarily a bad thing, but it's important to be aware of. The first type of build is for deployment and is done outside Eclipse with a separate tool called Maven. Maven is a tool developed by Apache for coordinating builds of complex systems. It can be obtained from the Apache Maven site (make sure to get the 1.0.x version and not the 2.0.x version). It is easy to install and use, and in the case of Heritrix is invoked by running "maven dist" in the root of the Heritrix source tree.

The other type of build is the development build, which can be run directly from Eclipse. The distinction matters because Maven takes a considerable amount of time to build Heritrix, and the development process requires many builds. To enable the development build, Eclipse users must add the "-Dheritrix.development" option to the virtual machine arguments. After this is done, Heritrix can be built and run from Eclipse with full functionality.

Adding a post-processing module is quite easy. The development manual for Heritrix offers a nice processor example. Using that, it was straightforward to create a post-processor that captures the content of a page. At a high level, a processor is just a subclass of the Processor class. It implements a single method named "innerProcess" that does whatever we want it to do. To let the user choose this processor, we add it to the post-processor package and then tell Heritrix that it exists by modifying the src/conf/modules/Processor.options file. Once this is done and Heritrix is built and run, the module can be selected as a post-processor, and it will then act on every page that is downloaded.
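To make the shape of such a module concrete, here is a minimal, self-contained sketch. It does not use the real Heritrix classes: SketchProcessor is a hypothetical stand-in for the Heritrix base class, and its innerProcess(String, CharSequence) signature is an assumption chosen so the example compiles on its own. The real method signature and the registration step are described in the Heritrix developer manual.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the Heritrix processor base class.
// The real class and its method signatures differ; this only mirrors the idea.
abstract class SketchProcessor {
    private final String name;

    SketchProcessor(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }

    // The (stand-in) crawler calls this once per downloaded page.
    final void process(String uri, CharSequence content) {
        innerProcess(uri, content);
    }

    // The single method a concrete processor fills in.
    protected abstract void innerProcess(String uri, CharSequence content);
}

// A post-processor that simply captures the content of each page it sees.
final class ContentCapturePostProcessor extends SketchProcessor {
    private final Map<String, String> pages = new LinkedHashMap<>();

    ContentCapturePostProcessor() {
        super("ContentCapture");
    }

    @Override
    protected void innerProcess(String uri, CharSequence content) {
        // Keep the page body keyed by URI so later stages can use it.
        pages.put(uri, content.toString());
    }

    Map<String, String> capturedPages() {
        return pages;
    }
}
```

In the real system, the body of innerProcess is where the article matching and the hand-off to the index or cache would go.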

Matching Articles on CNN.com

As mentioned previously, CNN.com wraps its articles in specific tags (the comments endclickprintexclude and endclickprintinclude specifically), so we are able to use Java regular expressions to match those tags and pull out the articles. We can also use Java regular expressions to strip out all the HTML tags within the article itself. Whether this is the best approach for a final product is a question that has yet to be answered. Ideally, an external template would be read in at run-time so that each seed has its own article matcher, allowing a user to add a new news site without having to recompile the crawler. Whatever we decide in the end, though, writing a simple method to match articles for each of our seed hosts should be sufficient for now.
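A rough sketch of this kind of matcher follows. It assumes the two markers appear as HTML comments of the form <!--endclickprintexclude--> and <!--endclickprintinclude--> and that the article body sits between them; the class and method names are made up for illustration and this is not the exact code used in the crawler.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative article matcher for one seed host (names are hypothetical).
final class CnnArticleMatcher {
    // Assumption: the markers are HTML comments and the article lies between them.
    private static final Pattern ARTICLE = Pattern.compile(
            "<!--\\s*endclickprintexclude\\s*-->(.*?)<!--\\s*endclickprintinclude\\s*-->",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Naive tag stripper: anything between '<' and the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private CnnArticleMatcher() {
    }

    // Returns the article text with tags removed, or empty if no markers were found.
    static Optional<String> extract(String html) {
        Matcher m = ARTICLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        // Replace each tag with a space so adjacent paragraphs do not run together.
        String withoutTags = TAG.matcher(m.group(1)).replaceAll(" ");
        return Optional.of(WHITESPACE.matcher(withoutTags).replaceAll(" ").trim());
    }
}
```

Stripping tags with a regular expression is crude (it does not decode entities or handle a '>' inside an attribute value), which is part of why a per-site template read at run-time may be the better long-term design.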

The article matcher is working for CNN right now. Other sites will be added in the near future.
