Quick update. We've been pulling down packages and examining the relevant frameworks over the past week. Our crawling needs are fairly basic; we'll be pulling news articles from a limited number of sources over a given timeframe, with a single thread focusing on each host.
As noted previously, we decided on Heritrix for simplicity and speed of implementation - although the package itself is huge, with an elaborate API. It's well documented; we won't have to worry about little parsing issues or focus on modularity - it's all there. It interfaces nicely with Lucene (both are Java-based), has an easily configurable crawl scope, and once you let it loose you can track it through a web interface.
Additionally, we've been examining the article layout of major news sites for post-processing. Ideally, a robust implementation would have a large set of definitions for locating and parsing the relevant article header/body/date text. It would probably be fast to store everything as XML and do some kind of template matching. For now, however, we've been building up a list of site-specific labels that we can use to quickly extract a clean corpus.
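As a rough illustration of what such a list of site-specific labels might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `SITE_LABELS` table, the `labels_for` and `extract_field` helpers, the `example-news.test` host, and the exact spelling of the markers are hypothetical and are not our actual definitions.

```python
from urllib.parse import urlparse

# Hypothetical registry: host suffix -> field name -> (start marker, end marker).
# The marker strings are placeholders; real ones must be read off each site's HTML.
SITE_LABELS = {
    "cnn.com": {
        "header": ("<!--startclickprintinclude-->", "<!--startclickprintexclude-->"),
    },
    "example-news.test": {
        "body": ('<div id="story">', "</div>"),
    },
}


def labels_for(url):
    """Return the label table for the URL's host, or None if the site is unknown."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, labels in SITE_LABELS.items():
        # Match the bare domain as well as subdomains such as www.
        if host == suffix or host.endswith("." + suffix):
            return labels
    return None


def extract_field(html, start, end):
    """Return the text between the first start marker and the next end marker."""
    i = html.find(start)
    if i < 0:
        return None
    i += len(start)
    j = html.find(end, i)
    if j < 0:
        return None
    return html[i:j].strip()
```

A plain string search like this is only good enough while each site uses fixed, literal markers; anything looser would need the template matching mentioned above.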
CNN, for example, provides hooks for locating the relevant printable portions of a page. Article header info (title, date, etc.) always appears between startclickprintinclude and startclickprintexclude comments, and the article body has similar extraction points. Automatically generated dates appear as plaintext written out with a JavaScript document.write call. So it looks like, for the most part, stripping out irrelevant page content will be trivial.
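To make the CNN case concrete, here is a small Python sketch of pulling out the regions between those comments. It assumes the markers are plain HTML comments of the form `<!--startclickprintinclude-->` and `<!--startclickprintexclude-->`; the exact spelling on the live pages may differ, and the `extract_printable` function name is ours, made up for this example. Tag stripping is done with a crude regex, which is fine for a quick corpus but is not a real HTML parser.

```python
import re

# Assumed marker syntax: HTML comments wrapping the names mentioned above.
INCLUDE = re.compile(r"<!--\s*startclickprintinclude\s*-->", re.IGNORECASE)
EXCLUDE = re.compile(r"<!--\s*startclickprintexclude\s*-->", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")  # crude tag stripper, not a full HTML parser


def extract_printable(html):
    """Return the text of each include..exclude region, with tags removed."""
    regions = []
    pos = 0
    while True:
        start = INCLUDE.search(html, pos)
        if start is None:
            break
        end = EXCLUDE.search(html, start.end())
        # If the closing marker is missing, take everything to the end of the page.
        stop = end.start() if end else len(html)
        chunk = html[start.end():stop]
        # Replace tags with spaces, then collapse runs of whitespace.
        text = " ".join(TAG.sub(" ", chunk).split())
        if text:
            regions.append(text)
        if end is None:
            break
        pos = end.end()
    return regions
```

Called on a page with a header region and a body region, this returns one cleaned string per region, in page order. Dates emitted through document.write would still show up as script source rather than text, so they need separate handling.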
