Tonight we started a crawl using Heritrix with a customized but generic processor that takes regular expressions as input parameters. These regular expressions are meant to pick out both the article text and the date the article was written (though the date extraction is flaky at the moment).
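To give a rough idea of how a regex-parameterized extractor works, here is a minimal sketch in Python. It is not our actual Heritrix processor (Heritrix is written in Java), and the function name, the sample HTML, and both patterns are made up for illustration. The only assumption carried over from the crawl is that each pattern has one capture group holding the value we want.

```python
import re
from typing import Dict, Optional


def extract_article(html: str, text_pattern: str, date_pattern: str) -> Dict[str, Optional[str]]:
    """Apply two caller-supplied regexes to a page.

    Each pattern must contain one capture group. A field is None when
    its pattern does not match, which makes missing dates easy to spot.
    """
    flags = re.DOTALL | re.IGNORECASE  # let '.' span newlines, ignore tag case
    text_match = re.search(text_pattern, html, flags)
    date_match = re.search(date_pattern, html, flags)
    return {
        "text": text_match.group(1).strip() if text_match else None,
        "date": date_match.group(1).strip() if date_match else None,
    }


# Hypothetical page and patterns; real news pages will need different ones.
SAMPLE_HTML = """<html><body>
<span class="date">January 5, 2006</span>
<div class="article">First paragraph.
Second paragraph.</div>
</body></html>"""

TEXT_PATTERN = r'<div class="article">(.*?)</div>'
DATE_PATTERN = r'<span class="date">(.*?)</span>'

result = extract_article(SAMPLE_HTML, TEXT_PATTERN, DATE_PATTERN)
print(result)
```

Returning None for a field that did not match, rather than raising an error, means a page with a missing or oddly formatted date still yields its article text, and the gaps can be counted afterwards.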
This crawl is limited in scope and has two goals. The first is to test the efficiency and capability of the crawler itself. To this end, it is only crawling CNN.com and saving the articles in a format that is not necessarily final. Eventually the crawl will be expanded to a wider selection of news sites, and the crawling period will be extended as well. The second goal is to give us a selection of articles with which to start testing some of the other tools we are looking to use in our project.
