Crawling done.
The crawl finished early this morning. We now have about 10,000 blog pages, around 500 MB of HTML, downloaded slowly over about 4 days. That's probably much more than we need, but it means we can afford the luxury of throwing away blogs that are especially hard to parse.
The crawl frontier still held something like 50,000 pages when it stopped, so there's plenty more out there if we need it.
I'm running everything through blopex now. After that I may write a quick Perl script that gives us some metrics on post length, to tell how well the extraction worked. I'm also keeping the original data around in case we need to re-extract.
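
For the post-length check, something along these lines might do. It's only a rough sketch, written in Python rather than Perl, and it assumes the extracted posts are already available as plain-text strings; how blopex actually hands back its output isn't covered here. The function name post_length_metrics and the min_words cutoff are made up for illustration.

```python
import statistics


def post_length_metrics(posts, min_words=20):
    """Summarise word counts for a list of extracted post texts.

    `posts` is assumed to be a list of plain-text strings, one per post.
    `min_words` is an arbitrary cutoff for flagging suspiciously short posts.
    """
    # Word count per post, splitting on any whitespace.
    lengths = [len(text.split()) for text in posts]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0,
                "mean": 0.0, "median": 0.0, "short": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        # Posts below the cutoff are likely extraction failures.
        "short": sum(1 for n in lengths if n < min_words),
    }
```

A high share of "short" posts, or a median far below what a typical blog post looks like, would hint that the extraction is cutting content off or grabbing the wrong part of the page.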
