Finished the caching portion of the project. We ended up with the following cache of news articles:
CNN - 3,562 articles (out of 109,974 URLs seen) in 2 days 21 hours 24 minutes
New York Times - 397 articles (out of 93,580 URLs seen) in 1 day 22 hours 45 minutes
Yahoo! News - 4,121 articles (out of 69,367 URLs seen) in 14 hours 25 minutes 17 seconds
Notice how much the numbers differ from site to site, both in how many of the URLs seen turned out to be articles and in how long each crawl took. Possible reasons include the way each site stores its articles (splitting them up over multiple pages, access to archives, etc.) and how it provides access to them (hidden behind a login, cached on a different 'archive' site, etc.).
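To put the differences on a common scale, here is a rough sketch that turns the figures above into a yield (articles per URL seen) and a rate (articles per hour). The name `crawl_stats` and the `crawls` dictionary are made up for this illustration and are not part of the project code; only the numbers come from the crawl summary.

```python
from datetime import timedelta

# Figures copied from the crawl summary: (articles cached, URLs seen, crawl time).
crawls = {
    "CNN": (3562, 109_974, timedelta(days=2, hours=21, minutes=24)),
    "New York Times": (397, 93_580, timedelta(days=1, hours=22, minutes=45)),
    "Yahoo! News": (4121, 69_367, timedelta(hours=14, minutes=25, seconds=17)),
}


def crawl_stats(articles, urls_seen, elapsed):
    """Return (yield in percent, articles per hour) for one crawl."""
    hours = elapsed.total_seconds() / 3600
    return 100 * articles / urls_seen, articles / hours


for site, figures in crawls.items():
    yield_pct, per_hour = crawl_stats(*figures)
    print(f"{site}: {yield_pct:.2f}% of URLs were articles, {per_hour:.1f} articles/hour")
```

By this measure roughly 3.2% of the CNN URLs, 0.4% of the New York Times URLs, and 5.9% of the Yahoo! News URLs ended up as cached articles.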
Two interesting side notes about the Yahoo! News site. First, unlike the other two sites, the entire site was crawled. This suggests that Yahoo! tends to split its different services across separate sites rather than keeping them all under news.yahoo.com. Second, many of the articles pulled were really just quick blurbs that may have accompanied a picture, so its article count may be slightly inflated.
