spring break stuff
Here's what we're looking at for the progress report. I've finished a more bulletproof and altogether better blopex. It keeps track of term frequency for tf/idf calculation, does much better post detection and extraction, and creates its output in the format required by MEAD. It also has more built-in features for dealing with most of the kinds of noise particular to blogs: it handles timestamps, deletes followup navigation text, ignores empty posts, and guesses how likely a sentence is to be garbage. This new version warrants a new name: blimp (BLopex IMProved).
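For anyone curious what the term-frequency bookkeeping amounts to, here is a minimal sketch in Python. It is not blimp's actual code; the `TermStats` class, its method names, and the tokenizer regex are all made up for illustration. It just counts how often each term occurs and how many posts contain it, skipping empty posts along the way.

```python
import re
from collections import Counter


class TermStats:
    """Hypothetical bookkeeping for tf/idf; not blimp's real implementation."""

    def __init__(self):
        self.doc_count = 0          # number of non-empty posts seen
        self.term_freq = Counter()  # total occurrences of each term
        self.doc_freq = Counter()   # number of posts containing each term

    def add_post(self, text):
        # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        if not tokens:
            return  # ignore empty posts
        self.doc_count += 1
        self.term_freq.update(tokens)
        self.doc_freq.update(set(tokens))


stats = TermStats()
stats.add_post("the cat sat on the mat")
stats.add_post("the dog")
stats.add_post("")  # skipped
```

Keeping document frequency alongside raw term frequency means idf can be computed later without a second pass over the posts.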
I've also downloaded and tested
I've looked at the results of the crawl, and about 7-10% of our blogs are in Spanish. Filtering those out shouldn't be too tough; standard tf/idf methods should catch all of the offending pages.
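As a rough illustration of how cheap the filtering could be, here is a sketch that uses something even simpler than tf/idf: counting common function words from each language. The word lists are tiny and hand-picked, and the `looks_spanish` name is hypothetical, so treat this as a stand-in for the idea rather than the filter we'd actually run.

```python
import re

# Small illustrative stopword lists (an assumption; a real filter would use fuller ones).
SPANISH = {"el", "la", "los", "las", "de", "que", "y", "en",
           "un", "una", "es", "por", "con", "para", "no", "se"}
ENGLISH = {"the", "of", "and", "to", "in", "is", "that", "it",
           "for", "was", "on", "with", "as", "this", "not", "a"}


def looks_spanish(text):
    """Return True if Spanish function words outnumber English ones."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    spanish_hits = sum(w in SPANISH for w in words)
    english_hits = sum(w in ENGLISH for w in words)
    return spanish_hits > english_hits
```

A page with no hits in either list comes back as not Spanish, which errs on the side of keeping pages.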
Finally, I've started working on the add-ons necessary to get MEAD to do query-based summarization. That won't take too much, just a query generator. However, we ought to think about the best way to form queries, and about which additional sentence-ranking methods would be useful to have.
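To make the sentence-ranking question a bit more concrete, here is one possible baseline, sketched in plain Python: score each sentence by the cosine similarity between its tf/idf vector and the query's. This is not MEAD's interface or one of its features; `rank_sentences`, the tokenizer, and the smoothed idf formula are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, best match for the query first."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)

    # Document frequency: number of sentences containing each term.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())

    def idf(term):
        # Smoothed idf so terms unseen in the sentences still get a finite weight.
        return math.log((n + 1) / (df[term] + 1)) + 1.0

    def vectorize(counts):
        return {t: c * idf(t) for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vectorize(Counter(tokenize(query)))
    scored = [(cosine(q, vectorize(d)), s) for d, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = [
    "The new phone has a great camera",
    "I went hiking over spring break",
    "Camera quality on this phone is poor",
]
ranked = rank_sentences("phone camera", sentences)
```

Sentences that share no terms with the query score zero and sink to the bottom, so how we form the query matters a lot with a ranker like this.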
