Web Mining Project Blog

Saturday, March 19, 2005

spring break stuff

Here's what we're looking at for the progress report. I've finished a more bulletproof and altogether better blopex. It keeps track of term frequency for tf/idf calculation, does much better post detection and extraction, and creates its output in the format required by MEAD. It also has more features built in to deal with most of the kinds of noise particular to blogs: handling timestamps, deleting followup navigation text, ignoring empty posts, and guessing the likelihood of a sentence being garbage. This new version warrants a new name, blimp (BLopex IMProved).
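For anyone following along, here is a rough sketch of the kind of term-frequency bookkeeping that tf/idf needs. It is not blimp's actual code: the names `tokenize` and `tf_idf`, the tokenizer regex, and the choice of raw counts times log(N / df) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tf_idf(posts):
    """Return one {term: weight} dict per post.

    tf is the raw count of the term in the post; idf is log(N / df),
    where N is the number of posts and df is the number of posts that
    contain the term. This is one common variant, assumed here.
    """
    term_counts = [Counter(tokenize(post)) for post in posts]

    # Document frequency: in how many posts does each term appear?
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    n = len(posts)
    return [
        {term: count * math.log(n / doc_freq[term]) for term, count in counts.items()}
        for counts in term_counts
    ]


posts = ["the cat sat", "the dog sat down", "the cat"]
weights = tf_idf(posts)
```

A term that shows up in every post (like "the" above) gets weight zero, which is the property that makes tf/idf useful for separating content words from filler.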

I've also downloaded and tested cidr, the MEAD clustering tool, which we'd hoped would take care of document clustering for us. I've looked at the implementation, and cidr appears unsuitable for our clustering application. So we'll have to write our own document clustering module, or we could wait for the blog clustering project to finish and use their system :) Alternatively, we may want to have some sort of query-based clustering. After all, we care more that our clusters agree relative to the query, and less about agreement in feature spaces which have nothing to do with the query. This initial clustering is crucial, because without it, our summaries will be schizophrenic.

I've looked at the results of the crawl, and about 7-10% of our blogs are in Spanish. Filtering those out shouldn't be too tough; standard tf/idf methods should catch all of the offending pages.
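As a sanity check on how easy the filtering might be, here is a minimal sketch. Note that it swaps in a plain stopword-count heuristic rather than a real tf/idf method, and the two word lists and the function name `looks_spanish` are hypothetical and far from complete.

```python
import re

# Small hand-picked stopword lists, assumed for illustration only.
ENGLISH = {"the", "and", "of", "to", "is", "in", "that", "it", "was", "for", "with", "this"}
SPANISH = {"el", "la", "los", "las", "de", "que", "y", "en", "es", "un", "una", "por", "con", "para"}


def looks_spanish(text):
    """Guess that a page is Spanish if it has more Spanish than English stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented words stay whole.
    words = re.findall(r"[^\W\d_]+", text.lower())
    spanish_hits = sum(word in SPANISH for word in words)
    english_hits = sum(word in ENGLISH for word in words)
    return spanish_hits > english_hits


pages = [
    "El gato es de la casa y que bueno",
    "The cat is in the house and it was fine",
]
english_only = [page for page in pages if not looks_spanish(page)]
```

Ties (including empty pages) are kept as English here, on the assumption that wrongly dropping an English blog is worse than letting a stray Spanish one through.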

Finally, I've started working on the add-ons necessary to get MEAD to do query-based summarization. That won't take too much, just a query generator. However, we ought to think about the best way to form queries, and about which additional sentence-ranking methods would be useful for our purposes.
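To make the sentence-ranking question a bit more concrete, here is one candidate method sketched out: score each sentence by its cosine similarity to the query over simple bag-of-words vectors. This is not MEAD's interface or one of its built-in features; `bag_of_words`, `cosine` and `rank_sentences` are made-up names, and unweighted word counts are an assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens in the text."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, most query-similar first."""
    query_vector = bag_of_words(query)
    scored = [(cosine(query_vector, bag_of_words(s)), s) for s in sentences]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = [
    "I love my new camera lens",
    "The weather was rainy",
    "camera prices dropped",
]
ranked = rank_sentences("camera lens", sentences)
```

Swapping the raw counts for tf/idf weights would be the obvious next step, since it stops common words from dominating the score.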
