Web Mining Project Blog

Wednesday, April 20, 2005

Now with 50% more updates!

We've gotten enough information from our user study to make some preliminary conclusions, though most aren't statistically significant. We presented our work in the NaN meeting on Monday, and it seemed to go well... so maybe we can use something close to that presentation when we present to the class. That means all we have to do is come up with the paper, and the hardest thing about that will be deciding what not to write about.

Things are going well.

Monday, April 11, 2005

Updates

Just updating since we haven't in a while. We've got everything *almost* ready to roll out our user study. What hung us up at the last minute were some snags we ran into when scaling our algorithms up to the full data set: small, information-poor blog posts turned out to be more of a problem than we anticipated. This wasn't too hard to fix, but we ended up having to re-extract and re-index everything, which ate up compute time. Now that this is worked out, we just need to re-run the summarization stuff and regenerate the user study data, and then we should be good to go. Watch this space!

Saturday, March 19, 2005

spring break stuff

Here's what we're looking at for the progress report: I've finished a more bulletproof and altogether better blopex. It keeps track of term frequency for tf/idf calculation, does much better post detection and extraction, and creates its output in the format required by MEAD. It also has more features built in to deal with most of the kinds of noise particular to blogs: dealing with timestamps, deleting followup navigation text, ignoring empty posts, and guessing the likelihood of a sentence being garbage. This new version warrants a new name, blimp (BLopex IMProved).
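
For reference, here is a minimal sketch of the tf/idf weighting that those term-frequency counts feed into. It is written in Python for illustration and is not blimp's actual code; the function name `tf_idf` and the plain `tf * log(N / df)` weighting are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc.

    Weight = raw term count * log(N / document frequency). This is one common
    variant, chosen here as an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights
```

A term that shows up in every document gets weight zero under this variant, which is the property that makes it useful for spotting what is distinctive about a post.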

I've also downloaded and tested cidr, the MEAD clustering tool, which we'd hoped would take care of document clustering for us. I've looked at the implementation, and cidr appears unsuitable for our clustering application. So we'll have to write our own document clustering module (or we could wait for the blog clustering project to finish, and use their system :). Alternatively, we may want some sort of query-based clustering. After all, we care more that our clusters agree relative to the query, and less about agreement in feature spaces which have nothing to do with the query. This initial clustering is crucial, because without it, our summaries will be incoherent.

I've looked at the results of the crawl, and about 7-10% of our blogs are in Spanish. Filtering those out shouldn't be too tough; standard tf/idf methods should catch all of the offending pages.
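
As a rough illustration of how cheap this kind of filter can be, here is a Python sketch that uses a stopword count as a simpler stand-in for the tf/idf approach mentioned above. The word lists and the name `looks_spanish` are made up for the example; a real filter would want longer lists and a confidence margin.

```python
import re

# Tiny illustrative stopword lists (assumptions, not a vetted resource).
EN_STOP = {"the", "and", "of", "to", "in", "is", "that", "it", "was", "for", "with", "on"}
ES_STOP = {"el", "la", "los", "las", "de", "que", "y", "en", "es", "un", "una", "por", "con", "para"}

def looks_spanish(text):
    """Return True when Spanish stopwords outnumber English ones in text."""
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    english_hits = sum(w in EN_STOP for w in words)
    spanish_hits = sum(w in ES_STOP for w in words)
    return spanish_hits > english_hits
```

Pages with no stopword hits at all come back as not Spanish, so empty or garbage pages would still need their own check.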

Finally, I've started working on the add-ons necessary to get MEAD to do query-based summarization. That won't take too much, just a query generator. However, we ought to think about the best way to form queries, and about which additional sentence-ranking methods would be useful to have.

Sunday, March 13, 2005

Summarizing

I've gotten the MEAD summarizer working on some hand-clustered groups of documents, but it's difficult to know how well it's actually doing on this data, and how well it will scale when we apply it to larger clusters. However, this gives us some ideas as to which sentence ranking and reranking methods we ought to try. Maybe it'll also be a good idea to try slicing up posts into paragraphs, and considering all of them as separate documents, since that could give us more flexibility to deal with posts which cover many topics.
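
The paragraph-slicing idea could look something like the Python sketch below. It assumes paragraphs are separated by blank lines in the extracted text, and the `post_id.N` naming scheme is hypothetical.

```python
import re

def split_paragraphs(post_id, text):
    """Split one post into (document_id, paragraph) pairs.

    Paragraphs are taken to be runs of text separated by one or more blank
    lines; each paragraph gets an id of the form '<post_id>.<n>', n from 1.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [(f"{post_id}.{i}", p) for i, p in enumerate(paragraphs, start=1)]
```

Each pair can then be treated as its own document for clustering, while the id keeps the link back to the original post.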

Monday, March 07, 2005

Crawling done.

The crawling finished early this morning. We're now in possession of about 10,000 blog pages consisting of around 500 MB of HTML, downloaded slowly over about 4 days. Probably much more than we need - but now we can afford the luxury of throwing away blogs that are especially hard to parse.

The crawl frontier contained something like 50,000 pages when it stopped, too, so there's always more out there if we need it.

I'm running everything through blopex now. Then maybe I'll come up with a quick perl script that gives us some metrics on post length to tell how well the extraction worked. I'll be keeping the original data around as well in case we need to re-extract.
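
A sketch of the kind of post-length metrics that could flag bad extractions, written in Python here rather than perl. The function name, the word-count measure, and the `short_threshold` default are all assumptions for illustration.

```python
import statistics

def post_length_metrics(posts, short_threshold=20):
    """Summarize post lengths, measured in whitespace-separated words.

    A high short_fraction, or a very large max, hints that post boundaries
    were detected badly (over-splitting or under-splitting).
    """
    lengths = [len(p.split()) for p in posts]
    if not lengths:
        return {"count": 0}
    return {
        "count": len(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "short_fraction": sum(n < short_threshold for n in lengths) / len(lengths),
    }
```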

Saturday, March 05, 2005

docsent converter

We now have a converter for marking up text files for input into MEAD: post2docsent.pl. The next step is to implement some new sentence-ranking metrics, maybe using some query-based rankers.

Sunday, February 27, 2005

gobbler

(Sort-of) finished a preliminary version of a blog-crawler which we're calling "gobbler". It crawls pages and downloads only links of the following forms:

  • www.livejournal.com/users/somebody
  • somebody.livejournal.com
  • somebody.blogspot.com

Of course, links to specific posts etc. are automatically stripped down to a link to the blog itself. This won't catch blogs that aren't hosted on these standard sites, but we're hoping enough blogs are hosted there that this limitation won't be serious.
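
The stripping-down step could be sketched roughly as follows. This is a Python illustration, not gobbler's actual code; `blog_root` and `SUBDOMAIN_HOSTS` are names made up for the example, and only the three link forms listed above are handled.

```python
import re
from urllib.parse import urlparse

# Hosts where the subdomain names the blog (forms 2 and 3 above).
SUBDOMAIN_HOSTS = ("livejournal.com", "blogspot.com")

def blog_root(url):
    """Return the blog's root URL for an accepted link, or None otherwise."""
    if "://" not in url:
        url = "http://" + url  # tolerate scheme-less links
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Form 1: www.livejournal.com/users/somebody[/anything]
    if host == "www.livejournal.com":
        match = re.match(r"^/users/([^/]+)", parsed.path)
        if match:
            return f"http://www.livejournal.com/users/{match.group(1)}"
        return None
    # Forms 2 and 3: somebody.livejournal.com, somebody.blogspot.com
    for base in SUBDOMAIN_HOSTS:
        suffix = "." + base
        if host.endswith(suffix):
            user = host[: -len(suffix)]
            if user and "." not in user and user != "www":
                return f"http://{user}{suffix}"
    return None
```

Keeping a set of the roots already seen would then stop the crawler from queueing the same blog once per post link.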

Now to clean it up a little.

blopex updates

I was trying to make blopex smart enough to check which date format to choose by looking at which format was most common in a page, but that may backfire due to links to archived blog entries. So instead, it delimits on every date format it can find. Also, it tries to delimit on time, to account for multiple posts per day. This is about the final version, since I need to start working on sentence rankers for the summarizing tool, and post clustering.
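
To make the "delimit on everything" approach concrete, here is a rough Python sketch. The patterns are a small hypothetical sample, not the set blopex actually uses.

```python
import re

# Hypothetical sample of date formats; real pages would need more.
DATE_PATTERNS = [
    r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day, [A-Z][a-z]+ \d{1,2}, \d{4}",  # Monday, March 07, 2005
    r"\d{4}-\d{2}-\d{2}",      # 2005-03-07
    r"\d{1,2}/\d{1,2}/\d{4}",  # 3/7/2005
]
# Times separate multiple posts made on the same day.
TIME_PATTERN = r"\d{1,2}:\d{2}\s?(?:AM|PM|am|pm)"  # 10:42 PM

# Non-capturing groups only, so re.split drops the delimiters themselves.
DELIMITER = re.compile("|".join(f"(?:{p})" for p in DATE_PATTERNS + [TIME_PATTERN]))

def split_posts(text):
    """Cut text at every date or time match and discard empty chunks."""
    return [chunk.strip() for chunk in DELIMITER.split(text) if chunk.strip()]
```

The obvious cost is that a date or time mentioned inside a post body also causes a split, so this trades some over-splitting for not having to guess the page's format.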