thinking like a detective


When reading spammed newsgroups, we're trying to find articles that interest us, while the spammers are trying to expose us to their junk and thereby drive us mad. The same problem occurs in many guises: spam email, junk snail mail, telemarketer phone calls, radio ads, tv ads, junk tv channels (or should that be simply, "tv channels"?). The problem is how to identify them without having to be exposed to their drivel first.

Imagine a room with a million chutes. Each chute is a content originator, and anything at all may come sliding down any of them. The problem is to barricade the chutes that are likely to produce only junk in the future.

The job is made harder because

* new chutes may be added at any time,
* some new chutes may deliver good stuff, and
* junk chutes are actively making themselves look as much like good chutes as they can.
So we can't simply wait some amount of time and then barricade only the chutes that have produced junk in the past.

The task is to identify chutes that the user

* definitely likes,
* definitely dislikes, or
* doesn't yet know enough about to like or dislike.

Given an email message or a news article, imagine a huge table of attributes:

* who's it from?
* when did it come in?
* how big is it?
* what's its title?
* what's it about?
* what property does it not have? (for example, spammers have started making their userids nonsense words or non-English words, and disguising their point of origin to defeat current simple attempts to delete their articles.)
* how many other articles has that chute produced?
* how many of them have been bad?
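
To make the table concrete, here is a minimal Python sketch of pulling a few of these attributes out of a single message. The particular field names and the choice of attributes are mine, picked for illustration; anything else about a message that can be compared across messages could become another column.

import email
from email.utils import parseaddr

def extract_attributes(raw_message):
    """Pull a handful of clue-bearing attributes out of one message.

    The choice of attributes is illustrative; any property of a message
    that can be compared across messages could become another column.
    """
    msg = email.message_from_string(raw_message)
    return {
        "from": parseaddr(msg.get("From", ""))[1],  # who's it from?
        "date": msg.get("Date", ""),                # when did it come in?
        "size": len(raw_message),                   # how big is it?
        "subject": msg.get("Subject", ""),          # what's its title?
        "newsgroups": msg.get("Newsgroups", ""),    # which chute delivered it?
    }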

The attributes link to a huge pool of things, each of which has at least one of the attributes in the table. The objective is to mark that pool of things red or blue. Each time something is marked red (for bad), everything that shares an attribute value with it is also marked a little red. Things that end up very red are all bad, and the things related to them in turn get marked a little redder.

Similarly, things can be marked blue (for good), and everything that shares an attribute value with such a thing is also marked a little bluer.

If a particular chute produces an article that the user has liked or disliked, then all future articles from that chute should be tinged with the appropriate color.
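
A minimal sketch of that coloring, assuming a single numeric score per item (positive for blue, negative for red) and a fixed "bleed" constant for how strongly a mark tints its neighbors, could look like this in Python; the essay leaves the exact strengths open, so those numbers are mine.

from collections import defaultdict

def propagate_color(items, marks, bleed=0.25):
    """Spread red/blue marks through shared attribute values.

    items : {item_id: {attribute_name: value}}
    marks : {item_id: +1.0 for blue (good), -1.0 for red (bad)}
    bleed : assumed constant for how strongly a mark tints the
            other items that share one of its attribute values.

    Returns {item_id: score}; strongly negative means "very red".
    """
    # Index which items carry each (attribute, value) pair.
    carriers = defaultdict(set)
    for item_id, attrs in items.items():
        for name, value in attrs.items():
            carriers[(name, value)].add(item_id)

    scores = defaultdict(float, marks)
    for item_id, mark in marks.items():
        for name, value in items[item_id].items():
            for other in carriers[(name, value)]:
                if other != item_id:
                    scores[other] += bleed * mark  # tint it a little
    return dict(scores)

With marks = {"a1": -1.0}, every item that shares a chute, a subject, or any other attribute value with a1 comes out a little redder, which is the tinging of future articles described above.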

Like detectives, we can look at the values of various attributes in this table as clues, from which we must derive circumstantial evidence in a case for or against a particular chute.

Note that clues must be linked to the articles that carry them with different weights. That two news articles or email messages or webpages share the same length, for example, means little unless that property (length in this case) is distributed very differently among the articles the user likes and the ones the user doesn't.

On the other hand, if:

* both articles come from the same chute,
* that chute has produced no good articles in some amount of time,
* that chute has produced a lot of articles, and
* a lot of those articles have been tested,
then the fact that the two articles are from the same chute is strong evidence that the new one is a waste of time and should be discarded without bothering the user with it.
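
One way to turn that kind of circumstantial evidence into numbers is to weight each clue value by how unevenly it is split between liked and disliked articles. The smoothed log-odds form below is my assumption, not something the essay prescribes.

import math

def clue_weight(value, liked_counts, disliked_counts, smoothing=1.0):
    """How much evidence does one clue value carry?

    liked_counts and disliked_counts map a clue value (a particular
    chute, a length bucket, a word in the title, ...) to how many
    liked or disliked articles carried it.  The Laplace-style
    smoothing constant is an assumption of this sketch.
    """
    liked = liked_counts.get(value, 0) + smoothing
    disliked = disliked_counts.get(value, 0) + smoothing
    total_liked = sum(liked_counts.values()) + smoothing
    total_disliked = sum(disliked_counts.values()) + smoothing
    # Positive: evidence against (red); negative: evidence for (blue).
    return math.log((disliked / total_disliked) / (liked / total_liked))

A value seen about equally often on both sides gets a weight near zero; a value seen almost only among rejected articles, such as a chute with many tested articles and no good ones, gets a large positive weight, which is the circumstantial case described above.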

Chutes produce articles, articles have clues, those clues have weights. The task is to decide which chutes are good and which are bad by examining some or all of them for some amount of time each, which may vary depending on the chute and the previous ratio of good to bad articles.

We can also treat an entire newsgroup like a chute; it too is an information source. Imagine a collection of little thermometers next to each newsgroup showing volume and ratio of accepted to rejected articles.
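
A sketch of such a score and its thermometer follows; the pseudo-counts, which keep a barely tested chute looking "unknown" rather than good or bad, and the text-bar display are my own choices.

def chute_score(good, bad, prior_good=1.0, prior_bad=1.0):
    """Estimate how likely the next article from this chute is junk.

    good and bad count the articles from the chute that the user has
    already judged.  The pseudo-counts act as a prior: a barely tested
    chute stays near 0.5 (unknown), while a heavily tested, all-bad
    chute approaches 1.0 (barricade it).
    """
    return (bad + prior_bad) / (good + bad + prior_good + prior_bad)

def thermometer(name, good, bad, width=20):
    """One-line thermometer for a chute or newsgroup: volume and ratio."""
    volume = good + bad
    junk = chute_score(good, bad)
    bar = "#" * int(round(junk * width))
    return f"{name:24s} vol={volume:5d}  junk={junk:4.2f} |{bar:<{width}}|"

# For example: print(thermometer("alt.binaries.junk", good=2, bad=148))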

Suppose we have n items and each item has m dimensions (or attributes), so we have an n × m table of (orderable) attribute values. Suppose also that k of the n items have been colored blue (for "good") or red (for "bad"). The task is to assign a color to all the other items based on how "far" they are from the marked items. Here's one algorithm:

For each of the m dimensions:
	Sort the n items on that dimension.
	For each marked item:
		For each of the n-1 other items:
			Using the color of the marked item,
			color each item proportional to
			its distance away along that dimension.

This costs O(mn(lg n + k)) operations, which is O(mn²) since k ≤ n.
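
A direct Python rendering of this first algorithm might look like the sketch below. The essay says to color each item "proportional to its distance away"; I read that as the tint fading as the distance grows, and I measure distance by rank along the sorted dimension. Both readings are assumptions of the sketch.

def color_by_dimension(values, marked, scale=1.0):
    """First algorithm: one coloring pass per dimension.

    values : list of n rows, each a list of m orderable attribute values
    marked : {item_index: color}, +1.0 for blue (good), -1.0 for red (bad)
    scale  : assumed decay constant for how fast the tint fades.

    Returns a list of n accumulated scores (positive blue, negative red).
    """
    n = len(values)
    m = len(values[0]) if n else 0
    scores = [0.0] * n

    for dim in range(m):                              # each of the m dimensions
        order = sorted(range(n), key=lambda i: values[i][dim])
        rank = {item: pos for pos, item in enumerate(order)}
        for item, color in marked.items():            # each marked item
            for other in range(n):                    # each of the n-1 others
                if other == item:
                    continue
                distance = abs(rank[other] - rank[item])
                scores[other] += color * scale / (1.0 + distance)
    return scores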

Or we could do the following:

For each of the (n-k) unmarked items:
	For each of the k marked items:
		Calculate the m-distance between the two
		using all m dimensions.

		Using the color of the marked item,
		color the unmarked item
		proportional to its m-distance away.

This costs O((n - k)km) operations, which is also O(mn²).
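
The pairwise version can be rendered the same way; taking the m-distance to be Euclidean distance over the m dimensions, and again letting the tint fade with distance, are assumptions of the sketch.

import math

def color_by_pairs(values, marked):
    """Second algorithm: color each unmarked item from every marked one.

    values : list of n rows, each a list of m numeric attribute values
    marked : {item_index: color}, +1.0 for blue (good), -1.0 for red (bad)
    """
    n = len(values)
    scores = {i: 0.0 for i in range(n) if i not in marked}

    for i in scores:                                   # each unmarked item
        for j, color in marked.items():                # each marked item
            distance = math.sqrt(sum((a - b) ** 2
                                     for a, b in zip(values[i], values[j])))
            scores[i] += color / (1.0 + distance)      # nearer, stronger tint
    return scores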

Note: this problem of detecting spam is very similar to the problem our immune system must solve every minute of every day. An organism, or even just a molecule, is floating in the bloodstream---is it good or bad? Is it part of me or is it some invader?


