
Webpage Beauty Contests

Each webpage can be thought of as voting on the importance of every other webpage, depending on whether it links to that page or is linked to by it. More generally, one page can be thought of as voting for another depending on whether the two pages have anything in common (proper names, keyword frequencies, and so on). In this way, Known Space is trying to find suitable contestants for a beauty contest that is continuously judged by the webpages the user has already valued enough to make home documents.
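The voting idea can be made concrete with a small sketch. It is only an illustration: the function name, the data layout, and the weights (one full vote for a link in either direction, a partial vote equal to the Jaccard overlap of shared terms) are assumptions, not something Known Space specifies.

```python
def vote_scores(home_docs, candidates, links, features):
    """Score each candidate page by the votes it gets from home documents.

    links:    dict mapping a page to the set of pages it links to.
    features: dict mapping a page to a set of terms (names, keywords).
    Both layouts and both weights below are illustrative assumptions.
    """
    scores = {}
    for cand in candidates:
        total = 0.0
        for home in home_docs:
            # A link in either direction counts as one full vote.
            if cand in links.get(home, set()) or home in links.get(cand, set()):
                total += 1.0
            # Shared terms count as a partial vote (Jaccard overlap, 0 to 1).
            a = features.get(home, set())
            b = features.get(cand, set())
            if a or b:
                total += len(a & b) / len(a | b)
        scores[cand] = total
    return scores


links = {"home1": {"p"}, "q": {"home1"}}
features = {
    "home1": {"whale", "ship"},
    "p": {"whale"},
    "q": {"car"},
    "r": {"whale", "ship"},
}
scores = vote_scores(["home1"], ["p", "q", "r"], links, features)
```

Here page p wins because it is both linked from the home document and shares a term with it, while r earns its vote purely from having the same terms.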

Known Space can estimate webpage interestingness in many ways: from documents the user finds interesting because they're in the user's webhome, from areas the user has explicitly told the system are of interest, from pages that people the user admires find interesting, and from all the neighbors of those pages. Also, "neighbors" can be interpreted not just in terms of webpages, but in terms of other things too: for example, if a user expresses interest in whales, Known Space could interpret that as (some) interest in mammals, in sea creatures, in whaling, and so on (of course, to do so Known Space would first need a semantic net linking all these topics). Naturally, not all of these topics should be searched for, since we would then be back to the original problem of sifting wheat from chaff. Instead, Known Space should look for reinforcements of interest. So if at one time it decides that the user is interested in ships, that should activate whaling just a little. If it later deduces an interest in whales, that should increase the interestingness of whaling, and so on.
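One way to picture this reinforcement is as activation spreading through the semantic net. The sketch below is a guess at the mechanics, not a description of Known Space itself: the net, the decay factor, and the hop limit are all made-up values, and the function name is hypothetical.

```python
from collections import deque


def activate(activation, net, topic, amount=1.0, decay=0.5, max_hops=2):
    """Add interest to `topic` and a decayed share to nearby topics.

    net: dict mapping a topic to the topics it is linked to.
    Each topic within `max_hops` receives amount * decay ** hops, once per call.
    """
    queue = deque([(topic, 0)])
    seen = {topic}
    while queue:
        node, hops = queue.popleft()
        activation[node] = activation.get(node, 0.0) + amount * decay ** hops
        if hops < max_hops:
            for neighbor in net.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append((neighbor, hops + 1))
    return activation


# A tiny hypothetical semantic net.
net = {
    "ships": ["whaling"],
    "whales": ["whaling", "mammals", "sea creatures"],
    "whaling": ["ships", "whales"],
}
interest = {}
activate(interest, net, "ships")       # whaling is activated just a little
after_ships = interest["whaling"]
activate(interest, net, "whales")      # a second interest reinforces whaling
after_whales = interest["whaling"]
```

After both calls, whaling has accumulated more activation than mammals or sea creatures, because two separate interests point to it. Only topics reinforced in this way would be worth searching for.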

Also, any website that contains many interesting documents (however measured) should automatically become interesting itself. By extension, any sites that refer to that site also become interesting (although less so), and so on.
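As a rough sketch of how interest might flow from documents to sites and then to referring sites, assume each site's score is the sum of its documents' scores and that each referring site inherits a damped share. The damping factor, the number of rounds, and all names here are assumptions made for illustration.

```python
def site_scores(doc_scores, doc_site, site_refs, damping=0.5, rounds=2):
    """Roll document interest up to sites, then pass it to referring sites.

    doc_scores: dict mapping a document to its interestingness.
    doc_site:   dict mapping a document to the site that holds it.
    site_refs:  dict mapping a site to the set of sites it refers to.
    """
    # A site is as interesting as the documents it contains.
    base = {}
    for doc, score in doc_scores.items():
        site = doc_site[doc]
        base[site] = base.get(site, 0.0) + score

    scores = dict(base)
    wave = base  # interest gained in the previous round
    for _ in range(rounds):
        next_wave = {}
        for referrer, targets in site_refs.items():
            inherited = damping * sum(wave.get(t, 0.0) for t in targets)
            if inherited:
                next_wave[referrer] = inherited
        for site, gain in next_wave.items():
            scores[site] = scores.get(site, 0.0) + gain
        wave = next_wave
    return scores


doc_scores = {"a/1": 1.0, "a/2": 1.0}
doc_site = {"a/1": "a", "a/2": "a"}
site_refs = {"b": {"a"}, "c": {"b"}}
result = site_scores(doc_scores, doc_site, site_refs)
```

Site a holds the interesting documents, site b refers to a and so gets half of a's score, and site c refers to b and gets half of that. Limiting the number of rounds keeps the sketch from looping forever when sites refer to each other.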

Known Space is essentially building a semantic network of topics the user might find interesting. This is a web within the world wide web dedicated to one particular user's interests.

Here are some obvious characteristics of a webpage:

Do more pages point to it than it points to? (such a measure is an estimate of how seminal the page is)
How hard is it to get? (a measure either of the page's popularity or of its home's instability)
Who wrote it?
Who points to it?
Who does it point to?
How big is it?
How popular is it? (with the home documents, with other documents the user values, with other users)
How many other pages point to it?
How many pages does it point to?
How dense are the connections among its neighbors?
How far away is it? (measured in terms of shortest linkage distance from a home document)
What is it about?
What kind of site is it? (topic (if it's prose), language, ftp site, gopher site, usenet article or archive, pictures, sounds)
What is it related to? (to the user, to others the user values, to everyone)
What names appear in it? (the user's name, other personal names, software names, company names, country names, and so on)
What language is it in?
When was it created?
When was it last modified?
Where is it? (included in a site important to the user, others the user values, everyone)
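
Several of the link-based characteristics above can be computed directly from a link graph. The following is a minimal sketch under assumed data structures (a dict from each page to the set of pages it points to); the function and field names are hypothetical.

```python
from collections import deque


def link_features(page, links, home_docs):
    """Compute a few link-based characteristics of `page`.

    links: dict mapping a page to the set of pages it points to.
    """
    out_links = links.get(page, set())
    in_links = {p for p, targets in links.items() if page in targets}

    # Density: fraction of ordered neighbor pairs that are joined by a link.
    neighbors = (in_links | out_links) - {page}
    pairs = len(neighbors) * (len(neighbors) - 1)
    joined = sum(
        1
        for a in neighbors
        for b in links.get(a, set())
        if b in neighbors and b != a
    )
    density = joined / pairs if pairs else 0.0

    # Distance: shortest chain of links from any home document (None if unreachable).
    distance = None
    queue = deque((home, 0) for home in home_docs)
    seen = set(home_docs)
    while queue:
        node, hops = queue.popleft()
        if node == page:
            distance = hops
            break
        for nxt in links.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))

    return {
        "pointed_to_by": len(in_links),
        "points_to": len(out_links),
        "seminal": len(in_links) > len(out_links),
        "neighbor_density": density,
        "distance_from_home": distance,
    }


links = {"home": {"a", "b"}, "a": {"c"}, "b": {"c", "a"}, "c": set()}
info = link_features("c", links, ["home"])
```

In this toy graph, page c is pointed to by two pages and points to none, so it counts as seminal. It sits two links away from the home document.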

Known Space might be able to "explain" its choice of interestingness for a webpage by displaying the set of activations along all these dimensions (arrayed in a two-dimensional map). Further, if the page proves very popular with the user, Known Space can turn that map into a fingerprint and use it to test for other pages like it.
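A fingerprint test could be as simple as comparing two activation maps dimension by dimension. The sketch below uses cosine similarity, which is one plausible choice rather than anything Known Space prescribes. The dimension names and values are invented for the example.

```python
import math


def fingerprint_similarity(a, b):
    """Cosine similarity between two activation maps (dicts of dimension -> value)."""
    dims = set(a) | set(b)
    dot = sum(a.get(d, 0.0) * b.get(d, 0.0) for d in dims)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty map resembles nothing
    return dot / (norm_a * norm_b)


# Hypothetical activations for a page the user liked and two candidates.
favorite = {"popularity": 0.9, "distance": 0.2, "names": 0.7}
similar = {"popularity": 0.8, "distance": 0.3, "names": 0.6}
unrelated = {"size": 1.0}

close = fingerprint_similarity(favorite, similar)
far = fingerprint_similarity(favorite, unrelated)
```

A candidate whose activations line up with the favorite page scores near 1, and one that shares no active dimensions scores 0.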

This classification scheme is like judging scientists for the Nobel Prize (who thinks they're good? what's their track record?). It's also like going to the library to find something interesting to read. And it's like judging other people (who do I know who vouches for this person in whatever implicit way?).

Since Known Space will inevitably make classification mistakes, it should be user-modifiable: things that aren't close in keyword confluence, or in any of the other automated measures used, but which the user feels are close in some semantic sense, can then be moved closer together. Further, the user should be able to blacklist a site and have that blacklisting appear in the user's local neighborhood map. Finally, the user should be able to exchange linkage information with other users so that the information mapping ability of each user in the group is magnified to the mapping ability of the entire group (in other words, users should have automatic word-of-mouth recommendations about new sites and semantic linkages). Localization, blacklisting, and information sharing all cut down on the cognitive effort of remembering long lists of unrelated things. Were we not creatures with extremely limited memories, all books would be one long sentence.
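To illustrate how these three kinds of user input might combine, here is a small sketch of a neighborhood map built from automated closeness estimates, the user's own corrections, and linkage information shared by other users. The precedence order (shared information discounted, automated measures next, the user's corrections last) and the trust factor are assumptions, as are all the names.

```python
def build_neighborhood(automatic, user_overrides, shared, blacklist, trust=0.5):
    """Combine closeness estimates into one local neighborhood map.

    automatic, user_overrides, shared: dicts mapping a (site, site) pair,
        stored in sorted order, to a closeness value between 0 and 1.
    blacklist: set of sites the user has blacklisted.
    """
    closeness = {}
    # Word-of-mouth linkages from other users count, but at a discount.
    for pair, value in shared.items():
        closeness[pair] = trust * value
    # The system's own automated measures replace shared guesses.
    closeness.update(automatic)
    # The user's explicit corrections always win.
    closeness.update(user_overrides)

    # Blacklisted sites stay on the map, marked rather than removed.
    return {
        pair: {
            "closeness": value,
            "blacklisted": any(site in blacklist for site in pair),
        }
        for pair, value in closeness.items()
    }


automatic = {("a", "b"): 0.1, ("a", "c"): 0.6}
user_overrides = {("a", "b"): 0.9}   # the user feels a and b belong together
shared = {("a", "d"): 0.8, ("a", "c"): 1.0}
neighborhood = build_neighborhood(automatic, user_overrides, shared, {"c"})
```

In this example, the user's correction pulls a and b together despite a low automated score, the link to d arrives purely by recommendation, and the pair involving the blacklisted site c remains visible but flagged.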


Gregory J. E. Rawlins
1/13/1998