
Known Space Ferrets and Filters

Although the algorithms of today's search engines are proprietary, it's reasonable to guess that they do a keyword analysis of all webpages, perhaps using a thesaurus to find similar words, then link one page to another based on keyword confluence and (presumably) the presence of direct linkage. Were they not doing something like that, they couldn't provide a ``find similar pages'' button. This is a step in the right direction, but there is more that could be done to improve searches.

Today's search engines do not exploit a rich source of information about the user's interests--namely, the knowledge of what sites the user has already found interesting enough to put into a bookmarks file. Nor do they pay attention to the sequence of search requests coming from a particular user. Known Space tries to exploit those kinds of contextual information to find more sites that the user is likely to find interesting. Currently, a site has a high profile only if it is very popular across a broad spectrum of the web's population. That's fine for the lowest common denominator, but it is hardly suited to varied tastes. There's no reason sites couldn't eventually be ranked in popularity within subdomains of users (say, all computer science professors, or all race car drivers) instead of the single domain of all users.

In what follows, call the documents already in a user's bookmarks file the user's home documents. Call every other document an offsite document. Call the offsite documents that any home document directly points to, or is pointed to by, the nearest neighbors. Call all offsite documents within a few links of at least one home document the near neighbors.

The object is to find and evaluate promising offsite candidates for inclusion in the user's home documents. A document that scores high under the following measures is assumed to be interesting to the user, since it is related in some way to documents the user has already valued enough to save links to.

Here are some first steps toward personalizing searching and mapping by estimating a particular user's interest in webpages:

Stage I
Every nearest neighbor gets ten points for each link it contains to a home document. Repeated hits to the same home document from the same nearest neighbor document count as one hit, but repeated hits to different home documents from the same nearest neighbor document count as multiple hits. Any document that neighbors one of those nearest neighbors gets one point for each link to a distinct nearest neighbor. Any neighbors of the neighbors get a tenth of a point, and so on in decreasing factors of ten. More generally, the user could be polled by the system to estimate levels of interest within the home documents. Some documents, or document clusters, may be much more important to the user than others. Further, that evaluation should also be biased dynamically depending on where (that is, in which document clusters) the user spends the most time. This evaluation should then bias the attachment of weight to neighboring but offsite webpages.
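Here is a minimal sketch, in Python, of the Stage I scoring. It assumes a precomputed link graph handed in as a dictionary (here called links) mapping each URL to the set of URLs it links to, plus the bookmark set as home_docs; both names, and the depth cutoff, are placeholders rather than anything Known Space actually specifies:

    from collections import defaultdict

    def stage_one_scores(home_docs, links, depth=3):
        """Score offsite documents by link distance from the home documents.

        home_docs : set of URLs already in the user's bookmarks file.
        links     : dict mapping a URL to the set of URLs it links to
                    (a hypothetical precomputed link graph).

        Nearest neighbors earn ten points per distinct home document they
        touch; the next ring earns one point per distinct nearest neighbor,
        and so on, falling by a factor of ten at each remove.
        """
        # Build an undirected view: "points to, or is pointed to by".
        neighbors = defaultdict(set)
        for src, targets in links.items():
            for dst in targets:
                neighbors[src].add(dst)
                neighbors[dst].add(src)

        scores = defaultdict(float)
        frontier = set(home_docs)   # documents scored in the previous ring
        seen = set(home_docs)       # never rescore a document from an inner ring
        weight = 10.0

        for _ in range(depth):
            ring = defaultdict(set)  # candidate -> distinct frontier docs it touches
            for doc in frontier:
                for other in neighbors[doc]:
                    if other not in seen:
                        ring[other].add(doc)
            for doc, touched in ring.items():
                # Repeated links to the same frontier document count only once.
                scores[doc] += weight * len(touched)
            seen.update(ring)
            frontier = set(ring)
            weight /= 10.0

        return scores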


Stage II
Estimates of semantic linkage based solely on direct weblinkage can't be very accurate. To improve accuracy further, the system must pay attention to the structure of the documents themselves. As a first cut, the system might collect all names referenced in the home documents (Microsoft, Sun Microsystems, Harry Turtledove, or whatever) and save those names, weighting them by the present weighting of the documents containing them plus their frequency of occurrence. Then it can search through the offsite near neighbors of the home documents for mentions of those names. As before, documents that mention those names get ten points, documents that link to documents mentioning those names get one point, and so on.
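A sketch of Stage II might look like the following. The capitalized-phrase pattern is only a crude stand-in for a real name extractor, and scaling each hit by the name's weight is one reading of the weighting described above; home_weights is assumed to come from the Stage I scores:

    import re
    from collections import defaultdict

    # Crude stand-in for a real name extractor: treat runs of capitalized
    # words ("Sun Microsystems", "Harry Turtledove") as candidate names.
    NAME_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b")

    def collect_names(home_texts, home_weights):
        """Weight each name by the weights of the home documents that mention
        it, accumulated once per occurrence so frequency also counts."""
        name_weights = defaultdict(float)
        for doc, text in home_texts.items():
            for name in NAME_PATTERN.findall(text):
                name_weights[name] += home_weights.get(doc, 1.0)
        return name_weights

    def stage_two_scores(name_weights, neighbor_texts):
        """Give a near neighbor ten points, scaled by the name's weight,
        for each weighted name it mentions."""
        scores = defaultdict(float)
        for doc, text in neighbor_texts.items():
            for name, w in name_weights.items():
                if name in text:
                    scores[doc] += 10.0 * w
        return scores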

Stage III
Checking direct linkage and names will give some information about semantic linkage, but Known Space must press on. The system could analyze the home documents in more detail, creating histograms of word frequencies and paying particular attention to uncommon words (using a dictionary coupled with a thesaurus to determine word groupings). After tossing out extremely common words (and, but, so, you, and so on) it boils those histograms down into general classes of documents (again weighting those classes by the averaged weight of the corresponding documents). It simplifies these classes further, discarding even more information, until it has a fairly small fingerprint function; experiment will have to determine how small this fingerprint needs to be for fast testing and reasonable discrimination. Then it searches the near neighbors of the home documents for documents that fit the fingerprints of the most interesting (most highly weighted so far) home documents.
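A minimal fingerprinting sketch, assuming the text of each page has already been extracted; the stopword list, the fingerprint size, and the cosine comparison are illustrative choices only, not the dictionary-and-thesaurus word grouping described above:

    import math
    import re
    from collections import Counter

    # A tiny placeholder stopword list; a real one would be much longer.
    STOPWORDS = {"and", "but", "so", "you", "the", "a", "an", "of", "to",
                 "in", "is", "it", "that", "for"}

    def fingerprint(text, size=50):
        """Boil a document down to a small word-frequency fingerprint:
        tokenize, drop extremely common words, keep the `size` most
        frequent remaining words.  How small `size` can be is the
        experimental question raised above."""
        words = re.findall(r"[a-z']+", text.lower())
        counts = Counter(w for w in words if w not in STOPWORDS)
        return dict(counts.most_common(size))

    def similarity(fp_a, fp_b):
        """Cosine similarity between two fingerprints
        (0 = unrelated, 1 = identical)."""
        shared = set(fp_a) & set(fp_b)
        dot = sum(fp_a[w] * fp_b[w] for w in shared)
        norm = math.sqrt(sum(v * v for v in fp_a.values())) * \
               math.sqrt(sum(v * v for v in fp_b.values()))
        return dot / norm if norm else 0.0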

The sophistication of the fingerprinting can be arbitrarily extended depending on how much processing power the system has at its disposal. For example, many pages are mainly lists of links; such pages should show up in a sophisticated fingerprinter as link-heavy, while others might show up as picture-heavy, prose-heavy, and so on. Webpages can also be more or less current--currency is easy to determine from a page's date-last-modified field.
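A crude composition check along these lines might simply count tags against running text; the thresholds below are arbitrary placeholders, and the currency check (reading a page's date-last-modified) is omitted:

    import re

    def page_composition(html):
        """Crudely classify a page as link-heavy, picture-heavy, or
        prose-heavy by comparing tag counts to the amount of running text;
        the thresholds are placeholders to be tuned by experiment."""
        n_links = len(re.findall(r"<a\s", html, re.IGNORECASE))
        n_images = len(re.findall(r"<img\s", html, re.IGNORECASE))
        n_words = len(re.findall(r"\w+", re.sub(r"<[^>]*>", " ", html)))
        if n_links > 0.05 * max(n_words, 1):
            return "link-heavy"
        if n_images > max(n_links, 5):
            return "picture-heavy"
        return "prose-heavy"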

Stage IV
The system should also allow query-by-example, where the user specifies some sites and anti-sites as a way of implicitly defining the current interest. Further, if the user chooses to divulge the information, Known Space could also know something about the user's predilections: society memberships, hobbies, research interests, and so on. Finally, the system should use its previously created fingerprints of home webpages to derive information about the user's interests.
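One way to fold example sites and anti-sites into a single score is to subtract a candidate's resemblance to the anti-sites from its resemblance to the example sites, using whatever pairwise measure is at hand (for instance, the cosine measure from the Stage III sketch); the function below is only a sketch of that idea:

    def query_by_example_score(candidate_fp, example_fps, anti_fps, similarity):
        """Score a candidate page against the user's example sites and
        anti-sites.  `similarity` is any pairwise measure over fingerprints,
        e.g. the cosine measure sketched under Stage III.  Positive scores
        mean the candidate looks more like the examples than the
        anti-examples."""
        pos = sum(similarity(candidate_fp, fp) for fp in example_fps)
        neg = sum(similarity(candidate_fp, fp) for fp in anti_fps)
        return pos - neg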

Whenever the user stops surfing and frees up the computer and the communications line, Known Space could autonomously initiate searches at one of the commercial search engines with any and all of the above derived information: sublists of the user's declared list of keywords and interests, sites similar to popular home sites, sites containing names popular with the home sites, sites with word-histograms similar to those of popular home sites, near neighbors of any of those sites, and so on, in increasing order of detail. Once these webpages arrive, Known Space would process them in the usual way as sketched above.
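A sketch of how those derived interests might be turned into a queue of queries, coarse ones first; the submission step itself is omitted, since it depends entirely on whichever search engine's interface is used, and all the parameter names here are hypothetical:

    def build_query_queue(declared_keywords, popular_names, top_fingerprints,
                          per_query=3):
        """Turn derived interest data into query strings, coarse queries first.
        The inputs are assumed to be a list of the user's declared keywords,
        a list of names popular in the home sites (Stage II), and fingerprints
        of the most popular home sites (Stage III).  Actually submitting the
        queries, and throttling them, is left out."""
        queue = [" ".join(declared_keywords[:per_query]),
                 " ".join(popular_names[:per_query])]
        for fp in top_fingerprints:
            top_words = sorted(fp, key=fp.get, reverse=True)[:per_query]
            queue.append(" ".join(top_words))
        return [q for q in queue if q]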

This autonomous behavior would have to be fairly moderate; otherwise, search engine companies might disallow future searches from that user (the technology to identify users may soon be in use on most browsers). Known Space could also create its own webworms to do searches if it were denied access to commercial search engines.

These stages are in order of increasing computational complexity, and should be done in that order. If the system has little time to estimate which pages to look at next, it should use Stage I techniques only; if it has a little more time, it should use Stages I and II; and so on.

