Although the algorithms of today's search engines are proprietary, it's
reasonable to guess that they perform a keyword analysis of all
webpages, perhaps using a thesaurus to find similar words, then link
one page to another based on keyword confluence and (presumably) the
presence of direct linkage. Were they not doing something like that, they
couldn't provide a "find similar pages" button. This is a step in the right
direction, but there is more that could be done to improve searches.
Today's search engines ignore valuable information about the user's
interests, namely the sites the user has already found interesting
enough to put into a bookmarks file. They also pay no
attention to the sequence of search requests from a particular user.
Known Space tries to exploit those kinds of contextual information to
find more sites that the user is likely to find interesting. Currently,
a site has a high profile only if it is very popular across a broad spectrum
of the web's population. That's fine for the lowest common denominator,
but it is hardly suited to varied tastes. There's no reason sites
couldn't eventually be ranked in popularity within subdomains of users
(say, all computer science professors, or all race car drivers)
instead of the single domain of all users.
In what follows, call the documents already in a user's bookmarks file
the user's home documents. Call every other document an offsite
document. Call the offsite documents that any home document directly
points to, or is pointed to by, the nearest neighbors. Call all
offsite documents within a few links of at least one home document
the near neighbors.
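To make these definitions concrete, here is a minimal sketch in Python,
assuming the link structure is available as a dictionary mapping each
document's address to the set of addresses it links to; the function name
and the two-link radius are illustrative, not part of the design itself.

    from collections import deque

    def near_neighbors(links, home_docs, radius=2):
        """Offsite documents within radius links of any home document.

        Documents at distance one are the nearest neighbors; the whole
        result is the set of near neighbors.  Links are followed in both
        directions, since a nearest neighbor may point to, or be pointed
        to by, a home document.
        """
        # Build an undirected view of the link graph.
        undirected = {}
        for src, targets in links.items():
            undirected.setdefault(src, set()).update(targets)
            for dst in targets:
                undirected.setdefault(dst, set()).add(src)

        # Breadth-first expansion outward from the home documents.
        distance = {doc: 0 for doc in home_docs}
        queue = deque(home_docs)
        while queue:
            doc = queue.popleft()
            if distance[doc] == radius:
                continue
            for other in undirected.get(doc, ()):
                if other not in distance:
                    distance[other] = distance[doc] + 1
                    queue.append(other)

        # Keep only offsite documents (everything not already bookmarked).
        return {doc: d for doc, d in distance.items() if doc not in home_docs}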
The object is to find and evaluate promising offsite candidates for inclusion
in the user's home documents. A document with a high score under the
following measures is assumed to be interesting to the user, since
it is related in some way to documents the user has already
valued enough to save links to.
Here are some first steps toward personalizing searching and mapping by
estimating a particular user's interest in webpages:
- Stage I
- Every nearest neighbor gets ten points for each link it contains to a home
document. Repeated hits to the same home document from the same nearest
neighbor document count as one hit, but repeated hits to different home
documents from the same nearest neighbor document count as multiple hits.
Any document that neighbors one of those nearest neighbors gets one point
for each link to a distinct nearest neighbor. Any neighbors of those
neighbors get a tenth of a point, and so on in decreasing factors of ten
(a code sketch of this scoring follows the notes below).
More generally, the user could be polled by the system to estimate levels
of interest within the home documents. Some documents, or document
clusters, may be much more important to the user than others. Further, that
evaluation should also be adjusted dynamically depending on where (that is,
in which document clusters) the user spends the most time, and it should
in turn bias the weight attached to neighboring but offsite webpages.
Notes:
-
This neighbor measure could split into a 2-d vector whose second half is
a measure of blacklisted sites as signs of negative interest. Neighbors of
blacklisted sites would go down in value just as neighbors of home documents
go up in value.
-
Should the neighbor measure distinguish between incoming and outgoing links?
That is, should an offsite document that a home document points to be
valued more highly than one that merely points to a home document?
Also, if two sites point to each other, should
they get more points than otherwise?
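Here is a minimal sketch of the Stage I scoring just described, under the
same assumptions as the earlier neighbor sketch (an undirected view of the
link graph, illustrative names). Each offsite document is credited at the
closest level at which it is reached; blacklist handling is left out.

    def stage_one_scores(links, home_docs, radius=3, base_points=10.0):
        """Score offsite documents by proximity to the home documents:
        ten points per distinct home document a nearest neighbor is linked
        with, one point per distinct nearest neighbor for their neighbors,
        and so on in decreasing factors of ten."""
        # Undirected adjacency, since incoming and outgoing links both count.
        undirected = {}
        for src, targets in links.items():
            undirected.setdefault(src, set()).update(targets)
            for dst in targets:
                undirected.setdefault(dst, set()).add(src)

        scores = {}
        frontier = set(home_docs)   # documents credited at the previous level
        seen = set(home_docs)
        points = base_points
        for _ in range(radius):
            next_frontier = set()
            for doc, neighbors in undirected.items():
                if doc in seen:
                    continue
                hits = neighbors & frontier   # distinct previous-level documents
                if hits:
                    scores[doc] = scores.get(doc, 0.0) + points * len(hits)
                    next_frontier.add(doc)
            seen |= next_frontier
            frontier = next_frontier
            points /= 10.0                    # decreasing factors of ten
        return scores

The blacklist idea in the first note could reuse the same routine: run it
once from the home documents and once from the blacklisted sites, and
subtract the second set of scores from the first.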
- Stage II
- Estimates of semantic linkage based solely on direct weblinkage can't be very
accurate. To improve accuracy further the system must pay attention to the
structure of the documents themselves. As a first cut at this, the system
might collect all names referenced in the home documents (Microsoft, Sun
Microsystems, Harry Turtledove, or whatever) and save those names
(appropriately weighting them based on the present weighting of the
documents containing them plus their frequency of occurrence). Then it can
search through the offsite near neighbors of the home documents for
mentions of those names. As before, documents that mention those names get
ten points, documents that link to documents mentioning those names get one
point, and so on.
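A minimal sketch of this name matching follows, assuming the text of the
home documents and near neighbors has already been fetched; the crude
capitalized-word heuristic stands in for a real name extractor and is
purely illustrative.

    import re
    from collections import Counter

    def harvest_names(home_texts, home_weights):
        """home_texts: {url: text}; home_weights: e.g. Stage I scores."""
        name_weights = Counter()
        for url, text in home_texts.items():
            # Crude stand-in for name extraction: any capitalized word.
            for name in re.findall(r"\b[A-Z][a-zA-Z]+\b", text):
                name_weights[name] += home_weights.get(url, 1.0)
        return name_weights

    def stage_two_scores(neighbor_texts, name_weights, points=10.0):
        """Reward near neighbors for each harvested name they mention."""
        scores = {}
        for url, text in neighbor_texts.items():
            mentioned = {name for name in name_weights if name in text}
            if mentioned:
                scores[url] = points * sum(name_weights[n] for n in mentioned)
        return scores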
- Stage III
- Checking direct linkage and names will give some information about semantic
linkage, but Known Space must press on. The system could analyze the
home documents in more detail, creating histograms of word frequencies and
paying particular attention to uncommon words (using a dictionary coupled
with a thesaurus to determine word groupings). After tossing out extremely
common words (and, but, so, you, and so on), it boils those histograms down
into general classes of documents (again weighting those classes based on the
averaged weight of the corresponding documents). It simplifies these
classes further (thus discarding even more information) until it has a fairly
small fingerprint function (experiment will have to determine how small this
fingerprint needs to be for fast testing and reasonable discrimination).
Then it searches the near neighbors of the home documents for documents
that fit the fingerprints of the most interesting (most highly weighted
so far) home documents.
The sophistication of the fingerprinting can be arbitrarily extended
depending on how much processing power the system has at its disposal. For
example, many pages are mainly lists of links; such pages should show up in
a sophisticated fingerprinter as link-heavy, while others might be
picture-heavy, prose-heavy, and so on. Webpages can also be more or less
current, and currency is easy to determine from the date-last-modified field.
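As one possible realization of such a fingerprint, the sketch below hashes a
document's uncommon words into a small fixed-size vector and compares
vectors by cosine similarity; the vector size and the tiny stopword list are
arbitrary choices for illustration.

    import math
    import re
    import zlib

    STOPWORDS = {"and", "but", "so", "you", "the", "a", "an", "of",
                 "to", "in", "is", "it"}

    def fingerprint(text, size=64):
        """Hash a document's uncommon words into a small normalized vector."""
        vec = [0.0] * size
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                vec[zlib.crc32(word.encode()) % size] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    def similarity(fp_a, fp_b):
        """Cosine similarity between two fingerprints (1.0 means very alike)."""
        return sum(a * b for a, b in zip(fp_a, fp_b))

Near neighbors whose fingerprints score high against those of the most
heavily weighted home documents would then be promoted.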
- Stage IV
- The system should also allow query-by-example, where the user specifies some
sites and anti-sites as a way of implicitly defining the current interest.
Further, if the user chooses to divulge the information, Known Space could also
know something about the user's predilections: society memberships,
hobbies, research interests, and so on. Finally, the system should use its
previously created fingerprints of home webpages to derive information about
the user's interests.
Whenever the user stops surfing and frees up the computer and the
communications line, Known Space could autonomously initiate searches at one
of the commercial search engines with any and all of the above derived
information: sublists of the user's declared list of keywords and interests,
sites similar to popular home sites, sites containing names popular with
the home sites, sites with word histograms similar to those of popular home
sites, near neighbors of any of those sites, and so on, in increasing order
of detail. Once these webpages arrive, Known Space would process them in the
usual way as sketched above.
This autonomous behavior would have to be fairly moderate; otherwise, search
engine companies might disallow future searches from that user (the technology
to identify users may soon be in use on most browsers). Known Space could
also create webworms to do searches if denied access to commercial search
engines.
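A minimal sketch of this idle-time behavior follows, turning the derived
interest data into a queue of queries of increasing detail. Here
submit_query is a placeholder for whatever search engine or webworm
interface is actually available, not a real API, and the query budget is an
arbitrary throttle of the kind the previous paragraph calls for.

    import itertools

    def build_queries(keywords, popular_names, max_terms=3):
        """Yield progressively more detailed queries from derived interests."""
        terms = list(keywords) + list(popular_names)
        for n in range(1, max_terms + 1):
            for combo in itertools.combinations(terms, n):
                yield " ".join(combo)

    def idle_search(keywords, popular_names, submit_query, budget=20):
        """Run at most budget queries, least detailed first, while idle."""
        results = {}
        for query in itertools.islice(build_queries(keywords, popular_names),
                                      budget):
            results[query] = submit_query(query)   # placeholder search call
        return results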
These stages are in order of increasing computational complexity, and they
should be done in that order. If the system has little time to estimate
which pages to look at next, only Stage I techniques should be used; if it
has a little more time, it should use Stages I and II; and so on.
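A minimal sketch of that time-budgeted selection, assuming each stage is
packaged as a scoring function that returns a score per candidate page; the
names and the budget are illustrative.

    import time

    def rank_candidates(stages, time_budget_seconds):
        """stages: scoring functions, cheapest first; each returns {url: score}."""
        totals = {}
        deadline = time.monotonic() + time_budget_seconds
        for run_stage in stages:
            if time.monotonic() >= deadline:
                break                    # no time left for the costlier stages
            for url, score in run_stage().items():
                totals[url] = totals.get(url, 0.0) + score
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)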
Gregory J. E. Rawlins
1/13/1998