the datastream


Collector simpletons drag data into the system from outside information sources. As this information flows down the data stream, other simpletons modify and manipulate it. Mapper simpletons at the end of the stream map the information to a pseudo-display. Information brought into the data stream leaves it only when purger simpletons delete it.
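
To make the flow concrete, here is a minimal sketch in Python; the stage names (collect, modify, map_to_display, purge) and the Page record are illustrative assumptions, not part of the design.

```python
from dataclasses import dataclass, field

# Hypothetical minimal page record; names are illustrative only.
@dataclass
class Page:
    url: str
    marks: list = field(default_factory=list)

def collect():                        # collector: drags pages in from a source
    return [Page("http://example.org/a"), Page("http://example.org/b")]

def modify(page):                     # mid-stream simpleton: manipulates/annotates
    page.marks.append(("seen", True))

def map_to_display(page):             # mapper: maps the page onto the pseudo-display
    return (page.url, len(page.marks))

def purge(pool):                      # purger: the only way data leaves the stream
    return [p for p in pool if ("expired", True) not in p.marks]

pool = collect()
for p in pool:
    modify(p)
pool = purge(pool)
print([map_to_display(p) for p in pool])
```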

A simpleton marks a page or cluster by adding an attribute-value pair, the current time, and its identifier. It also adds the page's identifier to its personal pool of marked pages, which is kept separately and indexed by simpleton identifier. Thus each simpleton has an "upstream" set of pages it roams over and a "downstream" set of pages it subsequently marks: it consumes pages from its upstream pool and delivers pages to its downstream pool. This overlays the space of pages with a collection of overlapping pools: the general pool of all pages fetched by the collectors, plus a pool of marked pages for each internal simpleton.
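
A hedged sketch of that bookkeeping, assuming the Page record from the sketch above; the record fields (attribute, value, time, simpleton identifier) follow the text, while marked_pools and mark() are hypothetical names.

```python
import time
from collections import defaultdict

marked_pools = defaultdict(set)      # personal pools, indexed by simpleton identifier

def mark(page, simpleton_id, attribute, value):
    # the mark itself: an attribute-value pair, the current time, and the
    # marker's identifier
    page.marks.append({
        "attribute": attribute,
        "value": value,
        "time": time.time(),
        "simpleton": simpleton_id,
    })
    # the page also joins this simpleton's downstream pool
    marked_pools[simpleton_id].add(page.url)
```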

All simpletons run in parallel in their own threads and share the same structure (sketched in code after this list):

* an upstream pool of pages or page clusters to roam over,
* a roaming algorithm determining how to choose the next page or cluster to examine,
* an algorithm to check a page's "interestingness,"
* an algorithm to decide what attributes should be attached to any interesting pages found.
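
A minimal sketch of that common skeleton, assuming the mark() helper from the previous sketch; the method names are illustrative.

```python
import random

class Simpleton:
    def __init__(self, simpleton_id, upstream_pool):
        self.id = simpleton_id
        self.upstream = upstream_pool           # pages/clusters to roam over

    def next_page(self):                        # roaming algorithm
        return random.choice(self.upstream)

    def is_interesting(self, page):             # interestingness check
        raise NotImplementedError

    def attributes_for(self, page):             # attributes for interesting pages
        raise NotImplementedError

    def step(self):                             # one roam-examine-mark cycle
        page = self.next_page()
        if self.is_interesting(page):
            for attribute, value in self.attributes_for(page):
                mark(page, self.id, attribute, value)
```

Each concrete family of simpletons would then override only is_interesting() and attributes_for().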

Only the collectors are superficially different, but since we can think of them as dragging back pages and marking them locally (thereby adding to the general pool of all pages), there is really no difference at all.

Each of these simpleton algorithms can be modified dynamically through the user model, which tells each simpleton how to adjust its behavior via a specification object. Each simpleton fetches the relevant specification and updates its behavior accordingly; there is a separate specification for each simpleton in the data stream.
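
A sketch of how that fetch-and-update loop might look; the user model's interface is not specified here, so specification_for() and the threshold knob are assumptions.

```python
class UserModel:
    def __init__(self):
        self.specs = {}                         # one specification object per simpleton

    def specification_for(self, simpleton_id):
        return self.specs.get(simpleton_id, {})

def refresh_behavior(simpleton, user_model):
    spec = user_model.specification_for(simpleton.id)
    # a hypothetical knob: how choosy the interestingness test should be
    simpleton.threshold = spec.get("threshold", 0.5)
```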

The attributes that simpletons attach to pages might be anything, since they depend on both the page and the simpleton. An attribute may even be the null attribute of "I looked at this page but couldn't find anything useful to say about it" (which is different from an attribute like "I looked at this page and found it dull"). Simpletons may implicitly communicate with each other through the attributes they leave on the pages they visit (that is, they may also mark the marks on a page, not just the page itself).
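
For illustration, reusing the Page record and mark() helper from the sketches above (the specific attributes and simpleton identifiers are made up):

```python
page = Page("http://example.org/a")
mark(page, "filter-7", None, None)        # null attribute: looked, nothing useful to say
mark(page, "filter-9", "dull", True)      # a substantive (negative) judgment

# Marking a mark rather than the page: one simpleton annotating another's mark.
page.marks[1].setdefault("marks", []).append(
    {"attribute": "endorsed", "value": True, "simpleton": "pollster-3"}
)
```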

Each page that rates highly on some measure of interestingness would appear in a pool of pages of that type. Simpletons may stay with the pages in a particular pool, or dive into the general pool, or start in a specific pool and then branch out into the general pool, and so on. (Or should we force each simpleton to read from and write to only its own datapools?) Consequently, linking classes of simpletons is a simple matter of telling each simpleton where to start roaming.
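
A sketch of those roaming choices; p_stay is an assumed parameter for how often a simpleton stays with its special pool, and both pools are taken to be lists of pages.

```python
import random

def choose_page(special_pool, general_pool, p_stay=0.8):
    if special_pool and random.random() < p_stay:
        return random.choice(special_pool)     # stay with the pool of its type
    return random.choice(general_pool)         # dive/branch into the general pool
```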

Collectors, for example, would be biased to pay more attention to pages that the collector advisors rate highly, in order to find seed sites. Filters would be biased to pay attention to pages that the pollsters rate highly, in order to find significant (or "lighthouse") pages to extract fingerprints from. Neither the collectors, the filters, nor the pollsters need know anything about any of the other families of simpletons; their only concern is which page pools to roam over.
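
Under the sketches above, that wiring might look like this; pages_marked_by() is a hypothetical helper that resolves a personal pool of page identifiers to the pages themselves, and the simpleton identifiers are made up.

```python
collectors = [Simpleton(f"collector-{i}", pages_marked_by("advisor-1"))
              for i in range(3)]
filters = [Simpleton(f"filter-{i}", pages_marked_by("pollster-1"))
           for i in range(3)]
```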

All simpletons can examine both the general page pool and each simpleton's pool of marked pages, to see which simpletons have marked which pages and which pages each simpleton has marked (and when). Thus, simpletons can implicitly vote on which other simpletons are doing a good job (relative to the current simpleton) by the marks they leave, or don't leave, on pages those other simpletons have also marked. This means that simpletons could, potentially, evolve. (Or should we disallow simpletons from reading other simpletons' datapools?)
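
One way such implicit voting could be computed, assuming the marked_pools index from the earlier sketch: score each peer by the fraction of its marked pages that the current simpleton went on to mark too.

```python
def implicit_votes(my_id, marked_pools):
    mine = marked_pools[my_id]
    return {
        other: len(mine & pool) / len(pool)    # overlap: how often I re-mark its pages
        for other, pool in marked_pools.items()
        if other != my_id and pool
    }
```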

This architecture might be usable on large spaces of pages (say, a million pages). The simpletons probabilistically hop from page to page all over the dataspace. Pages that start getting a little attention from several simpletons would automatically get more and more attention as each simpleton tries to see whether those pages can be marked in its own special way (whatever that happens to be). Conversely, any simpleton that happens to mark pages that later become interesting to many other simpletons could get more reward than one that doesn't. Consequently, both the most interesting pages and the simpletons most capable of finding those pages would be implicitly rated. The result is a space that may never be fully explored but that is always at least partially mapped.
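
A sketch of that amplification dynamic; the weighting scheme (one plus the number of existing marks) is an assumption.

```python
import random

def hop(pool):
    # marked pages draw proportionally more attention on each hop
    weights = [1 + len(page.marks) for page in pool]
    return random.choices(pool, weights=weights, k=1)[0]
```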


