The engine includes the ferrets, filters, advisors, mapmakers, generalizers, page database, and related packages. It is intended to be a flexible, adaptive mechanism for fetching, analyzing, and organizing pages the user might like.
This task divides into two subtasks:
The brains of the operation:
Generalizers: Figure out what pages the user likes.
Ferret and Filter Advisors: Figure out what characteristics make those pages distinctive.
Ferret Advisors: Figure out where to get more pages with those characteristics.
The brawn of the operation:
Ferrets: Fetch a page from the web (or some other source).
Filters: Check if a fetched page has the desired characteristics.
Mapmakers: Analyze all other detectable characteristics of accepted pages.
Page Database: Link accepted pages into the map of all accepted pages.
The generalizers (not yet implemented) and the advisors handle the first subtask, and the ferrets, filters, mapmakers, and page database handle the second subtask. (Currently there is only one mapmaker.) Ferrets fetch pages and test them for likely suitability, filters take a closer look at the ferreted pages to see if they actually belong in the space of pages, and mapmakers link the filtered pages into the space of accepted pages.
First, the ferret advisors examine the page database and get a list of pages that have high hittage, that is, pages the user has previously indicated strong interest in. A page's hittage depends on factors like the number of times the user visited the page, the amount of time the user spent on the page, whether the user indicated liking or disliking the page, hittage of pages related to this one, and so on. Generalizers compute hittage for all the pages in the database using data on the user's behavior extracted from the user interface.
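The document does not give an exact hittage formula; a minimal sketch, treating hittage as a weighted sum of the behavioral signals listed above (all weights and field names are illustrative, not part of the design):

```python
from dataclasses import dataclass

@dataclass
class PageStats:
    visits: int              # number of times the user visited the page
    seconds_spent: float     # total time the user spent on the page
    rating: int              # explicit feedback: +1 liked, -1 disliked, 0 none
    neighbor_hittage: float  # mean hittage of pages related to this one

def hittage(s: PageStats) -> float:
    """Weighted sum of behavioral signals; the weights are illustrative."""
    return (1.0 * s.visits
            + 0.01 * s.seconds_spent
            + 5.0 * s.rating
            + 0.5 * s.neighbor_hittage)
```

A real generalizer would tune these weights from user behavior rather than fix them by hand.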
Ferret advisors assume that high hittage pages are the sorts of pages the user would like to see in future. They divide these pages into categories, based on the contents of the pages. To do this, they use lighthouse pages, namely, pages in the database that have consistently high hittage. Lighthouses are assumed to be representative of the pages in their part of the space of pages.
The ferret advisors group pages by associating each page with its closest lighthouse. Then they choose a set of representative pages from each group. Representative pages are those that have (in decreasing order): a history of being used as a representative, high hittage, a large number of links, and a large size. Other properties may be added in future.
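The grouping and ranking steps above can be sketched as follows, assuming a page-to-lighthouse distance function and dictionary field names that are purely illustrative:

```python
def group_by_lighthouse(pages, lighthouses, distance):
    """Associate each page with its closest lighthouse."""
    groups = {lh: [] for lh in lighthouses}
    for page in pages:
        nearest = min(lighthouses, key=lambda lh: distance(page, lh))
        groups[nearest].append(page)
    return groups

def pick_representatives(group, n=3):
    """Rank candidates by the criteria in the text, in decreasing
    order of priority, and keep the top n."""
    return sorted(
        group,
        key=lambda p: (p["was_representative"], p["hittage"],
                       p["num_links"], p["size"]),
        reverse=True,
    )[:n]
```

Python's tuple comparison gives the lexicographic ordering the text describes: a history of prior use as a representative dominates hittage, which dominates link count, which dominates size.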
The ferret advisors take the sets of representative pages and extract distinctive features from each page. Features are combinations of attributes, or may even be the same as attributes themselves. Some features of a page are: the words in the title, the number of images, the language the page is written in, the page's URL, and so on. Each feature can be computed for any page. In addition, two features of the same type can be compared.
To extract a set of distinguishing features for each representative page, ferret advisors compute all possible features for a given page and all possible features for a random set of pages from the space of pages. They then compare the features of the given page against all the features computed for the random set. Some features of the given page will be very dissimilar to most, if not all, of the random pages, which makes those features distinctive for that particular page.
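A sketch of this extraction step, assuming features are stored as name/value pairs and that a pairwise similarity function exists for each feature type (the cutoff values are illustrative assumptions):

```python
def distinctive_features(page_features, random_pages_features,
                         similarity, threshold=0.5, max_fraction=0.1):
    """Keep the features of a page whose values resemble at most
    `max_fraction` of a random sample of pages."""
    distinctive = []
    for name, value in page_features.items():
        similar = sum(
            1 for rnd in random_pages_features
            if name in rnd and similarity(value, rnd[name]) >= threshold
        )
        if similar <= max_fraction * len(random_pages_features):
            distinctive.append(name)
    return distinctive
```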
The ferret advisors use these distinguishing features to create tests, called ferret fingerprints. Each ferret fingerprint consists of a feature and the ID of the representative page that possessed that feature. These fingerprints are inserted into the ferret fingerprint pool.
Filter advisors follow a similar procedure for creating filter fingerprints. For them, however, the aim is a little different: filters collectively cover all parts of the space of pages, and all incoming pages are passed through them. If any one filter approves a page, it is passed on to the mapmakers for addition into the space of pages. Only if no filter approves a page is it discarded. Consequently, the page selection mechanism and nature of the fingerprint function differ for filters.
Filter advisors begin by using the lighthouses to select a set of representative pages for each part of the space of pages. Each advisor picks a lighthouse and gathers all the pages strongly linked to it. Then it whittles these down to a smaller set of representatives by keeping only those pages that are most similar to all other pages in the set. The filter advisors now should have sets of representative pages for each lighthouse (and, therefore, for each part of the space of pages).
The filter advisors then use these representative pages in feature extraction. Features that distinguish this entire set of pages from random pages in the space of pages are used to create filter fingerprints. Unlike ferret fingerprints, however, filter fingerprints do not consist of just one feature, but have a vector of features that are derived from the representatives (not just one representative page, as in the case of ferret fingerprints). These filter fingerprints are then inserted into the filter fingerprint pool.
Both kinds of advisors periodically examine their respective fingerprint pools and update them when necessary. Ferret fingerprints may get "exhausted"---for example, after being used by many ferrets, the ratio of pages finally approved to pages fetched may decrease over time. Such fingerprints are flushed from the pool. Similarly, filter fingerprints may have been created badly---for example, they may be too stringent and so may never pass any pages. Or the features used may not be the correct ones, resulting in pages being approved even though they do not belong to that part of the space of pages.
Based on user feedback (mainly hittage), the "bad" pages (those the user expressed a dislike for or those that do not get hit at all) are traced back to the filters that approved them. The corresponding filter fingerprint has its reliability reduced. If the reliability gets very low, the appropriate filter advisor redoes the fingerprint.
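One way to sketch this reliability bookkeeping, assuming a simple multiplicative decay and a fixed redo threshold (both values are illustrative, not specified by the design):

```python
class FilterFingerprint:
    """Holds a filter's feature vector plus a reliability score that
    is reduced whenever a page it approved turns out to be 'bad'."""

    def __init__(self, features, thresholds):
        self.features = features
        self.thresholds = thresholds
        self.reliability = 1.0

    def penalize(self, decay=0.8):
        """Called when a bad page is traced back to this filter."""
        self.reliability *= decay

    def needs_redo(self, floor=0.2):
        """True once reliability has dropped low enough that the
        filter advisor should rebuild this fingerprint."""
        return self.reliability < floor
```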
Each ferret randomly picks a fingerprint function from the ferret fingerprint pool and looks up the representative page associated with that fingerprint. All the links on that page are considered possible additions to the page database since they each neighbor a page already in the database. The ferret fetches all those pages and applies the chosen fingerprint function to each of them.
Application of the fingerprint function is described here in some detail, since it forms the basis for checking a page's suitability for addition to the space of pages. Recall that each ferret fingerprint is based on a distinguishing feature of some representative page, that is, the kind of page the user indicated a liking for. Also recall that features can be computed for any page, and that features of the same type can be compared.
A ferret computes a feature of the same type as the one in this fingerprint for the incoming page. Then it compares the stored feature and the newly computed one. If the newly computed feature is sufficiently similar to the stored feature (that is, if it is above the similarity threshold associated with that fingerprint), the ferret approves the page.
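The comparison above can be sketched as follows; the field names, the feature-computation callback, and the similarity function are all assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class FerretFingerprint:
    feature_type: str   # which feature to compute, e.g. "title_words"
    stored_value: Any   # that feature's value on the representative page
    page_id: str        # ID of the representative page
    threshold: float    # minimum similarity required for approval

def ferret_approves(page, fp, compute, similarity) -> bool:
    """Compute the fingerprint's feature type for the incoming page
    and compare the result with the stored value."""
    value = compute(fp.feature_type, page)
    return similarity(value, fp.stored_value) >= fp.threshold
```

For a set-valued feature such as title words, `similarity` could be Jaccard overlap; each feature type would supply its own comparison.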
Each ferret puts its approved pages in a cache, but even unapproved pages are cached since the same page might be required by other ferrets, and caching them will save the time spent on making a new HTTP connection and refetching the page.
Each filter then examines all the cached approved pages. Again, all a filter does is apply its fingerprint function to the page to see if it meets its approval criteria. Here, however, application of the fingerprint function is slightly more complex since the fingerprint is now a vector of features. All those features are computed for the incoming page, and are compared to the stored features. The similarity values are compared to the corresponding threshold (there is a vector of thresholds in the fingerprint, one for each type of feature). If a majority of features are above their corresponding thresholds, the filter approves the page.
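The majority vote over the feature vector can be sketched as follows, again treating feature sets as name/value dictionaries (an assumption for illustration):

```python
def filter_approves(page_features, fingerprint_features, thresholds,
                    similarity) -> bool:
    """A filter fingerprint holds a vector of features and a matching
    vector of per-feature thresholds; approve the page if a strict
    majority of features match above their thresholds."""
    matches = sum(
        1
        for name, stored in fingerprint_features.items()
        if similarity(page_features[name], stored) >= thresholds[name]
    )
    return matches > len(fingerprint_features) / 2
```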
By the time a page is approved, a large number of its features have already been computed (this is part of the fingerprint testing process). Since features are based on attributes, this means that a large number of attribute values have already been computed. They are stored with the page as it passes from the ferrets to the filters to the mapmakers. There may, however, be other page attributes not yet computed for the page since they haven't yet become important in judging a page for acceptance. The mapmakers compute any such attributes and add the page to the database.
The Page Database
The above design does not take into account extreme conditions, such as when the system starts up, when the system is killed, and so forth. It assumes a "steady state"---the system already has a bunch of pages and there has already been enough user interaction so that the current data for these pages accurately reflects the user's interests. Extreme conditions will be handled in future versions.
The ferrets and their associated caches have been implemented but not yet tested. Fingerprints have also been implemented, as have the ferret and filter advisors and page selectors. Parts of the filter page selector, however, are incomplete. Also, the FeatureExtractor class has not been written at all.
The feature package is only partially implemented; only one feature exists so far. Feature computation and comparison depend largely on attribute computation and comparison, which is handled in the parser package; that package, however, is itself incomplete.