advisor package


Package Contract

The Advisor package creates fingerprints for the ferrets and filters. It collaborates with both the fingerprint package and the feature package.

There are two advisor classes (since they create two different kinds of fingerprints): the FerretAdvisor and the FilterAdvisor. The FerretAdvisor uses the FerretPageSelector class to fetch pages with high hittage. It is assumed that the user wants more such pages to be added, so FerretFingerprintFunctions must be created for pages like these.

The FerretPageSelector class uses a method provided by the WebpageDB to get the list of pages with the highest hittage. These are then divided into groups based on content similarity: each page is associated with its nearest lighthouse, and pages mapping to the same lighthouse are considered to be in the same group. One or more representatives are chosen from each group, using criteria such as the page's history as a representative, its hittage, and the number of links it has.
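The grouping step can be sketched as follows. This is only an illustration: the design does not say how "nearest" is measured, so the sketch assumes each page and lighthouse is reduced to a numeric vector and compared by Euclidean distance. The class name LighthouseGrouping and its methods are hypothetical and are not part of the package.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pages and lighthouses are plain numeric vectors, and
// "nearest" means smallest Euclidean distance. Neither is specified in the design.
class LighthouseGrouping {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the lighthouse closest to the given page vector.
    static int nearestLighthouse(double[] page, double[][] lighthouses) {
        int best = 0;
        for (int i = 1; i < lighthouses.length; i++) {
            if (squaredDistance(page, lighthouses[i]) < squaredDistance(page, lighthouses[best])) {
                best = i;
            }
        }
        return best;
    }

    // Maps each lighthouse index to the indices of the pages assigned to it.
    // Pages that share a lighthouse end up in the same group.
    static Map<Integer, List<Integer>> groupByLighthouse(double[][] pages, double[][] lighthouses) {
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int p = 0; p < pages.length; p++) {
            int lighthouse = nearestLighthouse(pages[p], lighthouses);
            groups.computeIfAbsent(lighthouse, k -> new ArrayList<>()).add(p);
        }
        return groups;
    }
}
```

Each value in the returned map corresponds to one group from which representatives would then be chosen.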

The representative pages returned by the FerretPageSelector are mined by the FerretAdvisor for Features. This is done using the FeatureExtractor, which, given a web page, extracts significant features from it. The FeatureExtractor randomly picks a sample of pages and computes all possible features for the representative page as well as for each randomly selected page. It then compares the feature values from the representative page with the average Feature values from the random sample. If a feature shows an appreciable difference, it is a distinguishing feature of the representative page.
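A minimal sketch of this comparison is shown below. It assumes that every feature can be expressed as a number and that an "appreciable difference" means a relative difference from the sample average above some cutoff; the design specifies neither. The class DistinguishingFeatures, its methods, and the minRelativeDifference parameter are hypothetical names used for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: features are name -> numeric value, which is an assumption.
class DistinguishingFeatures {

    // Average of each named feature over the sample pages.
    // A feature missing from a sample page counts as zero for that page.
    static Map<String, Double> averageFeatures(List<Map<String, Double>> sample) {
        Map<String, Double> sums = new TreeMap<>();
        for (Map<String, Double> page : sample) {
            for (Map.Entry<String, Double> e : page.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        Map<String, Double> averages = new TreeMap<>();
        for (Map.Entry<String, Double> e : sums.entrySet()) {
            averages.put(e.getKey(), e.getValue() / sample.size());
        }
        return averages;
    }

    // Names of the features whose value on the representative page differs from
    // the sample average by at least minRelativeDifference (an assumed cutoff).
    static List<String> distinguishing(Map<String, Double> representative,
                                       List<Map<String, Double>> sample,
                                       double minRelativeDifference) {
        Map<String, Double> averages = averageFeatures(sample);
        Map<String, Double> ordered = new TreeMap<>(representative);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Double> e : ordered.entrySet()) {
            double mean = averages.getOrDefault(e.getKey(), 0.0);
            double scale = Math.max(Math.abs(mean), 1e-9); // avoid division by zero
            if (Math.abs(e.getValue() - mean) / scale >= minRelativeDifference) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

A relative cutoff is only one possible choice; a test based on the spread of the sample values would fit the description equally well.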

The list of features returned by the FeatureExtractor is then used by the FerretAdvisor for fingerprint creation. The FerretAdvisor creates an instance of the FerretFingerprintFunction class for each prominent distinguishing feature of each representative page. Created fingerprints are then inserted into the ferretFingerprintPool to be used by ferrets.

The Fingerprints in the ferretFingerprintPool are periodically deleted once they have been used by enough ferrets.

The FilterAdvisor works similarly, using the FilterPageSelector to extract representative pages from the web page database and the FeatureExtractor to get significant features for them. However, the page selection process differs a little.

The aim is to construct a set of filters that collectively cover all parts of the space of pages. To do this, the FilterPageSelector uses lighthouses to find representative pages (unlike the FerretPageSelector, it does not select on the basis of hittage). Once these representative pages are found, the FeatureExtractor works on them. The FilterAdvisor uses the resulting list of distinguishing features directly to create a FilterFingerprintFunction. This difference arises because a FilterFingerprintFunction consists of a list of Features and their associated thresholds, while a FerretFingerprintFunction consists of just one distinguishing Feature.
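The structural difference between the two kinds of fingerprint function can be illustrated with the sketch below. The real classes live in the Fingerprint package and are not described here, so SingleFeatureFingerprint and MultiFeatureFingerprint are hypothetical stand-ins. In particular, the rule that a page is accepted only when every listed feature meets its threshold is an assumption made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a ferret-style fingerprint: one distinguishing feature.
class SingleFeatureFingerprint {
    final String featureName;

    SingleFeatureFingerprint(String featureName) {
        this.featureName = featureName;
    }
}

// Hypothetical stand-in for a filter-style fingerprint: several features,
// each with its own threshold.
class MultiFeatureFingerprint {
    private final Map<String, Double> thresholds = new LinkedHashMap<>();

    void addFeature(String featureName, double threshold) {
        thresholds.put(featureName, threshold);
    }

    // ASSUMPTION: a page passes only if every listed feature is present
    // and its value is at least the associated threshold.
    boolean accepts(Map<String, Double> pageFeatures) {
        for (Map.Entry<String, Double> e : thresholds.entrySet()) {
            Double value = pageFeatures.get(e.getKey());
            if (value == null || value < e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Multiplies every threshold by the given factor (a factor below 1 lowers them).
    void scaleThresholds(double factor) {
        thresholds.replaceAll((name, threshold) -> threshold * factor);
    }
}
```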

The FilterFingerprintFunctions can also be "redone" if they are not performing well. Performance is measured by a reliability value, which the FilterAdvisor checks periodically. If a FilterFingerprintFunction is not performing well, it is recomputed: its thresholds are lowered or its distinguishing features are recomputed. Every time the space of pages increases in size by 10%, the entire FilterFingerprintFunction pool is flushed (since the existing functions may no longer cover all of the space of pages) and all FilterFingerprintFunctions are created again.
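The growth rule could be tracked with a small helper like the one below. This is a sketch under assumptions: RebuildTrigger and its methods are hypothetical names, and the size of the space of pages is assumed to be available as a page count. Integer arithmetic is used so that the 10% boundary is tested exactly.

```java
// Hypothetical helper: remembers the size of the space of pages at the last
// rebuild and reports when it has grown by at least growthPercent percent.
class RebuildTrigger {
    private final int growthPercent;
    private long sizeAtLastRebuild;

    RebuildTrigger(long initialSize, int growthPercent) {
        this.sizeAtLastRebuild = initialSize;
        this.growthPercent = growthPercent;
    }

    // True when currentSize exceeds the recorded size by growthPercent or more.
    // Written with integers to avoid floating-point rounding at the boundary.
    boolean shouldRebuild(long currentSize) {
        return (currentSize - sizeAtLastRebuild) * 100 >= sizeAtLastRebuild * growthPercent;
    }

    // Call once the pool has been flushed and all functions recreated.
    void markRebuilt(long currentSize) {
        sizeAtLastRebuild = currentSize;
    }
}
```

With a starting size of 1000 pages and a growth setting of 10, the helper reports a rebuild at 1100 pages; after markRebuilt(1100) the next rebuild is due at 1210 pages.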

Package-Level CRC

Collaborators:
advisor works with the following packages: webpageDB, ferrets, filter, and fingerprint

Responsibilities:
Create appropriate fingerprint functions for the ferrets and the filters.

Class-Level CRCs

The advisor package contains the following classes:
* FerretAdvisor
* FilterAdvisor
* PageSelector
* FerretPageSelector
* FilterPageSelector
* FeatureExtractor

Class FerretAdvisor

* Responsibilities:
Create fingerprint functions to be used by ferrets.
* Collaborators:
Ferret fingerprint pool, Web page database
* Variables and Methods:
public DataPool ferretFingerprintPool;
public void CreateRequiredFingerprints()
This method looks up the web page database and determines what classes of pages need to have fingerprint functions created for them. For each such class of pages, the CreateOneFingerprint method is called. The fingerprint functions are inserted into the ferretFingerprintPool.
private Fingerprint CreateOneFingerprint (PageID [] representativePages)
This method looks up the list of representative pages and tries to find features from these pages that can be used for generating fingerprints. For each distinctive feature of these pages, a FerretFingerprintFunction is created (see the Fingerprint package for details on how a ferret fingerprint is associated with a Feature).
public void flushFingerprintPool()
This method is called periodically to remove fingerprint functions that have been used by many ferrets. Presumably, these functions have been milked to get as many web pages as possible out of them.
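One way to picture the flush is shown below. The sketch assumes that each pooled fingerprint function carries a count of how many ferrets have used it and that the flush removes those at or above a limit; the design only says "used by many ferrets". PooledFingerprint, FingerprintPoolSketch, and the maxUses parameter are hypothetical and do not describe the actual DataPool class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pooled entry: a fingerprint identified by its feature name,
// plus the number of ferrets that have used it so far (an assumed field).
class PooledFingerprint {
    final String featureName;
    int timesUsed;

    PooledFingerprint(String featureName, int timesUsed) {
        this.featureName = featureName;
        this.timesUsed = timesUsed;
    }
}

// Hypothetical stand-in for the ferret fingerprint pool.
class FingerprintPoolSketch {
    private final List<PooledFingerprint> pool = new ArrayList<>();

    void insert(PooledFingerprint fingerprint) {
        pool.add(fingerprint);
    }

    int size() {
        return pool.size();
    }

    // Removes every fingerprint used at least maxUses times and
    // returns how many entries were removed.
    int flush(int maxUses) {
        int before = pool.size();
        pool.removeIf(f -> f.timesUsed >= maxUses);
        return before - pool.size();
    }
}
```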

Class FilterAdvisor

* Responsibilities:
Create fingerprint functions to be used by filters.
* Collaborators:
Filter fingerprint pool, Web page database.
* Variables and Methods:
public DataPool filterFingerprintPool;
public void CreateRequiredFingerprints()
This method looks up the web page database and roughly divides it into distinct classes of pages, based on content. For each class of pages, this method calls CreateOneFingerprint. The idea is that there will then exist one fingerprint function for each distinct area of the space of pages. This runs as a thread that starts either when the space of pages is smaller than some predefined size, or when it is larger than that size and has grown by 10% (or some other chosen amount) since the last time this function ran.
public void CreateOneFingerprint (PageID [] representativePages)
This method looks at the representative pages and extracts a large number of features from them. These features are collectively used to create one fingerprint function. (Note the difference between what the ferret advisor and the filter advisor do here: the ferret advisor creates a fingerprint function based on just one feature; the filter advisor creates a fingerprint function based on a large number of features.)
public void RedoFingerprintFunction (PageID [] representativePages)
Periodically, the filter advisor looks up the filter fingerprint function pool. Each fingerprint function carries a measure of its reliability. If the advisor finds a very unreliable filter (that is, one that keeps rejecting pages, or whose accepted pages still have very low hittage after a long time), its fingerprint function must be redone. The representative pages for the fingerprint function are picked up again, and their features and/or thresholds are recomputed (a different set of features could be used, or thresholds could be lowered or increased).

Class PageSelector

* Responsibilities:
Skip around the space of pages to select pages that can be used for making good fingerprint functions.
* Collaborators:
None (this class is never instantiated).
* Variables and Methods:
WebpageDB theDatabase; // whatever the class is called
abstract PageID[][] IdentifyNeededTypes()
This method will be implemented in subclasses.
abstract PageID[] getRepresentativePages(PageID[] classOfPages)
This method will be implemented in subclasses.

Class FerretPageSelector

* Responsibilities:
Skip around the space of pages and identify seed sites for each class of pages the user likes.
* Collaborators:
the FerretAdvisor class, Web page database
* Variables and Methods:
PageID [][] IdentifyNeededTypes()
This method looks up pages in the web page database that have high hittage. It then divides these pages into classes based on content similarity and the location of the lighthouses relative to each page. These classes are returned to the calling function.
PageID [] getRepresentativePages (PageID [] classOfPages)
This function looks through the list of pages it is given and returns a subset that is representative of all pages in the list. Typically, representative pages will be chosen on the following basis:
- history of this page as a representative page
- hit rate
- number of links
- size of the page
- percentage of page occupied by raw text
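The criteria above could be combined into a single score, as in the sketch below. The weights, the direction in which each criterion counts, and the names CandidatePage and RepresentativeChooser are all assumptions made for illustration; the design does not say how the criteria are weighed against each other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of the per-page values named in the list above.
class CandidatePage {
    final int pageId;
    final int timesChosenBefore; // history as a representative page
    final double hitRate;
    final int linkCount;
    final int sizeInKb;
    final double textFraction;   // share of the page occupied by raw text

    CandidatePage(int pageId, int timesChosenBefore, double hitRate,
                  int linkCount, int sizeInKb, double textFraction) {
        this.pageId = pageId;
        this.timesChosenBefore = timesChosenBefore;
        this.hitRate = hitRate;
        this.linkCount = linkCount;
        this.sizeInKb = sizeInKb;
        this.textFraction = textFraction;
    }
}

class RepresentativeChooser {

    // Weighted sum of the criteria. All weights are arbitrary placeholders.
    static double score(CandidatePage p) {
        return 1.0 * p.timesChosenBefore
             + 10.0 * p.hitRate
             + 0.1 * p.linkCount
             + 0.01 * p.sizeInKb
             + 2.0 * p.textFraction;
    }

    // Returns the IDs of the 'count' highest-scoring pages, best first.
    static List<Integer> choose(List<CandidatePage> group, int count) {
        List<CandidatePage> sorted = new ArrayList<>(group);
        sorted.sort((a, b) -> Double.compare(score(b), score(a)));
        List<Integer> chosen = new ArrayList<>();
        for (int i = 0; i < Math.min(count, sorted.size()); i++) {
            chosen.add(sorted.get(i).pageId);
        }
        return chosen;
    }
}
```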

Class FilterPageSelector

* Responsibilities:
Skip around the space of pages and identify seed sites for each class of pages that exist.
Note: we try to generate at least one filter fingerprint function for each class of pages in the space of pages. A class of pages is a group of pages having related content.
* Collaborators:
the FilterAdvisor class, Web page database
* Variables and Methods:
PageID [][] IdentifyNeededTypes()
This method will go through the web page database and identify different classes of pages, making use of lighthouses. It tries to get classes of pages representing each part of the space of pages. Pages are classified based on content similarity.
PageID [] getRepresentativePages (PageID [] classOfPages)
Given a class of pages, this method looks through them to find a subset of pages that are representative of this class. Pages will be selected based on whether they exhibit typical features of the class.

Class FeatureExtractor

* Responsibilities:
Given a document node, identify the features of this page that distinguish it from other pages in the space of pages.
* Collaborators:
the FerretAdvisor and FilterAdvisor, the Feature set of classes.
* Variables and Methods:
WebpageDB theDatabase;
Feature [] getDistinguishingFeatures(PageID webPage)
This method computes all possible features of this web page (see the Feature package for the types of Features). For each feature, it also randomly selects some number of pages from the web page database and computes the same Feature for those pages. It compares those feature values with the feature value of this page and on that basis decides if this is a significant feature. It returns a list of such significant features.

Message Interactions

Message interactions with other packages
FerretAdvisor will call void FerretFingerprintPool.insert (FerretFingerprintFunction)
FilterAdvisor will call void FilterFingerprintPool.insert (FilterFingerprintFunction)
With WebpageDB
The PageSelector and FeatureExtractor classes will use various methods of the webpageDB to access information:
* Methods dealing with access of document nodes
- method to access a document node given the page ID
- method to get a random document node in the space of pages
- method to return a vector of page IDs of the lighthouses in the space of pages
* Methods dealing with accessing information for a given document node. All these methods take a page ID and return the required attribute. The attributes that must be accessible are:
- title
- URL
- page content (entire HTML)
- number of links
- number of images
- number of applets
- number of frames
- URLs of links
- hittage
- language in which document is written
- meta tag information
- access history
- the fingerprint ID of the filter fingerprint that approved the page
- goodness as a seed site

With Feature
The FeatureExtractor class will create various instances of Feature subclasses in the process of finding distinguishing features.
With Fingerprint
The FerretAdvisor and FilterAdvisor will create various instances of the FerretFingerprintFunction and FilterFingerprintFunction.
Message interactions between classes in the package
FerretAdvisor --- FerretPageSelector
FerretAdvisor will call PageID [][] IdentifyNeededTypes() to get pages that are required, divided into different groups.
FerretAdvisor will call PageID [] getRepresentativePages(PageID [] classOfPages) to get a set of representative pages for each group of pages for which it wants to create fingerprints.
FilterAdvisor --- FilterPageSelector
Exactly the same as the above interaction.
FerretAdvisor --- FeatureExtractor
FerretAdvisor will call getDistinguishingFeatures(PageID).
FilterAdvisor --- FeatureExtractor
FilterAdvisor will call getDistinguishingFeatures(PageID).