feature package


Package Contract

The feature package provides a class for each feature type. Each Feature subclass has methods to compute a feature value and to compare it with another feature of the same type.

New features can be added simply by subclassing. The interface for computing and comparing features is defined in the Feature class, so all other classes can treat all features the same way. Features of a page are computed from a combination of the page's attributes, such as word frequencies, the number of images, notable properties of the URL, and so on.

Fingerprints consist of one or more features. When an incoming page is being checked for acceptability, the same feature is computed for the incoming page. The newly computed feature is compared with the stored feature (associated with this fingerprint function), and a measure of similarity is returned. That measure is compared against the threshold for this fingerprint, which decides whether the page is acceptable.

Each Feature subclass provides its own implementation of the two abstract methods of class Feature, computeFeatureValue() and compareFeatureValue(). Packages using class Feature, however, do not have to know how this is done. For example, each ferret fingerprint will hold an instance of one of the Feature subclasses. To determine whether a fingerprint function, when applied to a page, has a value above the threshold, the ferret will call the isPageGood() method (see the fingerprint package for details). The isPageGood() method will call the decidingFunction() method. This method creates an object of the appropriate Feature subclass and computes the feature value using that object's computeFeatureValue() method. It then calls compareFeatureValue() to compare the feature value of the new page with the one the ferret holds in its fingerprint. The value returned by compareFeatureValue() is passed back to isPageGood(), which compares it to the threshold.
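The call chain described above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the WebPage and Fingerprint classes, the TextLengthFeature subclass, the Supplier-based factory, the void return type of computeFeatureValue(), and the use of >= for the threshold test are all hypothetical and are not specified by this document.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a fetched page; this document does not define the type.
class WebPage {
    final String url;
    final String text;
    WebPage(String url, String text) { this.url = url; this.text = text; }
}

// Feature reduced to the two abstract methods the call chain needs.
abstract class Feature {
    protected Object featureValue;
    abstract void computeFeatureValue(WebPage page);      // assumed to store its result
    abstract double compareFeatureValue(Feature other);   // similarity measure
}

// Toy subclass (assumption): the feature value is the length of the page text.
class TextLengthFeature extends Feature {
    @Override
    void computeFeatureValue(WebPage page) { featureValue = page.text.length(); }

    @Override
    double compareFeatureValue(Feature other) {
        int a = (Integer) featureValue;
        int b = (Integer) other.featureValue;
        int max = Math.max(a, b);
        // Ratio of the shorter length to the longer one, in [0, 1].
        return max == 0 ? 1.0 : (double) Math.min(a, b) / max;
    }
}

// Hypothetical fingerprint holding one stored feature and a threshold.
class Fingerprint {
    private final Feature storedFeature;
    private final Supplier<Feature> featureFactory; // creates a fresh Feature of the same subclass
    private final double threshold;

    Fingerprint(Feature storedFeature, Supplier<Feature> featureFactory, double threshold) {
        this.storedFeature = storedFeature;
        this.featureFactory = featureFactory;
        this.threshold = threshold;
    }

    // Computes the feature for the incoming page and compares it with the stored one.
    double decidingFunction(WebPage page) {
        Feature incoming = featureFactory.get();
        incoming.computeFeatureValue(page);
        return incoming.compareFeatureValue(storedFeature);
    }

    // Compares the similarity measure against the threshold.
    boolean isPageGood(WebPage page) {
        return decidingFunction(page) >= threshold;
    }
}
```

Note that Fingerprint never refers to TextLengthFeature by name; it works only through the abstract Feature methods, which is the point of the abstraction.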

Package-Level CRC

Collaborators:
Classes in package feature work with classes in the following packages:

* Advisor
* Fingerprint

Responsibilities:
The feature package defines the kinds of features a page could have, along with methods to compute and compare values for those features.

Class-Level CRCs

The feature package contains the following classes:

* FeatureType
* Feature
* various subclasses of class Feature for different kinds of features

Class FeatureType

* Collaborators:
objects of the Feature class and its subclasses
* Responsibilities:
provide a distinction between various types of features
* Variables and Methods:
int featureType;
The type of this variable could later be changed without affecting other classes.
boolean sameFeatureType(FeatureType otherFeatureType)
Determine whether two feature types are the same.
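
A minimal sketch of how FeatureType might look in Java is given here. The named constants and the constructor are assumptions made for illustration. Keeping the int private is one way to make sure its type can change later without affecting other classes, since callers only ever use sameFeatureType().

```java
// Sketch of FeatureType; the constants below are hypothetical examples.
final class FeatureType {
    static final int TITLE = 0;
    static final int WORD_VECTOR = 1;

    // Private, so the representation can change without touching callers.
    private final int featureType;

    FeatureType(int featureType) {
        this.featureType = featureType;
    }

    // Determines whether two feature types are the same.
    boolean sameFeatureType(FeatureType otherFeatureType) {
        return otherFeatureType != null
                && featureType == otherFeatureType.featureType;
    }
}
```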

Class Feature

* Collaborators:
Classes in the advisor package instantiate objects of the Feature class. Classes in package fingerprint use objects of class Feature in determining whether a page has certain features.
* Responsibilities:
This class will provide a framework for defining and using a feature of a web page. Subclasses of this class will actually implement the methods specified here.
* Variables and Methods:
Object featureValue;
FeatureType type;
Object getFeatureValue()
Return the feature value. Can be interpreted appropriately by the calling method.
webPage computeFeatureValue(webPage)
This method looks up the webPage and computes whatever feature is needed. The computed value will be stored. This is an abstract method of class Feature; it is to be implemented by each subclass of class Feature.
double compareFeatureValue(Feature otherFeature)
If two features have the same type, this method compares their values and other information and returns a measure of similarity of the features. This is also an abstract method of class Feature.
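
The class card above might translate into Java along the following lines. This is a sketch, not the final design: the WebPage type, the void return type of computeFeatureValue(), the type code, and the choice to throw an exception when the feature types differ are all assumptions. TitleFeature is used as the example subclass, with a deliberately crude exact-match comparison.

```java
import java.util.Locale;

// Hypothetical page type exposing only what this example needs.
class WebPage {
    final String title;
    WebPage(String title) { this.title = title; }
}

// Reduced FeatureType: just enough to tag a feature.
class FeatureType {
    private final int featureType;
    FeatureType(int featureType) { this.featureType = featureType; }
    boolean sameFeatureType(FeatureType other) {
        return other != null && featureType == other.featureType;
    }
}

abstract class Feature {
    protected Object featureValue;
    protected final FeatureType type;

    // Every Feature is given a FeatureType when it is created.
    protected Feature(FeatureType type) { this.type = type; }

    // The caller interprets the returned value according to the feature type.
    Object getFeatureValue() { return featureValue; }

    // Assumed here to return nothing and to store its result in featureValue.
    abstract void computeFeatureValue(WebPage webPage);

    // Returns a similarity measure between this feature and otherFeature.
    abstract double compareFeatureValue(Feature otherFeature);
}

// Example subclass: the feature value is the normalized page title.
class TitleFeature extends Feature {
    private static final int TITLE_TYPE = 1; // hypothetical type code

    TitleFeature() { super(new FeatureType(TITLE_TYPE)); }

    @Override
    void computeFeatureValue(WebPage webPage) {
        featureValue = webPage.title.trim().toLowerCase(Locale.ROOT);
    }

    // Assumes computeFeatureValue() has already been called on both features.
    @Override
    double compareFeatureValue(Feature otherFeature) {
        if (!type.sameFeatureType(otherFeature.type)) {
            throw new IllegalArgumentException("feature types differ");
        }
        // Crude measure: 1.0 for identical titles, 0.0 otherwise.
        return featureValue.equals(otherFeature.getFeatureValue()) ? 1.0 : 0.0;
    }
}
```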

Other Classes

Actual Features will be subclasses of class Feature. Each subclass will implement the computeFeatureValue() method to compute a specific feature. For example, one subclass might be something like URLStructureFeature, which implements computeFeatureValue() to get the URL of a page and create a vector which stores significant substrings of the URL and their positions in the URL. The same class would also implement the compareFeatureValue() method to compare the stored feature value with the given one (in this case, it would compare two vectors) and return a measure of their similarity.
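
To make the URLStructureFeature idea concrete, the following standalone sketch shows one possible extraction and comparison. In the design above, these two static methods would become the bodies of computeFeatureValue() and compareFeatureValue(). Everything here is an assumption for illustration: the class name UrlStructureSketch, the choice of delimiters, treating every delimited piece as significant, recording ordinal positions rather than character offsets, using a set instead of a vector, and using Jaccard similarity as the measure.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class UrlStructureSketch {

    // Splits a URL on common delimiters and records each non-empty piece
    // together with its ordinal position, for example "3=docs".
    static Set<String> extract(String url) {
        Set<String> parts = new LinkedHashSet<>();
        int position = 0;
        for (String piece : url.toLowerCase(Locale.ROOT).split("[:/?&=.#]+")) {
            if (!piece.isEmpty()) {
                parts.add(position + "=" + piece);
                position++;
            }
        }
        return parts;
    }

    // Jaccard similarity: shared entries divided by distinct entries overall.
    static double similarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> shared = new LinkedHashSet<>(a);
        shared.retainAll(b);
        int union = a.size() + b.size() - shared.size();
        return (double) shared.size() / union;
    }
}
```

Because each entry pairs a substring with its position, two URLs score highly only when the same pieces occur in the same places, for example two pages in the same directory of the same site.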

Possible Feature subclasses:
* TitleFeature
* WordVectorFeature
* LinksFeature
* TextPercentageFeature
* HeadingWordsFeature
* LanguageUsedFeature
* ModificationDateFeature
* SpecialHTMLTagsFeature
* ImageAnalysisFeature
* ColorChangeFeature
* SimilarWordsFeature

More subclasses will be added since the filter fingerprint functions will use a large number of features to get an idea of what kinds of pages the user wants.

Message Interactions

Messages with other packages
* with Advisor:
The FeatureExtractor class will generate distinguishing Features for a document node and will thus create instances of each subclass of the Feature class.
* with Fingerprint:
The decidingFunction() method of each fingerprint will create an instance of the appropriate Feature subclass for each incoming page and will call compareFeatureValue() to compare the new Feature with the stored one.
Messages between classes in this package
* All Features will create an instance of FeatureType on creation.