This package includes the Ferret, FerretArray and FerretFingerprintPool classes.
A Ferret fetches a page from the web given its URL, tests the page, and, if the page is approved, passes it on to the filters.
The FerretArray creates 10 new Ferrets to start the process. Each Ferret, before it dies, creates a new Ferret, and so the process goes on.
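The replacement scheme can be sketched as follows. This is only an illustration under stated assumptions, not the actual FerretArray code: the class name FerretArraySketch, the thread pool, and the overall budget of Ferrets (added so that the example terminates) are all hypothetical. Only the initial count of 10 and the "spawn a successor before dying" rule come from the description above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: keeps a population of ferrets alive by letting each
// one start its successor just before it finishes.
class FerretArraySketch {
    static final int INITIAL_FERRETS = 10;          // from the description

    private final ExecutorService executor = Executors.newCachedThreadPool();
    private final AtomicInteger budget;             // assumed stop condition
    private final AtomicInteger created = new AtomicInteger();
    private final CountDownLatch finished;

    FerretArraySketch(int totalFerrets) {
        budget = new AtomicInteger(totalFerrets);
        finished = new CountDownLatch(totalFerrets);
    }

    // Creates the first generation of ferrets.
    void start() {
        for (int i = 0; i < INITIAL_FERRETS; i++) {
            spawn();
        }
    }

    // Starts one ferret unless the overall budget is used up.
    private void spawn() {
        if (budget.getAndDecrement() <= 0) {
            return;
        }
        created.incrementAndGet();
        executor.execute(() -> {
            try {
                // Placeholder for the real work: fetch a page and test it.
            } finally {
                spawn();                 // create the successor before dying
                finished.countDown();
            }
        });
    }

    // Waits until every budgeted ferret has finished, then stops the pool.
    boolean awaitCompletion(long timeoutMillis) {
        try {
            boolean done = finished.await(timeoutMillis, TimeUnit.MILLISECONDS);
            executor.shutdown();
            return done;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            executor.shutdownNow();
            return false;
        }
    }

    int ferretsCreated() {
        return created.get();
    }
}
```

Because every Ferret replaces itself exactly once, at most 10 Ferrets are active at any time in this sketch; the budget exists only so that the example has an end.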
The FerretFingerprintPool is a store for the FerretFingerprintFunctions.
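As a rough illustration of such a store, the sketch below supports the two operations mentioned later in this description: insertion (by the FerretAdvisor) and random selection (by a Ferret). The class name FingerprintPoolSketch, the generic element type and the method names are hypothetical. Whether a picked function stays in the pool is not stated in this description; the sketch assumes that it does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a pool from which entries are picked at random.
class FingerprintPoolSketch<F> {
    private final List<F> functions = new ArrayList<>();
    private final Random random;

    FingerprintPoolSketch(Random random) {
        this.random = random;
    }

    // Adds a fingerprint function to the pool.
    synchronized void insert(F function) {
        functions.add(function);
    }

    // Returns a randomly chosen entry, or null if the pool is empty.
    // Assumption: the entry is not removed from the pool.
    synchronized F pickRandom() {
        if (functions.isEmpty()) {
            return null;
        }
        return functions.get(random.nextInt(functions.size()));
    }

    synchronized int size() {
        return functions.size();
    }
}
```

The methods are synchronized because several Ferrets are assumed to read from the pool while the advisor writes to it.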
The FerretFingerprintFunction contains information about the page to be fetched and also indicates which tests must be applied to this page before it can be approved and passed on to the next stage. A seedsite is the term given to the basic representative page in the space of pages chosen to be picked up by one of the Ferrets. It is of type PageID and acts as a reference to a DocumentNode.
The DocumentNode holds a list of PageAttributes for each page in the space of pages, such as the URL of the page and the number of links on the page. These attributes are used by the Ferret.
The Cache contains all pages fetched from the web by the Ferrets. It serves as a lookup for later Ferrets: a Ferret first examines the Cache to check whether a page needs to be fetched from the web or has already been fetched by a previous Ferret.
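The check-before-fetch pattern might look like the sketch below. It is not the real Cache class: the name PageCacheSketch, the use of a map keyed by URL string, and the fetcher function standing in for the actual web download are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical sketch: a page is downloaded only if no earlier ferret
// has already stored it in the cache.
class PageCacheSketch {
    private final Map<String, String> pages = new ConcurrentHashMap<>();
    private final AtomicInteger fetches = new AtomicInteger();
    private final Function<String, String> fetcher;  // stand-in for a web fetch

    PageCacheSketch(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the cached page, fetching and storing it on a cache miss.
    String getOrFetch(String url) {
        return pages.computeIfAbsent(url, u -> {
            fetches.incrementAndGet();
            return fetcher.apply(u);
        });
    }

    // Number of times the fetcher was actually invoked.
    int fetchCount() {
        return fetches.get();
    }
}
```

For example, calling getOrFetch twice with the same URL invokes the fetcher only once; the second call is answered from the map.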
The FerretApprovedPagesCache contains all the pages approved by the Ferrets. This is looked up by the Filter, the next link in the chain of page analyzers.
This package allows the FerretAdvisor to insert FerretFingerprintFunctions into the FerretFingerprintPool. A Ferret then picks a FerretFingerprintFunction at random and obtains the seedsite contained in it (which is of type PageID). Using this PageID, it requests a DocumentNode. The DocumentNode provides a list of PageAttributes (which include the links on the page). The Ferret checks the Cache to see whether all these URLs (the links on the seedsite page) have already been fetched by previous Ferrets. Any pages that have not been fetched are fetched and modified for base tags. (Some websites use relative addressing for images and links. It is therefore essential for the Ferret to insert a base tag which indicates the Home URL, that is, the base URL of the current page.) The Ferret then stores these pages in the Cache. Finally, the Ferret applies the corresponding FerretFingerprintFunction to each fetched page to decide whether the page should be accepted. If so, it stores the page in the FerretApprovedPagesCache; otherwise it rejects the page.
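The base tag step described above can be illustrated with a minimal sketch. The class and method names are hypothetical, and the regular-expression approach is an assumption rather than the Ferret's actual method; a real implementation would more likely use an HTML parser and would escape the URL before writing it into the attribute.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: inserts <base href="..."> so that relative links and
// image addresses in a cached page still resolve against the original site.
class BaseTagSketch {
    // Matches <head> or <head ...>, but not <header>.
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);
    private static final Pattern BASE_TAG =
            Pattern.compile("<base\\s", Pattern.CASE_INSENSITIVE);

    static String insertBaseTag(String html, String baseUrl) {
        // Leave pages alone that already declare a base URL.
        if (BASE_TAG.matcher(html).find()) {
            return html;
        }
        String baseTag = "<base href=\"" + baseUrl + "\">";
        Matcher head = HEAD_OPEN.matcher(html);
        if (head.find()) {
            // Insert directly after the opening head tag.
            return html.substring(0, head.end()) + baseTag
                    + html.substring(head.end());
        }
        // Assumption: with no head element, prepending the tag is acceptable.
        return baseTag + html;
    }
}
```

With the tag in place, a relative reference such as images/logo.gif in the cached copy resolves against the base URL instead of against the location of the cache.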
fingerprint package: FerretFingerprintFunction
advisor package: FerretAdvisor
cache package: Cache, FerretApprovedPagesCache
webpageDB package: DocumentServerImpl, DocumentNode, PageID
parser package: PageAttribute, AddPageContent, AddPageURL, AddSeedSite