webpagedb package


Package Contract

This package assumes responsibility for encapsulating information about each page in the space of pages. The unit of information in the webpage database is a document node which stores per-page information. The webpage database also stores the logical connections between the webpages with respect to various attributes.

The system recognizes distinct areas of user interest and generates clusters for each of these areas. In the current version, cluster creation is static and is done only once, at startup, based on the user's bookmarks. In other words, clusters and sub-clusters are created according to the hierarchical organization of the bookmarks. Each page in the initial set is restricted to the single cluster the user thinks it fits best. As new pages come in, however, they may be placed in one or more clusters depending on their similarity to pages in the various clusters.

Package-Level CRC

Collaborators: Classes in package WebpageDB work with classes in the following packages:

* Advisor
The Advisor package is provided with a means to retrieve:
  1. A specified number of pages with highest hittage.
  2. A set of pages which are lighthouses (pages that have high hittage and together sufficiently represent all areas of KS).
  3. Any random page.
  4. A set of pages which have been fetched using a given fingerprint function (this is later used to fine tune the fingerprint function).

* GUI
The GUI package is provided with a means to retrieve all html pages (this is required on startup) or html pages that have changed in some way (e.g. new pages, deleted pages, updated pages).

* MapMaker
The MapMaker package is provided with a means to receive and send attribute values for a page.

Class-Level CRCs

The WebpageDB package contains the following classes:
* AccessHistory
* AccessSession
* AttributeTable
* ChangeHistory
* ChangeOnWeb
* Cluster
* ClusterID
* ClusterManager
* DocumentNode
* DocumentServer
* DocumentServerImpl
* PageID
* Similarity
* TagParser
* Word
* WordAttribute
* WordDistribution

Class AccessHistory

* Collaborators:
This class is not currently used. It was intended to store the user's page access information, which the Generalizer would update through a call into this class. The information would be used within webpagedb to build a crude user model.
* Responsibilities:
This class maintains access history for a page. Access history data consists of a Vector of AccessSession objects called accessHistory and a long called aggregateAccessTime, which is the sum of the durationOfAccess values of all the AccessSession objects. The constructor takes no arguments and only initializes these members. The methods that maintain and access this data are described below.
* Variables and Methods:
private Vector accessHistory;
private long aggregateAccessTime;
public synchronized void logAccess(AccessSession as)
This method adds the AccessSession argument to the accessHistory Vector and updates aggregateAccessTime value.
public long getDurationOfAccess()
This method returns the aggregateAccessTime value.
public Vector getAccessLog()
This method returns the accessHistory Vector.
public synchronized void removeAccessLog()
This method removes all AccessSession objects from the accessHistory Vector and sets aggregateAccessTime back to 0.
public synchronized void removeFirstN(int n)
This method removes the AccessSession objects with indexes 0 to n-1 from the accessHistory Vector. Thus the oldest n AccessSessions are removed from the Vector. If n is greater than or equal to the size of the accessHistory Vector, all AccessSessions are removed from the Vector. In either case the aggregateAccessTime value is updated.
public synchronized long getDurationOfLastN(int n)
This method returns the sum of durationOfAccess values from the AccessSession objects in the greatest n index values of the accessHistory Vector. So this is the total access time for the most recent n AccessSessions. If the size of the accessHistory Vector is less than or equal to n then the total aggregateAccessTime is returned.
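
The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the package's actual source: the class name AccessHistorySketch and the nested Session type are hypothetical stand-ins for AccessHistory and AccessSession, and the handling of negative n is an assumption.

```java
import java.util.Date;
import java.util.Vector;

// Hypothetical stand-in for the AccessHistory/AccessSession pair described above.
class AccessHistorySketch {
    /** One access: when it happened and how long it lasted, in seconds. */
    static final class Session {
        final Date timeOfAccess;
        final long durationOfAccess;

        Session(Date timeOfAccess, long durationOfAccess) {
            this.timeOfAccess = timeOfAccess;
            this.durationOfAccess = durationOfAccess;
        }
    }

    private final Vector<Session> accessHistory = new Vector<Session>();
    private long aggregateAccessTime = 0;

    /** Appends a session and keeps the running total in step. */
    synchronized void logAccess(Session s) {
        accessHistory.add(s);
        aggregateAccessTime += s.durationOfAccess;
    }

    synchronized long getDurationOfAccess() {
        return aggregateAccessTime;
    }

    /** Drops the oldest n sessions (indexes 0..n-1); drops everything if n >= size. */
    synchronized void removeFirstN(int n) {
        int count = Math.min(Math.max(n, 0), accessHistory.size()); // assumption: negative n removes nothing
        for (int i = 0; i < count; i++) {
            aggregateAccessTime -= accessHistory.remove(0).durationOfAccess;
        }
    }

    /** Sums the durations of the most recent n sessions. */
    synchronized long getDurationOfLastN(int n) {
        if (n >= accessHistory.size()) {
            return aggregateAccessTime;
        }
        long total = 0;
        for (int i = accessHistory.size() - Math.max(n, 0); i < accessHistory.size(); i++) {
            total += accessHistory.get(i).durationOfAccess;
        }
        return total;
    }
}
```

Keeping aggregateAccessTime as a running total means getDurationOfAccess() never has to walk the Vector.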





Class AccessSession
* Collaborators:
This class is not currently used. It was to be the data storage element of the AccessHistory class and only accessed by that class (which is also currently not used).
* Responsibilities:
This class maintains per-session access details for a page. These details are a Date, which is the time of the access, and a long, which is the duration of the access in seconds. The constructor requires Date and long arguments to set the timeOfAccess and durationOfAccess values respectively. There are also methods to return these values.
* Variables and Methods:
private Date timeOfAccess;
private long durationOfAccess;
public long getDuration()
This method returns the durationOfAccess value.
public Date getTimeStamp()
This method returns the timeOfAccess Date.





Class AttributeTable extends Hashtable
* Collaborators:
This class is instantiated once by the WordDistribution class. It is a static member of WordDistribution and is used only by WordDistribution.
* Responsibilities:
This final class is a Hashtable data structure. It is a table of WordAttribute objects which are indexed in the table by their String name. Each WordAttribute object has an int attributeWeight, which is a measure of how important this WordAttribute is in determining the information contained in a database entry. The AttributeTable is instantiated and initialized when KS is started. It provides methods to add attributes to the table and to get an attribute's weight given its String name (its index in the table).
* Variables and Methods:
public AttributeTable()
The constructor is called by WordDistribution to initialize the attribute table.
public void addAttribute(String name, int weight)
This method calls the Hashtable's put method to put a WordAttribute object into the table with index String name. The WordAttribute added will have the attributeWeight given by the int weight argument.
public WordAttribute getWordAttribute(String name)
This method takes the String name of an attribute and returns the corresponding WordAttribute object.
public int getAttributeWeight(String name)
This method returns the attributeWeight for the WordAttribute object with index String name.
private void initializeAttributeTable()
This method is called by the constructor. It uses the addAttribute method to put several standard WordAttributes in the initial AttributeTable.
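
A minimal sketch of such a weight table follows. It uses composition rather than extending Hashtable to keep the example short. The class name AttributeTableSketch, the initial attribute names and weights, and the choice to return 0 for an unknown name are all assumptions made for illustration, not values taken from the package.

```java
import java.util.Hashtable;

// Hypothetical sketch of a name-indexed table of weighted attributes.
class AttributeTableSketch {
    /** A named attribute and its weight. */
    static final class WordAttribute {
        final String name;
        final int attributeWeight;

        WordAttribute(String name, int attributeWeight) {
            this.name = name;
            this.attributeWeight = attributeWeight;
        }
    }

    private final Hashtable<String, WordAttribute> table = new Hashtable<>();

    AttributeTableSketch() {
        initializeAttributeTable();
    }

    void addAttribute(String name, int weight) {
        table.put(name, new WordAttribute(name, weight));
    }

    WordAttribute getWordAttribute(String name) {
        return table.get(name);
    }

    /** Returns the weight for the named attribute, or 0 if it is unknown (assumption). */
    int getAttributeWeight(String name) {
        WordAttribute attribute = table.get(name);
        return attribute == null ? 0 : attribute.attributeWeight;
    }

    /** Seeds the table; these names and weights are illustrative only. */
    private void initializeAttributeTable() {
        addAttribute("title", 5);
        addAttribute("h1", 3);
        addAttribute("body", 1);
    }
}
```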





Class ChangeHistory
* Collaborators:
This class is not currently used. It was intended to store the user's page-changing activity. The Generalizer would call ChangeHistory to keep this information updated. The information would be used within webpagedb as a crude user model.
* Responsibilities:
This class maintains the change history for a page. The data is a Vector of ChangeOnWeb objects called changeHistory. A long called timeSinceLastChange holds the difference, in milliseconds, between the timeOfChange values of the two most recent ChangeOnWeb objects. A double called changeRate holds the most recent ChangeOnWeb object's percentageChange value divided by timeSinceLastChange.
Note that changeRate is computed with timeSinceLastChange converted to hours. Thus, changeRate is the current percentage change per hour for the page that owns this ChangeHistory. The constructor takes no arguments and only initializes these values. The methods that maintain and access these values are described below.
* Variables and Methods:
private Vector changeHistory;
private long timeSinceLastChange;
private double changeRate;
public synchronized void logChange(ChangeOnWeb cow)
This method adds the ChangeOnWeb argument to the changeHistory Vector and updates timeSinceLastChange and changeRate values accordingly.
public long getTimeSinceLastChange()
This method returns timeSinceLastChange (in millis).
public Vector getChangeHistory()
This method returns the changeHistory Vector.
public synchronized void removeChangeHistoryLog()
This method removes all ChangeOnWeb objects from the changeHistory Vector and re-initializes timeSinceLastChange and changeRate values.
public synchronized void removeFirstN(int n)
This method removes the ChangeOnWeb objects with indexes 0 through n-1 from the changeHistory Vector. These should be the oldest n ChangeOnWeb objects in the Vector. If n is greater than the size of the Vector, all ChangeOnWeb objects are removed from the Vector. Note that changeRate and timeSinceLastChange are not affected unless all ChangeOnWeb objects are removed, in which case changeRate and timeSinceLastChange are re-initialized.
public double getChangeRate()
This method returns the changeRate value (most recent percentageChange per hour).
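
The changeRate arithmetic described above can be illustrated with a short sketch. The names ChangeHistorySketch and Change are hypothetical stand-ins for ChangeHistory and ChangeOnWeb. Leaving both values at zero until a second change is logged, and treating a non-positive interval as a rate of zero, are assumptions.

```java
import java.util.Date;
import java.util.Vector;

// Hypothetical sketch of the timeSinceLastChange / changeRate bookkeeping.
class ChangeHistorySketch {
    /** One observed change: when it happened and how much of the page changed. */
    static final class Change {
        final Date timeOfChange;
        final int percentageChange;

        Change(Date timeOfChange, int percentageChange) {
            this.timeOfChange = timeOfChange;
            this.percentageChange = percentageChange;
        }
    }

    private static final double MILLIS_PER_HOUR = 60 * 60 * 1000.0;

    private final Vector<Change> changeHistory = new Vector<>();
    private long timeSinceLastChange = 0; // milliseconds
    private double changeRate = 0.0;      // percent per hour

    synchronized void logChange(Change c) {
        if (!changeHistory.isEmpty()) {
            Change previous = changeHistory.lastElement();
            timeSinceLastChange = c.timeOfChange.getTime() - previous.timeOfChange.getTime();
            // Convert the interval to hours before dividing, as the text specifies.
            changeRate = timeSinceLastChange > 0
                    ? c.percentageChange / (timeSinceLastChange / MILLIS_PER_HOUR)
                    : 0.0; // assumption: no rate for a zero or negative interval
        }
        changeHistory.add(c);
    }

    synchronized long getTimeSinceLastChange() {
        return timeSinceLastChange;
    }

    synchronized double getChangeRate() {
        return changeRate;
    }
}
```

For example, a 30 percent change logged two hours after the previous change gives a changeRate of 30 / 2 = 15 percent per hour.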





Class ChangeOnWeb
* Collaborators:
This class is not currently used. It is the data element of the ChangeHistory class (which is also not currently used).
* Responsibilities:
This class maintains change details (per change) for a page. These details are a Date object, which is the time of change, and an int, which is the percentage change value for this ChangeOnWeb. The ChangeOnWeb constructor requires Date and int arguments to set the timeOfChange and percentageChange values. The primary member of the ChangeHistory class, changeHistory, is a Vector of ChangeOnWeb objects.
* Variables and Methods:
private Date timeOfChange;
private int percentageChange;
public int getChangePercent()
Method to return percentageChange value.
public Date getChangeTime()
Method to return timeOfChange value.





Class Cluster
* Collaborators:
This class is the data storage element of the ClusterManager class. It is used only by the ClusterManager class.
* Responsibilities:
The webpage database stores documents (also sometimes referred to as pages) that have been fetched from the web and have been approved as being of interest to the user. These documents could be webpages or plain text pages.
For these pages to be rendered by the interface, the pages must be organized in such a way so as to provide a notion of locality or lack thereof.
The Cluster class represents a collection of pages that share some commonality. Each cluster is identified by a unique ClusterID. A Cluster may have sub-clusters and always has a parent Cluster unless it is a top-level Cluster. Each Cluster knows what its own sub-clusters are, what its parent cluster is (clusters are organized as trees, so a cluster cannot have 2 parents), and how to add/delete a sub-cluster.
The ClusterManager manages all Cluster objects. The requests to create/delete/move Cluster objects received by the ClusterManager are directed to the appropriate Cluster for execution.
One of the motivations to cluster the pages was to effect an efficient algorithm for defining relationships between clusters. Clearly, computing the similarity of an incoming page with every single page in the database is infeasible for a large database. With clustering, we can reduce the search to only the pages in the clusters this page belongs to. The filters provide some hint as to where to place the page by providing, in the form of Annotations, a list of reference pages with which this page successfully passed the similarity test. The page is then associated with the cluster(s) that the reference pages belong to.
* Variables and Methods:
private Vector documents;
The documents in this cluster.
private Date dateCreated;
Date this cluster was created.
private String clusterName;
Nick name given to cluster by user.
private ClusterID parentClusterID;
ClusterID of the parent cluster.
private Hashtable subClusters;
This hashtable has ClusterIDs as keys and Cluster objects as values. It holds the subClusters of this cluster.
private Vector mostCommonWordsInCluster;
Developing the idea of exemplary attributes.
public Cluster()
Constructor initializes all instance variables.
public ClusterID ID()
Returns the ClusterID of this cluster.
public void setClusterName(String clusterName)
Sets the cluster name - user may provide this through the bookmarks.
public String getClusterName()
Returns this cluster's name.
public void setDateCreated(Date date)
Sets the dateCreated value for this cluster. Either from the bookmarks file or the current date.
public Date getDateCreated()
Returns the date this cluster was created.
public void addSubCluster(ClusterID newClusterID, Cluster newCluster)
Adds the new cluster to this cluster's subClusters hashtable and sets the parent ID of newCluster to this cluster's ID.
public void removeSubCluster(ClusterID clusterToRemove)
Removes clusterToRemove from this cluster's subClusters hashtable.
public Enumeration getSubClusters()
Returns an enumeration of this cluster's subClusters hashtable.
public ClusterID getParentID()
Returns this cluster's parent ID.
public void setParentID(ClusterID clusterID)
Sets this cluster's parent ID to clusterID.
public SortedList getFavorites()
Returns the private vector of documents for this cluster sorted by hittage value (highest hittage has lowest vector index).
public void addPage(DocumentNode documentToAdd)
Add a page to the cluster.
public void removePage(DocumentNode documentToRemove)
Remove a page from the cluster.
public Enumeration getPagesInCluster()
Return enumeration of the pages in the cluster.
public String toString()
Returns a concatenated string of the DocumentNodes in this cluster.
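
The ordering promised by getFavorites() (highest hittage at the lowest index) might be produced as in the sketch below. SortedList is a project class whose API is not shown here, so this hypothetical version returns a plain java.util.List instead. The names ClusterFavoritesSketch and Page are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Vector;

// Hypothetical sketch: a cluster's pages ordered by descending hittage.
class ClusterFavoritesSketch {
    /** Minimal stand-in for DocumentNode: a URL and its hittage. */
    static final class Page {
        final String url;
        final int hittage;

        Page(String url, int hittage) {
            this.url = url;
            this.hittage = hittage;
        }
    }

    private final Vector<Page> documents = new Vector<>();

    void addPage(Page page) {
        documents.add(page);
    }

    /** Returns a sorted copy, leaving the cluster's own Vector in insertion order. */
    List<Page> getFavorites() {
        List<Page> sorted = new ArrayList<>(documents);
        // Compare b to a so that larger hittage values come first.
        sorted.sort((a, b) -> Integer.compare(b.hittage, a.hittage));
        return sorted;
    }
}
```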





Class ClusterID
* Collaborators:
This class is only used by the Cluster class to give each new cluster an identification value.
* Responsibilities:
With many clusters in the page database, identifying a particular cluster might be difficult. This class helps to identify a particular cluster in the webpage database. It generates integer IDs starting from zero until MAXINT. It provides methods to compare 2 IDs and get the integer representing the cluster.
* Variables and Methods:
private static int lastClusterID;
private int id;
public ClusterID()
id = ++lastClusterID;
public int getID()
return id;
public boolean equals(ClusterID otherID)
return (id == otherID.getID());
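
One point worth noting when a ClusterID is used as a Hashtable key: java.util.Hashtable locates keys through hashCode() and equals(Object). If the class defines only the equals(ClusterID) overload listed above, that method overloads rather than overrides Object.equals(Object), and lookups fall back to identity comparison. The sketch below shows a value-based alternative. The class name ClusterIdSketch, the second constructor, and the starting counter value of -1 (chosen so the first ID is zero, as the text describes) are assumptions.

```java
// Hypothetical sketch of an ID class that is safe to use as a Hashtable key.
class ClusterIdSketch {
    private static int lastClusterID = -1; // assumption: first issued ID is 0

    private final int id;

    /** Issues the next unused ID. */
    ClusterIdSketch() {
        this.id = nextId();
    }

    /** Rebuilds a key for an already-issued ID (hypothetical helper for lookups). */
    ClusterIdSketch(int existingId) {
        this.id = existingId;
    }

    private static synchronized int nextId() {
        return ++lastClusterID;
    }

    int getID() {
        return id;
    }

    /** Overrides Object.equals so that two keys with the same integer are equal. */
    @Override
    public boolean equals(Object other) {
        return other instanceof ClusterIdSketch && ((ClusterIdSketch) other).id == id;
    }

    /** Must agree with equals: equal IDs produce equal hash codes. */
    @Override
    public int hashCode() {
        return id;
    }
}
```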





Class ClusterManager
* Collaborators:
There is only one ClusterManager, instantiated when the database is initialized (DocumentServerImpl.initializeDB()). The ClusterManager is used by the TagParser to create clusters from the folders of the bookmarks file. ClusterManager is also used in several methods of the DocumentServerImpl.
* Responsibilities:
The webpage database stores documents (also sometimes referred to as pages) that have been fetched from the web and have been approved as being of interest to the user. These documents could be webpages or plain text pages.
In order for these pages to be rendered by the interface, the pages must be organized in such a way so as to provide a notion of locality or lack thereof.
This class manages the dynamics of cluster creation, page placement within one or more clusters, page movement between clusters etc. Each cluster is represented by the Cluster class and identified by a unique ClusterID. The ClusterManager manages Cluster objects using a Hashtable to store them.
Each Cluster knows what its own subClusters are, what its parent cluster is (clustering is purely a tree, so a cluster cannot have 2 parents), and how to add/delete a sub-cluster to itself. All requests for adding/deleting/moving Cluster objects must be sent to the ClusterManager, which will then direct the request to the appropriate Cluster for execution.
* Variables and Methods:
Hashtable clusters;
This hashtable contains all Cluster objects. ClusterID can be used to hash into it.
public ClusterID createCluster()
Creates a new cluster whose location among other clusters is as yet undetermined. Creates a new Cluster object, generates a new ClusterID, and uses the ClusterID as hash key to place the Cluster object into the private Hashtable of Cluster objects.
public void putClusterIn(ClusterID outerID, ClusterID innerID)
Puts a cluster inside another cluster. Gets the Cluster objects corresponding to both ClusterIDs provided. Invokes the 'addSubCluster' method on the outer Cluster object placing the inner cluster in the outer cluster.
public ClusterID createClusterIn(ClusterID outerID)
Creates a cluster inside another cluster. Creates a new Cluster object and generates a ClusterID for it. Puts the Cluster object into the hashtable using the ClusterID as key. Invokes the 'putClusterIn' method of this class to insert the new Cluster object into the outer Cluster object.
public Cluster getCluster(ClusterID clusterID)
Looks up a cluster object given the cluster's ID. Uses the given clusterID to lookup the Hashtable of Cluster objects and returns the Cluster object found.
public Enumeration removeCluster(ClusterID clusterToRemove, boolean childFlag)
If childFlag is set, an enumeration of the subclusters is returned. If childFlag is clear, all subclusters and pages in clusterToRemove are removed and an enumeration of the removed subclusters is returned.
public void addPageToCluster(ClusterID id, DocumentNode pageToAdd)
Adds a page to a cluster. Looks up the Hashtable of Cluster objects using the cluster ID of the target cluster. Then invokes the 'addPage' method on the target Cluster to add the specified page to that Cluster.
public void removePage(ClusterID clusterID, DocumentNode pageToRemove)
Removes a page from a cluster. Looks up the Hashtable of Cluster objects using the cluster ID of the target cluster. Then invokes 'removePage' method on the target cluster to remove the specified page from that cluster.
public void movePage(ClusterID from, ClusterID to, DocumentNode pageToMove)
Moves a page from one cluster to another. Looks up the Hashtable of Cluster objects using the cluster IDs of the source and destination clusters. Invokes the 'removePage' method on the source cluster to remove the page from the source cluster. Then invokes the 'addPage' method on the destination cluster to add the page to the destination cluster.
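
The delegation pattern described above, in which the manager looks up Cluster objects by ID and forwards each request to them, can be sketched as follows. For brevity this hypothetical version uses plain int IDs in place of ClusterID and String URLs in place of DocumentNode. The name ClusterManagerSketch is an assumption.

```java
import java.util.Hashtable;
import java.util.Vector;

// Hypothetical sketch of a manager that routes requests to clusters by ID.
class ClusterManagerSketch {
    /** Minimal cluster: knows its parent, its sub-clusters and its pages. */
    static final class Cluster {
        final int id;
        Integer parentId; // null for a top-level cluster
        final Hashtable<Integer, Cluster> subClusters = new Hashtable<>();
        final Vector<String> pages = new Vector<>();

        Cluster(int id) {
            this.id = id;
        }

        /** Registers the child and points it back at this cluster. */
        void addSubCluster(Cluster child) {
            subClusters.put(child.id, child);
            child.parentId = id;
        }
    }

    private final Hashtable<Integer, Cluster> clusters = new Hashtable<>();
    private int nextId = 0;

    /** Creates a cluster whose place in the tree is not yet decided. */
    int createCluster() {
        Cluster cluster = new Cluster(nextId++);
        clusters.put(cluster.id, cluster);
        return cluster.id;
    }

    void putClusterIn(int outerId, int innerId) {
        clusters.get(outerId).addSubCluster(clusters.get(innerId));
    }

    /** Creates a cluster and immediately nests it inside outerId. */
    int createClusterIn(int outerId) {
        int innerId = createCluster();
        putClusterIn(outerId, innerId);
        return innerId;
    }

    Cluster getCluster(int id) {
        return clusters.get(id);
    }

    void addPageToCluster(int id, String page) {
        clusters.get(id).pages.add(page);
    }

    /** A move is a remove from the source followed by an add to the destination. */
    void movePage(int from, int to, String page) {
        clusters.get(from).pages.remove(page);
        clusters.get(to).pages.add(page);
    }
}
```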





Class DocumentNode
* Collaborators:
* Responsibilities:
This class is responsible for storing and serving information specific to a single page in the space of pages. It stores the time the page was added, whether the page was bookmarked, when the page was added to the bookmarks, when the page was last modified, when some change was made to the object itself (such as the hittage for the page increasing from 3 to 4), when the page was last visited by the user, when it was deleted by the Purger, which cluster(s) it belongs to, its hittage (defined as the frequency with which the page is accessed by the user), the attributes of the html page it represents, and its success as a seed site for the ferrets (defined as the number of pages that were accepted into the database with this page as the seed site). It also stores how similar it is to every other page in the cluster(s) it belongs to.
Although an exhaustive algorithm would compute the similarity of every incoming page with every other page in the database and store it, this is clearly inefficient when the database becomes large. Therefore, each page's extent of relationships is restricted by the cluster(s) it belongs to. This helps keep the computation of similar pages efficient. The filters provide some hint as to where to place the page by providing, in the form of Annotations, a list of KS pages with which this page successfully passed the similarity test. The page is then associated with the cluster(s) that the reference pages belong to.
Similarity is multi-dimensional in that many individual aspects/attributes of a page can be used to give measures of similarity, such as similarity with respect to content, similarity with respect to size and similarity with respect to the number of images in the page. Each measure of similarity is represented by a single Similarity object in this class.
* Variables and Methods:
private Date addedOn;
when this document was added to KS
private boolean inBookmarks;
whether this document was bookmarked
private Date addedToBookmarksOn;
if the document was present in the user's bookmarks, when it was added to the bookmarks
private Date lastModifiedOn;
when this document was last modified (will be useful when we have a browser at some later point)
private Date changedOn;
when this document was changed
private Date lastVisitedOn;
when this document was last visited
private Date deletedOn;
when this document was marked for deletion
private Vector belongToClusters;
this html document may have associations with multiple clusters; this vector stores the clusterIDs of all such clusters
private int successAsSeedSite;
number of pages that were accepted into KS using this page as a seedsite
private int hittage;
an indication of how much the user liked it
private PageAttribute pageAttributes;
attributes of the html document this document node represents
private Similarity contentSimilarity;
stores how this page is connected to other pages in KS
private Similarity sizeSimilarity;
private Similarity imageNumSimilarity;
public DocumentNode()
public void belongToClusters(Vector belongToClusters)
Function : sets the clusters this DocumentNode belongs to
Arguments : Vector of ClusterIDs
public void addedToBookmarksOn(Date someDate)
Function : sets the date this DocumentNode was added to the bookmarks
Arguments : Date object
public void modifiedOn(Date someDate)
Function : sets the date when the webpage was last modified
Arguments : Date object
public void lastVisitedOn(Date someDate)
Function : sets the date when the page was last visited
Arguments : Date object
public void setAddedDate(Date someDate)
Function : sets the date when the page was added to KS
Arguments : Date object
public void setDeletedDate(Date someDate)
Function : sets the date when the page was removed from KS
Arguments : Date object
public void setAttributes(PageAttribute someAttributes)
Function : sets the attributes of the html page
Arguments : PageAttribute object
public PageAttribute getAttributes()
Function : gets the attributes of the html page
Return value : PageAttribute object
public Date getAddedDate()
Function : gets the date the page was added to KS
Return value : Date object
public Date getChangedDate()
Function : gets the date when some component of this object was modified
Return value : Date object
public Date getDeletedDate()
Function : gets the date when the object was deleted from KS
Return value : Date object
public int getSuccessAsSeedSite()
Function : gets the number of pages that were accepted into KS by the filters using this page as seed site
Return value : number of pages that were accepted into KS using this page as seed site
public int getHittage()
Function : gets the number of times the user visited this page
Return value : number of times the user visited this page
public void incrementHittage()
Function : increments the hittage to reflect another visit to this page
public void computeSizeSimilarity()
Function : given the cluster(s) the document belongs to, computes the similarities with pages in the same cluster(s) with respect to the size of the page, then sets the similarity object.
public void computeContentSimilarity()
Function : given the cluster(s) the document belongs to, computes the similarities with pages in the same cluster(s) with respect to the content of the page, then sets the similarity object.
public void computeImageNumSimilarity(Similarity imageNumSimilarity)
Function : given the cluster(s) the document belongs to, computes the similarities with pages in the same cluster(s) with respect to the number of images on the page, then sets the similarity object.
public Similarity getSizeSimilarity()
Function : return the Similarity object which contains similarities of this page with pages in cluster(s) this page belongs to on the basis of size.
public Similarity getContentSimilarity()
Function : return the Similarity object which contains similarities of this page with pages in cluster(s) this page belongs to on the basis of content.
public Similarity getImageNumSimilarity()
Function : return the Similarity object which contains similarities of this page with pages in cluster(s) this page belongs to on the basis of number of images.
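
The idea of restricting similarity computation to the clusters a page belongs to can be sketched as below. The similarity measure itself is not specified in this document, so the ratio used here (smaller size divided by larger size) is purely a hypothetical placeholder, as are the name SizeSimilaritySketch and the use of int cluster IDs and long page sizes.

```java
import java.util.Hashtable;
import java.util.Vector;

// Hypothetical sketch: compare a page only with pages in its own clusters.
class SizeSimilaritySketch {
    /** Placeholder measure: ratio of the smaller size to the larger, in (0, 1]. */
    static double sizeSimilarity(long sizeA, long sizeB) {
        if (sizeA <= 0 || sizeB <= 0) {
            return 0.0;
        }
        return (double) Math.min(sizeA, sizeB) / Math.max(sizeA, sizeB);
    }

    /**
     * Computes similarities against pages in the listed clusters only.
     * clusterPageSizes maps a cluster ID to the sizes of the pages in that cluster;
     * clusters the page does not belong to are never visited.
     */
    static Vector<Double> computeSizeSimilarity(long newPageSize,
                                                Vector<Integer> belongToClusters,
                                                Hashtable<Integer, Vector<Long>> clusterPageSizes) {
        Vector<Double> result = new Vector<>();
        for (Integer clusterId : belongToClusters) {
            Vector<Long> sizes = clusterPageSizes.get(clusterId);
            if (sizes == null) {
                continue; // unknown cluster ID: nothing to compare with
            }
            for (Long size : sizes) {
                result.add(sizeSimilarity(newPageSize, size));
            }
        }
        return result;
    }
}
```

The cost of adding a page then grows with the size of its clusters rather than with the size of the whole database.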





Interface DocumentServer extends Remote
* Collaborators:
* Responsibilities:
The webpage database stores documents (also sometimes referred to as pages) that have been fetched from the web and have been approved as being of interest to the user. These documents could be webpages or plain text pages.
This interface provides remote access to the webpage server. The clients of the webpage database include the advisor, interface, mapmaker and the generalizer.
The server provides methods to insert a document into the database, remove a document from the database, fetch a random document from the database, retrieve the user's most favorite documents, query the database for a document given a pageID, retrieve pages that are similar to a given page, retrieve pages that have been added/deleted/changed since a given time, and to retrieve all pages in the database.

* Variables and Methods:
public PageID insertNode(PageAttribute pageAttributes) throws RemoteException;
inserts dnode into the DB; invoked by WebpageDB
public DocumentNode removeNode(PageID pid) throws RemoteException;
removes page with id = pid; invoked by Purger
public DocumentNode getRandomDocument() throws RemoteException;
returns a random DocumentNode; invoked by Advisor
public Vector getFavoritePages(int numberOfPages) throws RemoteException;
returns numberOfPages pages having the highest hittage; invoked by Advisor
public DocumentNode getDocumentNode(PageID someID) throws RemoteException;
returns a DocumentNode given a pageID; invoked by Advisor
public Vector getDocumentNode(int fpid) throws RemoteException;
returns a vector of DocumentNodes corresponding to documents that were added by the fingerprint function whose id is fpid; invoked by Advisor
public Vector getExemplars() throws RemoteException;
returns a vector of DocumentNodes with the highest hittage and representative of every area of KS; invoked by Advisor
public Vector getChangedPages(Date givenTime) throws RemoteException;
gets pages that have been changed since givenTime; invoked by Interface
public Vector getNewPages(Date givenTime) throws RemoteException;
gets pages that have been added since givenTime; invoked by Interface
public Vector getDeletedPages(Date givenTime) throws RemoteException;
gets pages that have been deleted since givenTime; invoked by Interface
public Vector getAllPages() throws RemoteException;
gets all pages, returning a vector of type Document; invoked by Interface
public DataInputStream getPage(PageID someID) throws RemoteException;
gets a reference to the HTML object given a pageID; invoked by Interface
public Vector similarPages(PageID pid, int numberOfPages, int thisAttribute) throws RemoteException;
get "numberOfPages" pages that are similar to page with pageID pid with respect to attribute "thisAttribute"
public Vector similarPages(DocumentNode dnode, int numberOfPages) throws RemoteException;
get "numberOfPages" pages that are similar to page represented by DocumentNode dnode
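
The time-based queries above (pages added or deleted since a given time) amount to filtering nodes by one of their timestamps. A minimal sketch, without the RMI layer, is shown here. The names PagesSinceSketch and Node are hypothetical, and treating "since" as strictly after givenTime is an assumption.

```java
import java.util.Date;
import java.util.Vector;

// Hypothetical sketch of the "since givenTime" queries as timestamp filters.
class PagesSinceSketch {
    /** Minimal stand-in for DocumentNode carrying only the relevant dates. */
    static final class Node {
        final String url;
        final Date addedOn;
        final Date deletedOn; // null while the page is still in the database

        Node(String url, Date addedOn, Date deletedOn) {
            this.url = url;
            this.addedOn = addedOn;
            this.deletedOn = deletedOn;
        }
    }

    /** Pages whose addedOn is strictly after givenTime. */
    static Vector<Node> getNewPages(Vector<Node> all, Date givenTime) {
        Vector<Node> result = new Vector<>();
        for (Node node : all) {
            if (node.addedOn.after(givenTime)) {
                result.add(node);
            }
        }
        return result;
    }

    /** Pages marked as deleted strictly after givenTime. */
    static Vector<Node> getDeletedPages(Vector<Node> all, Date givenTime) {
        Vector<Node> result = new Vector<>();
        for (Node node : all) {
            if (node.deletedOn != null && node.deletedOn.after(givenTime)) {
                result.add(node);
            }
        }
        return result;
    }
}
```

A client such as the GUI could remember the time of its last query and pass it as givenTime to fetch only what changed in between.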





Class DocumentServerImpl extends UnicastRemoteObject implements DocumentServer
* Collaborators:
* Responsibilities:
The webpage database stores documents (also sometimes referred to as pages) that have been fetched from the web and have been approved as being of interest to the user. These documents could be webpages or plain text pages.
This class implements the DocumentServer remote interface. The remote object DocumentServerImpl must be registered with a rmiregistry running on a specific port and host. All clients may contact this DocumentServer object by looking up the rmiregistry using Naming.lookup(" ... "). Once a reference to the object has been obtained from the rmiregistry, the client may invoke the remote methods of the DocumentServer object.
Its remote methods include insertNode() which is invoked by the Parser when a page is ready to be added to the webpage database. It also provides the removeNode() method which is invoked by the Purger to mark a page in the webpage database as deleted.
The getRandomDocument() method extracts a random document from the webpage database. This is invoked by the Advisor (see the Advisor overview for more details). The getFavoritePages() method returns the first few pages which have the highest hittage. This is invoked by the Advisor. The getDocumentNode(PageID someID) method uses the pageID to look up a DocumentNode in the webpage database. The getDocumentNode(int fpid) overload returns a vector of DocumentNodes which have all been passed by a particular filter.
This object also has a reference to the ClusterManager object which maintains information regarding the clustered organization of the database.
* Variables and Methods:
private Hashtable documents;
Contains DocumentNode objects; a PageID must be used to hash into it.
private Vector htmlPagesInKS;
Stores a set of references to html pages in KS. This allows a quick search to determine whether a new page being added to KS already exists.
private final static int REGISTRY_PORT = 7000;
this is the port on which the registry runs
private static DocumentServerImpl obj = null;
ClusterManager manager;
public DocumentServerImpl() throws RemoteException
call the constructor of UnicastRemoteObject
public PageID insertNode(PageAttribute pageAttributes) throws RemoteException
Function : inserts a DocumentNode into the DB
Invoked by : parser
Arguments : PageAttribute object
Return value : PageID object
public DocumentNode removeNode(PageID pid) throws RemoteException
Function : removes pages with the specified id
Invoked by : Purger
Arguments : PageID object
Return value : DocumentNode which was removed
public DocumentNode getRandomDocument() throws RemoteException
Function : Returns a random DocumentNode
Invoked by : Advisor
Arguments : none
Return value : a randomly chosen DocumentNode
Special note : Since we can't generate a random PageID instance, we generate a random number and pick the element at the position specified by the random number in an enumeration of the hashtable
public Vector getFavoritePages(int numberOfPages) throws RemoteException
Function : returns numberOfPages pages having the highest hittage
Invoked by : Advisor
Arguments : number of pages to be returned
Return value : Vector containing numberOfPages DocumentNodes
Special note : this is an O(n) search, but it could be improved by storing the hittage values in sorted order
public DocumentNode getDocumentNode(PageID someID) throws RemoteException
Function : Returns a DocumentNode given a pageID
Invoked by : Advisor
Arguments : PageID of document to be retrieved
Return value : DocumentNode
public Vector getDocumentNode(int fpid) throws RemoteException
Function : Returns a vector of DocumentNodes corresponding to documents that were added by the fingerprint function with the given id
Invoked by : Advisor
Arguments : fingerprint id
Return value : Vector of DocumentNodes added by given fingerprintid
public Vector getExemplars() throws RemoteException
Function : Returns a vector of DocumentNodes with the highest hittage and representative of every area of KS
Invoked by : Advisor
Return value : Vector of DocumentNodes
public Vector getChangedPages(Date givenTime) throws RemoteException
Function : Returns a vector of DocumentNodes that have been changed since time t
Invoked by : Interface
Return value : Vector of DocumentNodes
public Vector getNewPages(Date givenTime) throws RemoteException
Function : Returns pages that have been added since time t
Invoked by : Interface
Return value : Vector of DocumentNodes
public Vector getDeletedPages(Date givenTime) throws RemoteException
Function : Returns a vector of DocumentNodes that have been deleted since time t
Invoked by : Interface
Return value : Vector of DocumentNodes
public Vector getAllPages() throws RemoteException
Function : Get all pages in DB
Invoked by : Interface
Return value : Vector of DocumentNodes
public DataInputStream getPage(PageID someID) throws RemoteException
Function : Get reference to HTML object corresponding to the given pageID
Invoked by : Interface
Return value : DataInputStream corresponding to the HTML object
public Vector similarPages(PageID pid, int numberOfPages, int thisAttribute) throws RemoteException
Function : Get "numberOfPages" documents that are similar to document with given pageID
Invoked by : Interface, Advisor
Return value : Vector of DocumentNodes
public Vector similarPages(DocumentNode dnode, int numberOfPages) throws RemoteException
Function : Get "numberOfPages" documents that are similar to given DocumentNode
Invoked by : Advisor
Return value : Vector of DocumentNodes
public ClusterManager getClusterManager()
Function : Serves the client with a copy of the ClusterManager
Invoked by : Advisor
Return value : Copy of the ClusterManager
public void setClusterManager(ClusterManager manager)
Function : Sets the local copy of the ClusterManager
public String printDB() throws RemoteException
Function : Prints all URLs in the KS database





Class Link
* Collaborators:
* Responsibilities:
Maintains the links to other DocumentNodes and the strength of these links.
* Variables and Methods:
PageID id;
int strengthOfLink;
public Link(PageID someID, int strength)
The constructor initializes id and strengthOfLink with the parameters.
public int getStrength()
Returns strengthOfLink for this Link.





Class ListOfLinks
* Collaborators:
* Responsibilities:
Currently under revision.
* Variables and Methods:





Class PageID
* Collaborators:
* Responsibilities:
With numerous pages in the page database, it is essential to identify a particular page. This class helps to identify a particular page in the webpage database. It generates integer IDs starting from zero until MAXINT. It provides methods to compare 2 IDs, get the integer representing the page and to get the last ID that was allotted to any page.
* Variables and Methods:
private static int lastPageID;
private int id;
public PageID()
public int getID()
public boolean equals(PageID otherID)
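The members listed above could fit together roughly as in the sketch below. Several details are assumptions rather than part of the original design: the counter starts at -1 so that the first page receives ID 0, getLastID() is a hypothetical accessor for the last allotted ID, and equals(Object) and hashCode() are added because java.util.Hashtable looks keys up through those two methods rather than through equals(PageID).

```java
// Sketch of PageID under the assumptions stated above.
class PageID {
    // Last ID handed out; -1 means none yet, so the first page gets ID 0.
    private static int lastPageID = -1;
    private final int id;

    public PageID() {
        id = nextID();
    }

    // Synchronized so that concurrent constructors never share an ID.
    private static synchronized int nextID() {
        return ++lastPageID;
    }

    public int getID() {
        return id;
    }

    // Hypothetical accessor for the last ID allotted to any page.
    public static synchronized int getLastID() {
        return lastPageID;
    }

    // The comparison listed in the class description.
    public boolean equals(PageID otherID) {
        return otherID != null && id == otherID.id;
    }

    // Added so that PageID works as a Hashtable key (assumption, see lead-in).
    @Override
    public boolean equals(Object other) {
        return other instanceof PageID && equals((PageID) other);
    }

    @Override
    public int hashCode() {
        return id;
    }
}
```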





Class Similarity extends Hashtable
* Collaborators:
* Responsibilities:
Similarity maintains links between documents within KS. Links are symmetric so if A has a link to B then B has a link to A. Also, the strength of link A-B is the same as link B-A. The links between documents in KS are stored in a hashtable of hashtables. The outer hashtable is indexed by PageID, say pageAID. The hashtable at the index pageAID is a hashtable of Links also indexed by PageID. Here at the index say pageBID is the Link (pageBID, int strengthOfLink).

* Variables and Methods:
void addSimilarity(PageID firstID, PageID secondID, int strength)
Adds two entries to the Similarity hashtable: Link(secondID, strength) at index firstID, and Link(firstID, strength) at index secondID.

public void removeSimilarity(PageID firstID, PageID secondID)
Removes two entries from the Similarity hashtable: the Link to secondID stored at index firstID, and the Link to firstID stored at index secondID.

public int howSimilar(PageID firstID, PageID secondID)
Returns the int strengthOfLink. This should be the same at outer index firstID, inner index secondID or vice versa. Returns -1 if no Link exists.

PageID[] moreSimilarThan(int threshold, PageID someID)
Returns an array of PageIDs whose strengthOfLink with someID is greater than or equal to threshold. If someID is not linked at all, null is returned.
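The hashtable-of-hashtables layout and the four methods above can be illustrated with the following sketch. It simplifies the design in two labeled ways: plain int IDs stand in for PageID, and the inner hashtable stores the strength directly instead of a Link object. The class name SimilaritySketch is hypothetical.

```java
import java.util.Hashtable;
import java.util.Map;
import java.util.Vector;

class SimilaritySketch {
    // Outer key: page ID; inner key: linked page ID; inner value: link strength.
    private final Hashtable<Integer, Hashtable<Integer, Integer>> table = new Hashtable<>();

    // Stores the link in both directions so that A-B and B-A always agree.
    void addSimilarity(int firstID, int secondID, int strength) {
        put(firstID, secondID, strength);
        put(secondID, firstID, strength);
    }

    void removeSimilarity(int firstID, int secondID) {
        remove(firstID, secondID);
        remove(secondID, firstID);
    }

    // Returns the link strength, or -1 if no link exists.
    int howSimilar(int firstID, int secondID) {
        Hashtable<Integer, Integer> links = table.get(firstID);
        if (links == null) {
            return -1;
        }
        Integer strength = links.get(secondID);
        return strength == null ? -1 : strength;
    }

    // Returns IDs linked to someID with strength >= threshold; null if someID has no links.
    int[] moreSimilarThan(int threshold, int someID) {
        Hashtable<Integer, Integer> links = table.get(someID);
        if (links == null) {
            return null;
        }
        Vector<Integer> hits = new Vector<>();
        for (Map.Entry<Integer, Integer> e : links.entrySet()) {
            if (e.getValue() >= threshold) {
                hits.add(e.getKey());
            }
        }
        int[] ids = new int[hits.size()];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = hits.get(i);
        }
        return ids;
    }

    private void put(int from, int to, int strength) {
        table.computeIfAbsent(from, k -> new Hashtable<>()).put(to, strength);
    }

    private void remove(int from, int to) {
        Hashtable<Integer, Integer> links = table.get(from);
        if (links == null) {
            return;
        }
        links.remove(to);
        if (links.isEmpty()) {
            table.remove(from); // keep "not linked at all" detectable
        }
    }
}
```

Dropping an empty inner hashtable after a removal is a design choice of this sketch: it lets moreSimilarThan return null for a page whose last link was removed.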





Class TagParser
* Collaborators:
* Responsibilities:
Most browsers maintain bookmarks, which allow users to save the URLs of pages that they found interesting and would like to revisit. The bookmarks of an organized individual are usually arranged hierarchically, with pages of similar flavour grouped together into folders. The bookmarks could potentially have arbitrary levels of nested folders.
The bookmarks help jumpstart the system's intelligence and are the first big clue it gets about the user's preferences. To get the best results out of the system, we ask that users organize their bookmarks as best as they can. Novice users, however, might not have any bookmarks, in which case the parsing step can be bypassed completely.
The bookmark page, which consists of HTML tags and text, has the following two categories:
* Document Sites
* Folders
The tags to be parsed are the definition list tag <DL> and the definition term tag <DT>. <DL> starts a new level in the hierarchy of entries in the bookmarks organization, </DL> marks the end of this level, and <DT> indicates an entry in that level.
Within the page database, pages are organized into clusters. At the outset, a cluster corresponds to a single folder in the bookmarks page. Also each bookmarked site is placed in exactly one cluster. However, in general, a page may belong to one or more clusters. This class parses the bookmarks and organizes the information in Clusters.
Implementation overview: We maintain a stack that allows us to keep track of the current cluster in the hierarchy of clusters that is being acted upon.
Initialization : create a new cluster and push it onto the stack. While the next token != EOF, parse according to the following rules:
* current tag = HREF tag : a document node is created with the HREF value and added to the cluster at the top of the stack
* previous tag = DT tag && current tag != A tag : add a cluster
* current tag = DL tag : push the most recently added cluster onto the top of the stack
* current tag = /DL tag : pop the cluster stack
* current tag = /A tag && previous token = text : we have a document name
* current tag = end tag && current tag != /A && previous token was text : we have a folder name
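The stack-based rules above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the TagParser code: the class BookmarkSketch, its nested Folder type (a stand-in for a cluster), and the regular-expression tokenizer are all hypothetical, and the input is assumed to follow the common bookmark layout in which a folder appears as <DT><H3>name</H3> followed by <DL>.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BookmarkSketch {
    // Hypothetical stand-in for a cluster.
    static class Folder {
        String name = "";
        final List<Folder> subfolders = new ArrayList<>();
        final List<String[]> pages = new ArrayList<>(); // each entry is {url, title}
    }

    // A token is a tag (group 1 = name, group 2 = attributes) or a run of text (group 3).
    private static final Pattern TOKEN = Pattern.compile("<\\s*(/?\\w+)([^>]*)>|([^<]+)");
    private static final Pattern HREF =
        Pattern.compile("HREF\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static Folder parse(String html) {
        Folder root = new Folder();
        Deque<Folder> stack = new ArrayDeque<>();
        stack.push(root);              // initialization: root cluster on the stack
        Folder lastAdded = root;       // most recently added cluster
        String[] lastPage = null;      // most recently added document
        String prevTag = "";
        String prevText = null;        // non-null only if the previous token was text

        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(3) != null) {  // text token
                String text = m.group(3).trim();
                if (!text.isEmpty()) prevText = text;
                continue;
            }
            String tag = m.group(1).toUpperCase();
            if (tag.equals("A")) {                     // HREF tag: new document
                Matcher h = HREF.matcher(m.group(2));
                lastPage = new String[] { h.find() ? h.group(1) : "", "" };
                stack.peek().pages.add(lastPage);
            } else if (prevTag.equals("DT") && !tag.startsWith("/")) {
                lastAdded = new Folder();              // DT followed by non-A: new cluster
                stack.peek().subfolders.add(lastAdded);
            } else if (tag.equals("DL")) {
                stack.push(lastAdded);                 // descend one level
            } else if (tag.equals("/DL")) {
                if (stack.size() > 1) stack.pop();     // ascend, never dropping the root
            } else if (tag.equals("/A") && prevText != null && lastPage != null) {
                lastPage[1] = prevText;                // document name
            } else if (tag.startsWith("/") && prevText != null) {
                lastAdded.name = prevText;             // folder name
            }
            prevTag = tag;
            prevText = null;
        }
        return root;
    }
}
```

A regular-expression tokenizer is enough for a sketch but is not a general HTML parser; malformed bookmark files may need more careful handling.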

* Variables and Methods:
private final static int REGISTRY_PORT = 7000;
public static void clusterize(InputStream stream, ClusterManager manager, DocumentServer remoteDocumentServerObj)
Function : parses bookmarks and stores into clusters
Arguments : input stream representing the bookmarks, reference to the cluster manager object, reference to the remote DocumentServer object
public static InputStream getStream(String webpage)
Function : returns a FileInputStream corresponding to the given webpage
Arguments : String representing the webpage
Return value : input stream representing the webpage
public static void main(String commandLineArgs[])
Function : gets a reference to the remote DocumentServer object and calls clusterize() with the reference as one of the arguments.
Arguments : takes the bookmarks page name as argument





Class Word
* Collaborators:
* Responsibilities:
Contains the weight and number of occurrences for each word.
* Variables and Methods:
public int indexInDictionary;
This Word's index in the KSDictionary Vector.
int numOccurances;
Vector WordAttributes;
of type WordAttributes with numOccurances elements
int weight;
computed using Vector above





Class WordAttributes
* Collaborators:
* Responsibilities:
The WordAttributes class provides objects to store the possible attributes of a single occurrence of a Word on a page. Each Word may have several occurrences on a page. Each Word has a variable which is a Vector of WordAttributes. For each occurrence of a Word on a page there is a corresponding WordAttributes object stored in the Word's Vector of WordAttributes.
* Variables and Methods:
Currently under revision.





Class WordDistribution
* Collaborators:
* Responsibilities:
This class maintains a SortedList called sortedWordList sorted on their int value indexInDictionary. Each WordDistribution object's sortedWordList will typically correspond to a specific DocumentNode. There is a constructor which instantiates an empty sortedWordList. There are two methods: public int addWord(Word w) and public int compareWordLists(WordDistribution wd) for maintaining and comparing WordDistributions.
* Variables and Methods:
private SortedList sortedWordList;
public int addWord(Word w)
This method performs a sorted insert of the Word w into this sortedWordList. The sorted insert is based on the Word's indexInDictionary returned by getIndex() and is really done by the sortedInsert method of the SortedList class.
The int returned is the index in the keys Vector where w's indexInDictionary was inserted or already exists (No insert is really done if w is already in the sortedWordlist). Also, note that the index in the keys Vector (where w's indexInDictionary goes) equals the index in SortedList (as a Vector) where Word w goes itself.
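The insert-or-find behaviour described for addWord can be sketched on the keys Vector alone. The SortedList class is not shown in this document, so the helper below is a hypothetical stand-in that uses Collections.binarySearch to locate the position; the names SortedKeys and sortedInsert are assumptions.

```java
import java.util.Collections;
import java.util.Vector;

class SortedKeys {
    // Inserts key into the sorted Vector unless it is already present.
    // Returns the index where the key was inserted or already exists.
    static int sortedInsert(Vector<Integer> keys, int key) {
        int pos = Collections.binarySearch(keys, key);
        if (pos >= 0) {
            return pos;               // already present: no insert is done
        }
        int insertAt = -(pos + 1);    // binarySearch encodes the insertion point
        keys.add(insertAt, key);
        return insertAt;
    }
}
```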
public int compareWordLists(WordDistribution wd)
This method compares two WordDistributions by comparing the keys Vector of their sortedWordList's (which is a SortedList). The int returned is the number of keys Vector entries in the longer of the two Vectors which do not match any of the entries in the other keys Vector. Thus a 0 is returned if and only if the two keys Vectors are identical (in length and entries).
Since the keys Vector for a sortedWordList is a Vector of indexInDictionary ints (the unique ints for Words in the KSDictionary Vector) then a greater int return value means a greater difference in the two WordDistributions (at least as far as the Words that they contain in their sortedWordList Vectors). Thus a return value of n always means that the longer of the two WordDistributions has n Words which the shorter WordDistribution does not have.
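Because both keys Vectors are sorted and free of duplicates, the count described above can be computed with a single merge-style pass. The sketch below assumes exactly that and works on plain lists of indexInDictionary values; the class WordListCompare and the method compareKeys are hypothetical names.

```java
import java.util.List;

class WordListCompare {
    // Counts keys in the longer sorted list that are absent from the shorter one.
    // Both lists are assumed sorted ascending with no duplicates.
    static int compareKeys(List<Integer> a, List<Integer> b) {
        List<Integer> longer = a.size() >= b.size() ? a : b;
        List<Integer> shorter = (longer == a) ? b : a;
        int i = 0;
        int j = 0;
        int unmatched = 0;
        while (i < longer.size()) {
            if (j >= shorter.size()) {   // shorter list exhausted
                unmatched++;
                i++;
                continue;
            }
            int cmp = Integer.compare(longer.get(i), shorter.get(j));
            if (cmp == 0) {              // key present in both lists
                i++;
                j++;
            } else if (cmp < 0) {        // key only in the longer list
                unmatched++;
                i++;
            } else {                     // key only in the shorter list
                j++;
            }
        }
        return unmatched;
    }
}
```

For example, comparing keys [1, 2, 3, 4] with [2, 4] gives 2, since 1 and 3 appear only in the longer list.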





Message Interactions

Message interactions with other packages:

* with Advisor: "Coming soon"

* with Filter: "Coming soon"

* with GUI: "Coming soon"

* with MapMaker: "Coming soon"

Message interactions between classes in this package:

* "Coming soon"

