defining clusters


Intertwingularity is not generally acknowledged---people keep pretending they can make things deeply hierarchical, categorizable and sequential when they can't. Everything is deeply intertwingled.
Ted Nelson

The system has to support many operations that are not easily supported if all page processing can only happen a page at a time. For example, it has to let the user:

* search for pages satisfying arbitrary predicates,
* search for pages by example,
* search within a space of collections of pages, and
* organize and reorganize pages and collections of pages.
 
It also has to, if not support, at least not disallow
* email and news handling, and
* any other mass page manipulation we haven't thought of yet.
 
Finally, for its own internal purposes, it has to
* group related pages where the relation defined on the pages is dynamically decided by the functioning of the system itself.

What it needs is some way to group related pages together for simultaneous analysis or display. It needs clusters.

Clusters

A cluster is a collection of related pages and clusters. Pages and clusters are in a cluster if they share some characteristic. That shared characteristic may be that they all have a certain attribute, or that they all have a certain value for a certain attribute, or a certain range of values for some combination of attributes, or some even more general predicate.

Predicates

The predicate that a group of related pages or clusters inside a cluster all satisfy may be simple equality, in which case the relationship between the objects is that they all share the same attribute (or attribute value), but it may just as easily be that their attribute values for a particular attribute are all equal to within a given tolerance---which gives us the usual notion of "cluster" (that is, these things are all "near" each other using some given metric).

The predicate, however, may be even more general: it could be "these things are the result of a search performed by the user", or "these things have been touched once each on each of the past five Tuesdays", or any other arbitrary predicate the system cares to track.

Since clusters can be defined by arbitrary predicates, every subset of pages and clusters could itself be a new cluster, but not all possible clusters are interesting.

Interesting Clusters

It is impossible (and useless) to classify pages in all possible ways, and it is nonsensical (and restrictive) to try to classify them in only one way (which is the traditional "let's build one big hierarchy" approach of today's desktops). Instead, clustering should be dependent on predicates the user finds important, or predicates the system thinks are predictive of what the user thinks important.

Interesting clusters are those that the system creates either for its own internal purposes or to present information to its user as the result of a user-initiated search or other request (for example, mail invocation).

Clusters and Hierarchy

A page may have many attributes, and there will be many ways of classifying it based on those attributes. Pages can be in a particular cluster if they all satisfy some predicate defined on their attributes. Thus, two clusters can share pages in common, yet neither cluster need necessarily contain the other.

Consequently, clusters aren't the same as directories or folders. Clusters are more like "categories" or "kinds": things are in a cluster if they are related in some fashion---they all satisfy some predicate defined on the set of attributes.

Cluster Attributes

Clusters should have attributes, just like pages. Of course, those attributes need not be the same attributes as pages, but some of them will be the same and some will be different.

For example, both clusters and pages should have a "hittage" attribute specifying how hot it is (how often the user looks at it). (Perhaps a cluster's hittage is a weighted average of its pages' hittages.) This is one attribute that both pages and clusters will have in common. Another might be name. A third might be size (where the size of a cluster might be taken as either the number of pages in it or the total size of those pages, or both, or some other measure).

On the other hand, one attribute that pages and clusters can't share is "cohesiveness" (how similar the pages in a cluster are to each other) since a page, by itself, can't be "cohesive". Another "cluster-only" attribute would be what predicate the attributes of the objects inside it satisfy. A third would be the list of pages and clusters inside the cluster.

For a page (or cluster) to end up in a cluster its attributes would have to satisfy some predicate that all the other pages and clusters inside the cluster satisfy, just as to be bound to a simpleton a page's attributes have to satisfy some predicate.

Note that cluster attributes aren't the same as the predicate that the pages and clusters inside the cluster satisfy, although that is a necessary attribute for the cluster to have; nor are they the same as the attributes that the pages inside the cluster share, although that is a reasonable attribute for a cluster to have, too. Cluster attributes are just like page attributes---they're properties of the cluster as a whole.

Ordered Clusters

The enclosed pages and clusters inside a cluster may be ordered in some way (for example, the hotlist cluster might have an ordering imposed on the pages in it; similarly the email cluster could be ordered in several ways, most usually by time) or they may be unordered (which would be the norm). If there is an ordering, the ordered elements should not be aware of it. The ordering should simply be yet another attribute of the cluster that contains them.

Dynamic Clusters

Clusters should be creatable on the fly, either in response to the user's request to do a search, or in response to the system's internal need to organize or reorganize collections of pages. Clusters should be cached and recalled depending on future access to increase responsiveness at the expense of space.

Creating a cluster can be as simple as searching an info pool for pages and clusters that satisfy a given predicate, or it may be a much more complex process involving many simpletons working together to analyze multiple pages to decide which ones should go into the new cluster.

There's never any need for the system to monkey with any old clusters; it could just add new ones. The old useless clusters can die over time if they never find a new use. Clusters may even be merged or split, if necessary, simply by altering the predicate the clusters satisfy. (Although note that these actions may not be trivial to code! They are simple only when the clusters are being merged or split by the user, since to merge two clusters, say, the system would merely have to attach the user-given name for the new cluster to the new cluster, put all the objects previously in the old clusters into the new one, and mark each one as belonging to the new cluster).

Cluster Maintenance

When a cluster is created, all the current pages and clusters whose attributes satisfy its predicate are put into the cluster but that does not account for future pages and clusters. As they are installed into the system they must each be tested to see if they satisfy the predicate of the first mentioned cluster. Thus the more clusters there are, the more effort is involved in adding a new object to the system and maintaining consistency. As the number of clusters grow, so will the number of predicates, and with them will grow the effort to integrate every new object, which likely means that there should be a Predicate object to keep track of and structure all the predicates so that the computational effort to check all the current predicates in the system for each new page or cluster is minimized.

Read-only Attributes

The main attribute any cluster has is the predicate that the attributes of pages and clusters inside the cluster satisfy. This attribute must be made read-only since if a simpleton were to change it, it would change the fundamental identity of the cluster itself and that would lead to chaos.

More generally, this points to the need to have some attributes be read-only, and others be modifiable. Examiner simpletons may have privilege to write (or add) certain attributes, but not others; similarly, clusterer simpletons may have privilege to write (or add) some attributes but not others---one of the writeable ones will be the predicate attribute just alluded to. Once written, it must from then on become read-only to all simpletons.

Similarly, collector simpletons must add certain attributes on fetching a page that only they can know, and so only they can add such attributes (for example the last modification date of a webpage). Once added, those attributes should become read-only for the entire system.

Every simpleton can read any attribute (at least so far; we haven't started to worry about privacy in multi-user systems yet) but not every simpleton can write every attribute.

This means that we must add access controls to page attributes. Various levels of access would be granted to various classes of simpletons (and, perhaps in future) only at various stages of page processing. This access policy would have to be implemented by the pools and might not be hard to do with a global access control list for each pool and checking of access privilege on simpleton registration. The getPage() or getCluster() requests of any simpleton failing the access check would then simply be ignored.

Access Control

Allowing clusters into the system introduces data management and page locking problems. Presently the rule is: "one simpleton, at most one page". No simpleton has access to more than one page at a time and that eliminates the possibility of deadlock and also the possibility of data corruption. Allowing clusters, and then allowing simpletons to bind to clusters, completely breaks this model, and suggests another question: should simpletons be allowed to bind to a cluster and a page, or to multiple non-clustered pages or clusters, simultaneously?

To make clusters safe, we have to reduce the degrees of freedom. We must either lock all the pages in a cluster currently bound to a simpleton against all other simpletons (which will be problematic for very large clusters from an efficiency, a responsiveness, and from a deadlock point of view) or we must control page access on a finer-grained level.

First, instead of allowing simpletons to attach to any number of clusters, let's restrict them to attaching to only one cluster at a time, just as now they can only attach to one page at a time.

This still leaves the potential deadlock and corruption problems of pages or clusters that happen to be in two different clusters that two separate simpletons are currently bound to.

The notion of read-only attributes shows us the way out. To eliminate the sharing problem we allow simpletons that are binding to clusters to acquire access to the pages and clusters inside the cluster, but only for reading. That eliminates the possibility of two simpletons wanting to write to the same page at the same time.

Note that the binding simpleton can still read and write the cluster attributes as much as it wants. Those attributes attach to the cluster as a whole, not the pages and clusters that make up the cluster and, at present, that cluster is bound to the simpleton so no other simpleton currently has write access to the cluster's attributes. (Note though that, as pointed out above, even some of the cluster attributes, like the cluster predicate, for example, may be read-only to the binding simpleton.)

How though to allow simpletons to bind to clusters and write attributes to the pages and clusters in the cluster? One safe way to allow that is to force simpletons to bind to a cluster and extract contained page or cluster information, save that information internally (or write it to a 'sideband page' as described elsewhere), then detach from the cluster, reattach to each of the pages and clusters separately in the normal way, and finally write whatever they need to write to those pages and clusters. Note that the first phase (data acquisition) and the second phase (data alteration) can be done, once again, by two or more separate simpletons.

In this (somewhat roundabout) fashion are we guaranteed freedom from deadlock and data corruption.

Summary

Attributes:
Pages:
* A page can have multiple attributes.
* One particular attribute that a page might have is the clusters it belongs to, if it belongs to any clusters at present.
Clusters:
* A cluster can have multiple attributes just as pages do, however they need not be the same attributes.
* One particular attribute that every cluster must have is the predicate that the attributes of all pages and clusters inside the cluster satisfy.
* Another attribute that every cluster must have is the list of all pages and clusters whose attributes satisfy the cluster's predicate.
* Clusters can be ordered, and if so, that ordering is just another attribute of the cluster.
* Cluster attributes are distinct from the attributes of the pages and clusters they contain, although some of them ("name", for example) may be the same kind of attribute.
Pages and Clusters:
* Pages and clusters in a cluster must share some attribute(s) and there is some predicate that those shared attributes satisfy.
* Some (or all) attributes of either clusters or pages can be read-only.
Containment:
Pages:
* A page can be contained in multiple clusters.
* A page can be contained in no cluster.
Clusters:
* A cluster can be contained in multiple clusters.
* A cluster can be contained in no cluster.
* A cluster can contain multiple clusters and pages.
* A cluster can be empty.
* Two clusters can share pages without either cluster containing or being contained by the other cluster.
Binding:
Pages:
* A simpleton can attach to only one page at a time.
* A simpleton can attach to a page depending on whether the page's attributes satisfy a given predicate.
* A simpleton bound to a page can read and write the page's attributes.
Clusters:
* A simpleton can attach to only one cluster at a time.
* A simpleton can attach to a cluster depending on whether the cluster's attributes satisfy a given predicate.
* A simpleton bound to a cluster can read and write the cluster's attributes and can read the attributes of the pages and clusters in the cluster, but cannot write the attributes of the pages and clusters in the cluster.
Creation and Maintenance:
* Clusters can be created on the fly by searching some info pool.
* Clusters are created based on the user's demonstrated needs.
* Clusters are dynamically maintained as new objects enter the system.
* Clusters might (eventually) be mergeable or splittable.
last | | to sitemap | | up one level | | next