20.2 Comparison between CYBERMAP and Cybertree

The CYBERMAP algorithm discussed in chapter 17 offers considerable flexibility concerning the total number of hyperdrawers and the maximum number of nodes in a hyperdrawer. It is straightforward to reduce the number of hyperdrawers by merging them. Alternatively, one can remove some hyperdrawers and redistribute their contents throughout the rest of the document on a node-by-node basis. It is also easy to subdivide a hyperdrawer by calling the clustering procedure recursively within it, which yields a hierarchical separation into sub-hyperdrawers. But all this flexibility has its price: it takes a supercomputer such as a Connection Machine about 3 to 30 minutes to compute the CYBERMAP of a moderately large document collection [23].
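The recursive subdivision of a hyperdrawer can be pictured with a minimal sketch. It does not reproduce the CYBERMAP clustering procedure of chapter 17: the names subdivide and split_in_half are hypothetical, and split_in_half is only a placeholder for whatever clustering procedure is actually applied inside a hyperdrawer.

```python
def subdivide(nodes, cluster, max_nodes):
    """Recursively split a hyperdrawer until each part holds at most max_nodes.

    nodes     -- list of node identifiers in the hyperdrawer
    cluster   -- hypothetical clustering procedure: takes a list of nodes and
                 returns a list of node lists (an assumption, not CYBERMAP's)
    max_nodes -- maximum number of nodes allowed in one hyperdrawer
    A leaf is returned as a flat list of nodes, a subdivided hyperdrawer as a
    list of sub-hyperdrawers.
    """
    if len(nodes) <= max_nodes:
        return list(nodes)
    # Drop empty parts so that a degenerate split cannot recurse forever.
    parts = [part for part in cluster(nodes) if part]
    if len(parts) < 2:
        return list(nodes)  # the procedure could not split any further
    return [subdivide(part, cluster, max_nodes) for part in parts]


def split_in_half(nodes):
    """Placeholder clustering procedure: cut the node list in two halves."""
    mid = len(nodes) // 2
    return [nodes[:mid], nodes[mid:]]


hierarchy = subdivide(list(range(5)), split_in_half, max_nodes=2)
print(hierarchy)
```

The point of the sketch is only that the same procedure is reused at every level, so the depth of the hierarchy follows from the size limit rather than being fixed in advance.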

The Kruskal-based cybertree algorithm does not offer this flexibility. It is conceptually simple and computationally cheap once the similarity list is available. But the only way to reduce or increase the number of clusters is by merging or splitting trees. That means that there is only one way of arranging the inner structure of a cluster. In particular, there is no easy way of reducing the total number of local trees by distributing their elements among other clusters, a procedure that can easily be done with CYBERMAP. On the other hand, the Kruskal-based algorithm generates a graphical layout of the nodes and clusters them in a single pass. With CYBERMAP, figuring out the graphical layout demands an additional computation step that can be done, for example, with the Kruskal-based algorithm.
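The following minimal sketch illustrates why a Kruskal-style procedure yields clusters and tree edges in the same pass. It is an illustration under stated assumptions, not the cybertree implementation itself: the function name kruskal_forest, the (similarity, node, node) tuple format of the similarity list, and the use of a similarity threshold as stopping criterion are all assumptions made for the example.

```python
def kruskal_forest(n_nodes, similarities, threshold):
    """Build a forest of local trees from a similarity list (Kruskal-style).

    similarities -- list of (similarity, node_a, node_b) tuples (assumed format)
    threshold    -- links weaker than this are ignored (assumed stopping rule)
    Returns the accepted tree edges and the resulting clusters.
    """
    parent = list(range(n_nodes))  # union-find forest, one set per node

    def find(x):
        # Follow parent links to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    # Visit the strongest links first, as Kruskal's algorithm does.
    for sim, a, b in sorted(similarities, reverse=True):
        if sim < threshold:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # accept the link only if it joins two trees
            parent[root_a] = root_b
            edges.append((a, b, sim))

    # Each remaining tree is one cluster.
    clusters = {}
    for node in range(n_nodes):
        clusters.setdefault(find(node), []).append(node)
    return edges, sorted(clusters.values())


similarity_list = [(0.9, 0, 1), (0.8, 1, 2), (0.3, 2, 3), (0.7, 3, 4), (0.2, 0, 4)]
edges, clusters = kruskal_forest(5, similarity_list, threshold=0.5)
print(clusters)  # [[0, 1, 2], [3, 4]]
```

In this sketch, lowering the threshold to 0.25 accepts the link between nodes 2 and 3 and merges the two trees into one cluster. The membership of a cluster can change only through such merging or splitting of whole trees, which is the limitation described above.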

Thus, CYBERMAP offers more flexibility concerning the clustering of nodes and therefore generally results in better and more meaningful clusters. On the other hand, it is much more computationally intensive and, unless it runs on a supercomputer such as a Connection Machine CM-2, only works for small document sets. The Kruskal-based algorithm offers a computationally cheaper way to get a first clustering approximation and delivers a tree representation of the nodes in the same step. It can also be used for quickly arranging a tree of hyperdrawers.

The Scatter/Gather clustering algorithm used for the CYBERMAP web version offers the best of both worlds: a Silicon Graphics Indy workstation running Java needs about 5 minutes to compute the CYBERMAP of a web site with a few hundred documents. We use a combination of Salton's algorithm and Scatter/Gather for clustering (see section 18.3), allow hierarchical nesting of hyperdrawers as described in chapter 19, and employ the cybertree algorithm to lay out the hyperdrawer structure on the screen.