19.1 Hierarchical Clustering Algorithms
This section describes the hierarchical CYBERMAP clustering algorithm. The algorithm starts with the similarities between nodes as defined in the previous section. For clustering, it uses the notion of centroid or average keyword vector (see chapter 1):
Definition of centroid: The centroid C of a cluster is the vector C = (c1, c2, c3, ..., cn), where ci is the average of all weights dik for keyword Ti over all nodes nk in the cluster.
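The definition can be illustrated with a minimal Python sketch. It assumes that each node's keyword vector is stored as a dictionary mapping keywords to weights, and that a keyword missing from a node has weight 0; both the representation and the function name `centroid` are assumptions made for illustration, not part of CYBERMAP.

```python
from typing import Dict, List

# Assumed representation: keyword Ti -> weight dik for one node nk.
KeywordVector = Dict[str, float]


def centroid(vectors: List[KeywordVector]) -> KeywordVector:
    """Return the average keyword vector of a cluster.

    Each component ci is the mean of the weights for keyword Ti
    over all nodes in the cluster; absent keywords count as 0.
    """
    if not vectors:
        raise ValueError("a cluster must contain at least one node")
    keywords = set().union(*vectors)  # all keywords seen in the cluster
    return {
        t: sum(v.get(t, 0.0) for v in vectors) / len(vectors)
        for t in keywords
    }
```

For a cluster of two nodes with vectors {"a": 1.0, "b": 2.0} and {"a": 3.0}, this yields a centroid with weight 2.0 for "a" and 1.0 for "b".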
The algorithm now works as follows:
algorithm ClusterHierarchical MaxNumOfClusters, maxNumOfNodes
    Put each node into a separate cluster;
    Repeat while the actual number of clusters is larger than MaxNumOfClusters and there are clusters that contain more nodes than maxNumOfNodes
        Repeat for all clusters
            If current cluster contains more nodes than maxNumOfNodes then
                ClusterHierarchical MaxNumOfClusters, maxNumOfNodes;
            else
                Repeat for all nodes within current cluster
                    Add node n to the cluster that has the highest similarity to n, based on the centroid C of the cluster, i.e., recompute the similarities between n and all clusters;
    Take precautions against strong fluctuations in the number of words per node:
        (1) define a threshold value for very large nodes, and start by clustering the smallest nodes first;
        (2) normalize the weights of the keyword vector with respect to the number of words;
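The innermost step (assigning a node to the most similar cluster by centroid) and precaution (2) can be sketched in Python as follows. This is only an illustration under stated assumptions: cosine similarity stands in for the similarity measure defined in the previous section, keyword vectors are dictionaries, and the names `normalize`, `cosine`, `centroid`, and `best_cluster` are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]  # assumed: keyword -> weight


def normalize(weights: Vector, word_count: int) -> Vector:
    """Precaution (2): scale keyword weights by the node's word count."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return {t: w / word_count for t, w in weights.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; an assumed stand-in for the section's measure."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    """Average keyword vector of a non-empty cluster."""
    keywords = set().union(*vectors)
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors)
            for t in keywords}


def best_cluster(node: Vector, clusters: List[List[Vector]]) -> int:
    """Index of the cluster whose centroid is most similar to the node."""
    return max(range(len(clusters)),
               key=lambda i: cosine(node, centroid(clusters[i])))
```

Precaution (1) could be handled in the same spirit by sorting the nodes by word count before the assignment loop, so that the smallest nodes are clustered first.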
The output of this algorithm is a tree whose parent-child relation represents "parent-cluster contains child-clusters", together with the similarities between the tree-nodes. The leaves of the tree are the single nodes. The algorithm allows one to specify in advance the number of hyperdrawers per CYBERMAP and the number of nodes per hyperdrawer. If, during the execution of the algorithm, there are too many hyperdrawers in a CYBERMAP, the CYBERMAP splits into multiple, lower-level CYBERMAPs. Similarly, if there are too many nodes in a hyperdrawer, that hyperdrawer is made a CYBERMAP, and its nodes are split into lower-level hyperdrawers.
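One possible way to represent such an output tree is sketched below in Python. The class `TreeNode`, its fields, and the helper `needs_split` are hypothetical; in particular, storing similarities as a dictionary keyed by sibling label is an assumption, since the text does not specify how the similarities between tree-nodes are returned.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TreeNode:
    """A cluster (inner node) or a single hypertext node (leaf)."""
    label: str
    children: List["TreeNode"] = field(default_factory=list)
    # Assumed layout: similarity to each sibling, keyed by sibling label.
    similarity: Dict[str, float] = field(default_factory=dict)

    def is_leaf(self) -> bool:
        return not self.children

    def leaves(self) -> List[str]:
        """Labels of all single nodes contained in this cluster."""
        if self.is_leaf():
            return [self.label]
        return [leaf for child in self.children for leaf in child.leaves()]


def needs_split(node: TreeNode, max_children: int) -> bool:
    """True if the node holds more children than the configured limit."""
    return len(node.children) > max_children
```

With this layout, a hyperdrawer that exceeds the configured number of nodes would be detected by `needs_split` and could then be replaced by a subtree of lower-level hyperdrawers.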
Unlike the original, non-hierarchical CYBERMAP, this algorithm allows the user to control the number of hyperdrawers at the highest hierarchy level, while still clustering the most related nodes at the lower level.