19.1 Hierarchical Clustering Algorithms

This section describes the hierarchical CYBERMAP clustering algorithm. The algorithm starts from the similarities between nodes as defined in the previous section. For clustering, it uses the notion of the centroid, or average keyword vector, of a cluster (see Chapter 1):

Definition of centroid

The centroid C of a cluster is the vector C = (c1, c2, c3, ..., cn), where ci is the average of all weights dik for keyword Ti over all nodes nk in the cluster.
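
As an illustration of this definition, here is a minimal Python sketch. It assumes each node is represented as a plain list of keyword weights of equal length; the function name `centroid` and the sample data are hypothetical and not part of the original system.

```python
def centroid(cluster):
    """Return the component-wise average of the nodes' keyword-weight vectors.

    `cluster` is a non-empty list of equal-length weight vectors,
    one vector per node (an assumed representation).
    """
    if not cluster:
        raise ValueError("cluster must contain at least one node")
    # zip(*cluster) groups the weights for keyword Ti across all nodes nk.
    return [sum(weights) / len(cluster) for weights in zip(*cluster)]


# Two nodes, three keywords.
cluster = [[1.0, 0.0, 2.0],
           [3.0, 2.0, 0.0]]
c = centroid(cluster)  # [2.0, 1.0, 1.0]
```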

The algorithm now works as follows:
  algorithm ClusterHierarchical MaxNumOfClusters, maxNumOfNodes
  Put each node into a separate cluster;

  Repeat while the actual number of clusters is larger than MaxNumOfClusters
  and there are clusters that contain more nodes than maxNumOfNodes

    Repeat for all clusters
      If current cluster contains more nodes than maxNumOfNodes

        then
          ClusterHierarchical MaxNumOfClusters, maxNumOfNodes
          (applied recursively to the nodes of the current cluster);

        else
          Repeat for all nodes within current cluster

             Add node n to the cluster that has the highest similarity 
             to n, based on the centroid C of the cluster, i.e., recompute
             the similarities between n and all clusters;

                Take precautions against strong fluctuations in the
                number of words per node:

                   (1) define a threshold value for very large nodes, and
                       start by clustering the smallest nodes first;

                   (2) normalize the weights of the keyword vector with
                       respect to the number of words.
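
The innermost step of the algorithm (assigning a node to the cluster with the most similar centroid) can be sketched in Python as follows. This is only a sketch under stated assumptions: the similarity measure is the one defined in the previous section, which is not reproduced here, so cosine similarity is used as a stand-in. All function names and the sample data are hypothetical. Precaution (2) is modeled as dividing each weight by the node's word count, and precaution (1) as visiting nodes in order of increasing word count; the threshold for very large nodes is left out because the text does not say how such nodes are treated.

```python
import math


def normalize(weights, num_words):
    # Precaution (2): scale keyword weights by the node's word count.
    return [w / num_words for w in weights]


def centroid(vectors):
    # Component-wise average of the vectors in one cluster.
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a, b):
    # Stand-in similarity measure (an assumption, see lead-in).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_cluster(vec, clusters):
    # Index of the cluster whose current centroid is most similar to vec.
    return max(range(len(clusters)),
               key=lambda i: cosine(vec, centroid(clusters[i])))


# Each node: (number of words, raw keyword weights).
nodes = [(200, [40.0, 160.0]), (10, [8.0, 2.0])]
nodes.sort(key=lambda node: node[0])  # Precaution (1): smallest first.

# Two existing clusters, each seeded with one normalized vector.
clusters = [[[1.0, 0.0]], [[0.0, 1.0]]]
for num_words, weights in nodes:
    vec = normalize(weights, num_words)
    # Centroids are recomputed on every call, so each assignment
    # sees the effect of the previous ones.
    clusters[best_cluster(vec, clusters)].append(vec)
```

Without the normalization, the 200-word node would carry weights twenty times larger than the 10-word node and would dominate any centroid it joins; after normalization both nodes contribute on the same scale.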

The output of this algorithm is a tree whose parent-child relation represents "parent cluster contains child clusters", together with the similarities between the tree nodes. The leaves of the tree are the single nodes. The algorithm allows one to specify in advance the number of hyperdrawers per CYBERMAP and the number of nodes per hyperdrawer. If, during the execution of the algorithm, there are too many hyperdrawers in a CYBERMAP, the CYBERMAP is split into multiple lower-level CYBERMAPs. Similarly, if there are too many nodes in a hyperdrawer, that hyperdrawer is made a CYBERMAP, and its nodes are split into lower-level hyperdrawers.
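
One possible in-memory shape for such an output tree is sketched below in Python. The class name `TreeNode`, its fields, and the sample values are hypothetical. In particular, storing one similarity value per child (its similarity to the parent cluster) is an assumption, since the text does not specify between which pairs of tree nodes the similarities are reported.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    label: str
    children: List["TreeNode"] = field(default_factory=list)
    # Assumed convention: similarity of this tree node to its parent cluster.
    similarity_to_parent: Optional[float] = None

    def is_leaf(self):
        # Leaves are the single (document) nodes.
        return not self.children

    def leaves(self):
        # Collect the labels of all single nodes below this tree node.
        if self.is_leaf():
            return [self.label]
        labels = []
        for child in self.children:
            labels.extend(child.leaves())
        return labels


# A top-level CYBERMAP with two hyperdrawers (illustrative values only).
root = TreeNode("map", [
    TreeNode("drawer A", [TreeNode("n1", similarity_to_parent=0.9),
                          TreeNode("n2", similarity_to_parent=0.7)], 0.4),
    TreeNode("drawer B", [TreeNode("n3", similarity_to_parent=0.8)], 0.3),
])
all_nodes = root.leaves()  # ["n1", "n2", "n3"]
```

In this representation, a hyperdrawer that is made a CYBERMAP simply becomes an inner tree node whose children are the new lower-level hyperdrawers.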

Unlike the original, non-hierarchical CYBERMAP, this algorithm allows the user to control the number of hyperdrawers at the highest hierarchy level, while still clustering the most closely related nodes at the lower levels.