1.5 Document Clustering
Definition of centroid
The vector space model described so far has one main disadvantage: the information pertaining to a particular document is distributed among many different inverted term lists. Furthermore, documents containing similar information are not stored in physical proximity. If browsing is to be supported, however, such documents should be stored close together. The basic idea is therefore to cluster related documents. Normally, a hierarchy of clusters is constructed in which the leaves are the actual documents and virtual nodes on the higher levels group related documents. Each virtual node is represented by a centroid, or average term vector (Figure I.3).
Figure I.3 Cluster hierarchy with centroids
The centroid may actually be the term vector of one of the documents in the cluster it represents, but generally it is computed by some statistical method so that it lies in the center of the cluster. One obvious way is to average each vector element over all documents in the cluster.
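As an illustration of the averaging approach, the following minimal Python sketch computes a centroid as the element-wise arithmetic mean of the term vectors in a cluster. The function name `centroid` and the sample term weights are assumptions made for this example; other statistical methods could be substituted.

```python
from typing import List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Return the element-wise arithmetic mean of equal-length term vectors."""
    if not vectors:
        raise ValueError("a cluster must contain at least one document")
    n = len(vectors)
    # zip(*vectors) groups the weights of one term across all documents.
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical term weights for three documents over a four-term vocabulary.
cluster = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 1.0],
    [2.0, 3.0, 1.0, 2.0],
]
print(centroid(cluster))  # [2.0, 1.0, 1.0, 1.0]
```

The result need not coincide with any document in the cluster; it is simply the point around which the cluster's term vectors are distributed.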
Simple clustering is normally done using some variant of the following method [Sal89: 329]:
1. Compute all document similarity coefficients (n(n-1)/2 coefficients for n documents).
2. Put each of the n documents into its own cluster.
3. Create a new cluster by combining the two most similar clusters Ci and Cj: delete the rows and columns for i and j in the similarity matrix, recompute the centroid for the newly created cluster, and update the similarity matrix.
4. Repeat step 3 while more than one cluster remains.
Once the cluster hierarchy has been created, it is relatively cheap to process a query by searching the cluster tree, either top down or bottom up.
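The merging method listed above can be sketched in Python as follows. This is one possible variant, not a definitive implementation: the source does not fix a similarity measure, so cosine similarity is assumed here, the centroid is taken to be the element-wise mean, and the names `cosine`, `centroid`, and `cluster_hierarchy` as well as the sample vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity (an assumed choice of similarity coefficient)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Element-wise mean of the member term vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_hierarchy(docs):
    """Build a cluster tree; leaves are document indices, inner nodes are pairs."""
    if not docs:
        raise ValueError("at least one document is required")
    # Pairwise similarities: n*(n-1)/2 entries, keyed by cluster-id pair.
    sim = {(i, j): cosine(docs[i], docs[j])
           for i, j in combinations(range(len(docs)), 2)}
    # Each document starts in its own cluster: id -> (tree, members, centroid).
    clusters = {i: (i, [v], v) for i, v in enumerate(docs)}
    next_id = len(docs)
    # Keep merging while more than one cluster remains.
    while len(clusters) > 1:
        i, j = max(sim, key=sim.get)  # the two most similar clusters
        tree_i, members_i, _ = clusters.pop(i)
        tree_j, members_j, _ = clusters.pop(j)
        # Delete the rows and columns for i and j.
        sim = {pair: s for pair, s in sim.items()
               if i not in pair and j not in pair}
        # Recompute the centroid and update the similarity matrix.
        members = members_i + members_j
        new_centroid = centroid(members)
        for k, (_, _, centroid_k) in clusters.items():
            sim[(k, next_id)] = cosine(centroid_k, new_centroid)
        clusters[next_id] = ((tree_i, tree_j), members, new_centroid)
        next_id += 1
    (tree, _, _), = clusters.values()
    return tree


# Hypothetical two-term vectors: documents 0 and 1 are alike, as are 2 and 3.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
print(cluster_hierarchy(docs))  # ((0, 1), (2, 3))
```

In this sketch the similarity matrix is kept as a dictionary keyed by cluster-id pairs, so deleting rows and columns amounts to filtering out every pair that mentions a merged cluster. A production variant would likely store each node's centroid in the tree so that queries can be compared against it during a top-down search.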