17.6 Computing the Similarity Between Nodes
This section describes in greater detail the computation of the similarity measure between nodes that was sketched in the previous section. The similarity measure is the basis of the CYBERMAP structure. For every possible pair of nodes (i.e. n(n-1)/2 combinations, where n is the number of nodes in the document) we try to compute a number that is proportional to the relatedness of the two nodes.
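Because every unordered pair of nodes is compared, the number of similarity values grows quadratically with the number of nodes: n nodes yield n(n-1)/2 pairs. The following is a minimal Python sketch of that enumeration. The node count of 7 and the integer node identifiers are assumptions chosen purely for illustration.

```python
from itertools import combinations
from math import comb

n = 7  # assumed number of nodes, for illustration only
node_ids = range(1, n + 1)

# Unordered pairs of distinct nodes: (1, 2), (1, 3), ..., (6, 7).
pairs = list(combinations(node_ids, 2))

# The pair count equals the binomial coefficient C(n, 2) = n(n-1)/2.
assert len(pairs) == comb(n, 2) == n * (n - 1) // 2
print(len(pairs))  # prints 21
```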
The initial CYBERMAP implementation uses an approach described by Salton and Buckley [Sal89b] for the computation of the similarity between nodes, based on a keyword index of the whole document. We may either use an index that is available from an external source or generate one with automatic indexing techniques (see chapter 1).
The subsequent computation of the similarities between nodes happens in five main steps (see figure I.92 for an actual example):
- Count the number of occurrences of each keyword in each node (node-frequency).
Example: the keyword "human-computer interaction" in figure I.92 has node-frequency 1 in nodes 1 and 2 and node-frequency 2 in node 3.
- Count the number of nodes in which each keyword occurs, i.e. its occurrences in the whole document with each keyword counted only once per node (document-frequency).
Example: the keyword "human-computer interaction" in figure I.92 has document-frequency 3.
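- Compute the weighted keyword vectors of each node (column 2 in figure I.92). The weight of each keyword is calculated from its node-frequency, its inverse document-frequency, and the total number of nodes in the document (section 1.3)[20]. The weight of keyword Ti in node k is:
$$d_{ik} = f_{ik} \cdot \log_2 \frac{N}{n_i}$$
where fik is the node-frequency of Ti in node k, ni is the document-frequency of Ti, and N is the total number of nodes in the document. The logarithm is taken to base 2, which is the base that reproduces the values in the example.
Example: the keyword "human-computer interaction" in figure I.92 has weight 1*log2(7/3) = 1.222392 in nodes 1 and 2 and weight 2*log2(7/3) = 2.444785 in node 3.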
- Assign a keyword vector Dk = (d1k, d2k, d3k,...dnk) to every node nk, where dik represents the weight of keyword Ti assigned to node nk.
- Compute the similarity between the nodes using the keyword vectors Dk. The similarity between two nodes is based on the similarity between the corresponding keyword vectors [Sal89b]. It is defined as the inner product of the two vectors:
$$\mathrm{sim}(D_k, D_l) = \sum_{i=1}^{n} d_{ik} \, d_{il}$$
Example: the similarity between node 2 and node 3 in figure I.92 is:
1.222392*2.444785+1.807355*1.807355 = 6.255018.
Note that in figure I.92 the similarities are already sorted and listed in decreasing order of similarity.
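The five steps above can be sketched in Python as follows. This is an illustrative sketch, not the CYBERMAP implementation: the function names, the sparse dictionary representation of the keyword vectors, and the sample data are all assumptions. In particular, the keyword "keyword-b" and the empty nodes 4 to 7 are hypothetical, chosen only so that the weights and the similarity between nodes 2 and 3 match the numbers quoted in the examples, with a base-2 logarithm.

```python
import math
from collections import Counter
from itertools import combinations


def keyword_weights(node_keywords):
    """Map each node to its weighted keyword vector (a sparse dict).

    node_keywords: dict of node id -> list of keyword occurrences in that node.
    """
    n_nodes = len(node_keywords)
    # Step 1: node-frequency of each keyword in each node.
    node_freq = {node: Counter(kws) for node, kws in node_keywords.items()}
    # Step 2: document-frequency, counting each keyword once per node.
    doc_freq = Counter()
    for counts in node_freq.values():
        doc_freq.update(counts.keys())
    # Steps 3 and 4: weight = node-frequency * log2(N / document-frequency).
    return {
        node: {kw: f * math.log2(n_nodes / doc_freq[kw]) for kw, f in counts.items()}
        for node, counts in node_freq.items()
    }


def similarity(vec_a, vec_b):
    """Step 5: inner product; keywords missing from a vector have weight 0."""
    return sum(w * vec_b[kw] for kw, w in vec_a.items() if kw in vec_b)


def ranked_similarities(vectors):
    """All unordered node pairs, sorted by decreasing similarity."""
    pairs = [
        ((a, b), similarity(vectors[a], vectors[b]))
        for a, b in combinations(sorted(vectors), 2)
    ]
    return sorted(pairs, key=lambda item: item[1], reverse=True)


# Hypothetical sample data: 7 nodes, two keywords.
HCI = "human-computer interaction"
sample = {
    1: [HCI],
    2: [HCI, "keyword-b"],
    3: [HCI, HCI, "keyword-b"],
    4: [], 5: [], 6: [], 7: [],
}

vectors = keyword_weights(sample)
ranked = ranked_similarities(vectors)
print(f"{vectors[1][HCI]:.6f}")               # 1.222392
print(f"{similarity(vectors[2], vectors[3]):.6f}")  # 6.255018
print(ranked[0][0])                           # (2, 3)
```

Storing each vector as a dictionary keeps only the keywords that actually occur in a node, so the inner product touches just the keywords two nodes share. A dense array per node would give the same result and may be preferable when the keyword index is small.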
Figure I.92 Example of computing similarities