20.3 Cybertree Examples
In this chapter we show four empirical examples of the Kruskal-based algorithm. These examples illustrate possible applications of cybertrees, which are practical only for smaller document collections and are therefore of limited use for the web. Nevertheless, cybertrees offer a valuable tool for gaining an initial overview of, and a navigation instrument for, unknown territory.
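The Kruskal-based algorithm itself is defined earlier in the book and is not repeated here. As a minimal sketch of its core idea, assuming each node is reduced to a set of index keywords and assuming a Jaccard overlap measure (the book's actual similarity measure may differ), a maximum-similarity spanning tree over document nodes can be built as follows; the node identifiers and keyword sets below are purely illustrative:

```python
from itertools import combinations

def jaccard(a, b):
    """Similarity of two keyword sets (assumed measure; the book may use another)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def kruskal_tree(nodes):
    """Build a maximum-similarity spanning tree over document nodes.

    nodes: dict mapping node id -> set of index keywords.
    Returns a list of (similarity, id1, id2) tree edges.
    """
    # Kruskal's order: consider all pairwise edges by decreasing similarity.
    edges = sorted(
        ((jaccard(nodes[u], nodes[v]), u, v) for u, v in combinations(nodes, 2)),
        reverse=True,
    )
    # Union-find to reject edges that would close a cycle.
    parent = {u: u for u in nodes}
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for sim, u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((sim, u, v))
    return tree

# Hypothetical toy data, loosely echoing the dinosaur example below.
docs = {
    "n1": {"dinosaur", "carnivore", "jurassic"},
    "n2": {"dinosaur", "carnivore", "cretaceous"},
    "n3": {"dinosaur", "herbivore", "jurassic"},
    "n4": {"marine", "reptile", "dolphin"},
}
for sim, u, v in kruskal_tree(docs):
    print(f"{u} -- {v}  (sim {sim:.2f})")
```

The spanning tree links the two carnivore nodes and the two Jurassic nodes with high similarity, while the marine-reptile node is attached only by a zero-similarity edge, foreshadowing the "entries that could not be grouped tightly" seen in the examples.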
The accuracy of the algorithm has been measured manually by comparing the number of correctly placed nodes with the number of nodes that are misplaced from a human viewpoint.
Example 1 - A homogeneous collection of data nodes about dinosaurs:
Each node contains some keywords out of a limited vocabulary describing the particularities of the dinosaur pictured on that node (see also the guided tour in section 18.5). In addition, there are three data nodes containing free-text background information about dinosaurs.
Example 2 - A computer science books database:
Each entry was indexed manually by a librarian. For the computation of the cybertrees, only these keywords have been used, i.e., for the purposes of the algorithm, each node consists only of a short list of keywords.
Example 3 - A free-text collection of e-mail messages:
These messages all come from a mailing list discussing how to connect the Mattel PowerGlove to a personal computer. In this example all words except stopwords (a, an, by, of, the, ...) have been indexed.
Example 4 - Political news messages:
This example consists of a one-day sampling of political news messages with indexing over all words except stopwords.
In the first example, our database consists of a homogeneous collection of nodes, where each node describes one dinosaur. In this homogeneous case, the algorithm works extremely well. Figure I.119 contains the local cybertrees of the dinosaur database. It has an accuracy of 95%, meaning that all but 2 out of 42 nodes have been put in meaningful places and that the clustering into groups of nodes has been meaningful. For example, the third local cybertree (with root "Mixosaurus Icthy") contains three nodes about dolphin-like beasts.
Figure I.119 Local cybertrees of dinosaur database
Figure I.120 contains the global cybertree resulting from the execution of algorithm CREATE-GLOBAL-TREE on the tree in figure I.119. It exhibits the same accuracy as the local cybertrees (95%) and gives a meaningful overview of all data contained in the dinosaur database. It shows us that there is an introductory node ("CDID18052") and an overview node (titled "destination"), that there are four main groups of dinosaurs: "Mixosaurus Icthy" (dolphin-like beasts), "Pteranodon" (flying dinosaurs), "Allosaurus" (meat eaters and rhinoceros-like beasts of the same time period), and "Stegosaurus" (big herbivores), and that there are some entries that could not be grouped tightly, like "coiled belemnite" (a snail-like animal).
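Algorithm CREATE-GLOBAL-TREE is defined elsewhere in the book and its details are not reproduced here. Purely as a hypothetical illustration of the kind of step involved, one way to link local cybertrees into a single global overview is to represent each local tree by the union of its keywords and greedily attach each root to its most similar already-placed root; the function name, the greedy strategy, and the keyword sets below are all assumptions, not the book's algorithm:

```python
def jaccard(a, b):
    """Assumed keyword-set similarity; the book may use another measure."""
    return len(a & b) / len(a | b) if a | b else 0.0

def link_local_trees(roots):
    """Hypothetical sketch, NOT the book's CREATE-GLOBAL-TREE: connect
    local-tree roots into one global tree by greedily attaching each root
    to its most similar already-placed root.

    roots: dict root id -> union of keywords in that local tree.
    Returns a list of (child_root, parent_root) links.
    """
    order = sorted(roots)          # deterministic processing order
    placed = [order[0]]
    links = []
    for r in order[1:]:
        best = max(placed, key=lambda p: jaccard(roots[r], roots[p]))
        links.append((r, best))
        placed.append(r)
    return links

# Illustrative keyword unions for the four dinosaur groups named above.
groups = {
    "Mixosaurus":  {"marine", "dolphin", "reptile"},
    "Pteranodon":  {"flying", "reptile", "wing"},
    "Allosaurus":  {"carnivore", "jurassic", "predator"},
    "Stegosaurus": {"herbivore", "jurassic", "plates"},
}
print(link_local_trees(groups))
```

With four local trees this produces three links, so all roots end up connected in one global structure.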
Figure I.120 Global cybertree of dinosaur database
The next example (figure I.121) shows some of the local cybertrees of the catalogue of the MIT computer science library. Here the tree algorithm uses the index generated manually by the librarians. We get an accuracy of 100%, meaning that each node in the trees has been placed meaningfully. Of course, this example is somewhat artificial, because the tree generation was based on only a few keywords per node (2-5) out of a well-defined set. It is a proof by example of our Kruskal-based algorithm, but there are few real-world examples that exhibit the characteristics of such a small, consistent, and carefully chosen keyword index.
Figure I.121 Local cybertrees for manually indexed computer science database
The algorithm does not work as well for large, unordered document sets. One interfering factor is big differences in node size, i.e., document collections where some nodes contain many more words than others[24]. Another distorting factor is that the same word can have different meanings depending on its context.
The next example (figure I.122) is based on a document set that exhibits both of the above-mentioned distorting characteristics. The document set consists of a collection of e-mail messages from a mailing list discussing how to connect the Mattel PowerGlove to a personal computer. The nodes are of strongly varying size, from 44 to 2810 words, and contain e-mail headers, plain English text, and program fragments in C or assembler language. Due to the informal nature of e-mail, words have been used very inconsistently. Also, replies sometimes contain the full text of the original message. Still, approximately 50% of the 224 nodes have been placed meaningfully. Unfortunately, this makes the cybertree useless in this case, because readers have to expect that a link they follow leads half the time to a conceptually unrelated node.
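The distorting effect of strongly varying node size can be made concrete with a small numeric sketch, again assuming a Jaccard-style overlap measure (an assumption, since the book's exact measure is defined elsewhere). Even when a reply quotes a short message in full, the overlap score between the two is tiny:

```python
def jaccard(a, b):
    """Assumed keyword-set similarity measure."""
    return len(a & b) / len(a | b) if a | b else 0.0

# A short note and a long reply that quotes it in full plus much new text;
# the sizes echo the 44- to 2810-word spread reported above.
small = {f"w{i}" for i in range(44)}
large = small | {f"x{i}" for i in range(2766)}   # full quote + 2766 new words

print(f"Jaccard(small, large) = {jaccard(small, large):.3f}")
# -> Jaccard(small, large) = 0.016
# Every word of the small node occurs in the large one, yet the score is
# tiny, so a size-sensitive measure can fail to link a message to its reply.
```

This is one plausible mechanism behind the poor 50% accuracy: the similarity of a small node to a huge node is bounded by the ratio of their sizes, regardless of how related their contents are.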
Figure I.122 Global cybertrees for collection of e-mail messages
The last example is based on a document collection that exhibits, at least at first glance, characteristics similar to the previous example. The document set consists of 51 news messages covering the political world news of one day from the US perspective, varying in node size from 77 to 953 words. Surprisingly, we get here an accuracy of 98%, i.e., only one node has been vastly misplaced in the cybertree[25]. We explain this surprisingly good result of our algorithm by the narrow and well-defined problem domain, in which each word basically has one meaning. Although the articles were written by different authors, they all had similar objectives and used a common vocabulary and format.
Figure I.123 Local cybertrees for collection of political news messages
Based on our limited empirical experiments, we conclude that the Kruskal-based algorithm is quite stable against varying node size. An overly large node is located closer to the root of a tree, but frequently this behavior is desired. On the other hand, the algorithm is very sensitive to the consistent use of words, as exemplified by the poor quality of the e-mail messages cybertree.
These cybertree examples nicely illustrate problems that are inherent in text-based clustering, such as inconsistent word usage, the problem of mixing different languages, and a sensitivity to unbalanced node sizes. If those problems can be excluded (and there are many document collections that are of the required homogeneous nature), cybertrees offer a quick approximation of the more computationally expensive CYBERMAP.