18.1 Implementing CYBERMAP on the Connection Machine
We ported the algorithm to the CM-2 supercomputer, which allows us to generate CYBERMAPs of large documents (a few thousand nodes with a few hundred words each) in about 3 to 30 minutes.
The CM-2 is a massively parallel computer made by Thinking Machines Corporation [Hil87]. Unlike other machines, such as its successor the CM-5, the CM-2 is a SIMD (single instruction, multiple data) machine: the same instruction is executed in parallel on thousands of data objects (as many as there are processors in the CM-2). The CM-2 is a fine-grained parallel computer with a maximum configuration of 65536 processing elements. Each processing element contains 4096 bits of memory and a 1-bit-wide ALU (arithmetic logic unit). All elements are connected via a 12-dimensional hypercube network with 16 processing elements on each vertex; any processor can communicate with any other processor over this network. The CM-2 uses a parallel disk array called the Data Vault(TM) mass storage system. Each Data Vault unit provides up to 40 GB of storage with a transfer rate of 25 MB per second. Data Vault files are vector-structured: each location in the file stores one byte/word from each processing element in the machine.
Owing to the drastic differences between the Macintosh and the Connection Machine, we decided to start the CM-2 implementation from scratch. As implementation language we chose C*, the parallel version of C on the Connection Machine, which adds constructs for exploiting the data-parallel features of the CM-2.
To address the problem of nonuniformity in the length of words, we decided to abstract the words away by giving each unique (i.e., lexicographically distinct) word in the database a numeric key, or ID code. The determination of these word IDs would be the first step in processing any document, and all other structures, such as the index, would use these IDs to refer to words.
Abstracting words into numeric codes provides several advantages. The codes are uniform in size and, for large databases, on average slightly more compact than the full text of the words. They can easily be compared and manipulated with fast integer operations, so tasks such as determining whether two words are identical reduce to comparing numbers rather than full strings. Also, by carefully choosing the word-ID-generating function, one can encode additional properties in the word ID. For instance, by sorting all the words and using each word's rank in the sorted order as its ID, one can determine the dictionary order of two words by comparing their IDs. Further, these numeric IDs form natural indices into a vector, easing the construction of the centroid vectors needed by the clustering algorithms.
The textset is a C* structure used to store information on a hyperdocument. It contains several parts: a linked list of units, the discrete pieces of the document (normally the nodes of the hyperdocument); a dynamic set of all the words contained in the document, called the wordspace; and an index of the document, recording which words appear in which units. The functions of the textset are to keep track of which units belong to the document it represents, to track any changes to the set of units, and to ensure that the wordspace and index remain consistent with the contents of the set of units. Thus, the textset makes sure that the wordspace contains all the words in the document, adding words when new units are added. Then, using the word IDs generated by the wordspace, the index is built.
This method of using abstract codes for words has the drawback that a document must essentially be processed twice: once to compute all of the word IDs, and again to compile the index information using the word IDs as keys. In the current implementation, the wordspace has no convenient provision for deleting words, so removing a word entails completely rebuilding the wordspace. The index likewise has no provision for incremental updates, which means that most changes to the wordspace require the index to be rebuilt. We made no effort to improve this part of the system, as the speed of the Connection Machine allows us to rebuild the textset of a quarter-megabyte document in about thirty seconds [21]. As the development is still at the prototype stage, emphasis has been placed on developing a complete, working system that is effective enough for the task, yet sufficiently flexible to withstand continuous development and many incremental extensions. While performance is a consideration in developing the algorithms, the development has sought not to buy performance at the cost of maintainable and extensible code.
Although the Connection Machine implementation is a powerful tool, it was ultimately not the right direction for CYBERMAP. The implementation exhibits several faults that are fundamental to the CM-2 system. The many variations in the textual data, i.e., in word and node sizes, demand careful consideration, as the CM-2's SIMD architecture favors very uniform data. While the power of the CM-2 hides many of these problems, we have found documents that make the system work inefficiently because the granularity of the problem defies an efficient mapping to the architecture. This problem, however, could be overcome through further development. One problem that could not be overcome was access to the Connection Machine itself: the CM-2 was a scarce, non-shareable resource in high demand. When Thinking Machines Corporation left the hardware business altogether in 1995, we found reliance on a dedicated special-purpose parallel computer too restricting. We therefore decided to develop the next version using C for indexing, and Java for clustering and the graphical user interface.