17.5 Analysis of Structure and Contents of the Document
The preprocessing phase has two main steps. An additional, ancillary initial preprocessing step consists of an overall analysis of the document structure. This ancillary step gives the knowledgeable user the ability to modify the default settings for the two following main steps:
- generation of the index
- generation of the CYBERMAP structure
If the ancillary step is skipped, the two main steps are executed using default settings. The ancillary step allows the user to prestructure the nodes and to override the default settings for finding sensible names to represent the nodes in the CYBERMAP.
Generation of the Index
In the first main preprocessing step the system looks for an existing index of the document. If there is no index, the system builds one using simple automatic indexing techniques as described in chapter 1. The most recent CYBERMAP implementation uses the SWISH system described in chapter 7 to build the index. The index is used afterwards for:
- The automatic generation of a structure graph of the document based on the occurrence of index terms in the nodes of the document.
- The generation of a list of index entries from which users may select the terms most related to their interests. We call this list of index entries an interest list.
Generation of the CYBERMAP Structure
The second preprocessing step is the most important: here the CYBERMAP structure is generated, based on the index. To find a suitable map structure, a two-step process is used:
- Find the hyperdrawers:
In this step the nodes are clustered (Figure I.91).
Figure I.91 Relation "nodes in document - hyperdrawers in CYBERMAP"
- Similar nodes are identified using the index by assigning a keyword vector $D_i = (d_{1i}, d_{2i}, d_{3i}, \ldots, d_{ni})$ to every node $n_i$, where $d_{ki}$ represents the weight of keyword $T_k$ assigned to node $n_i$. The weight for each keyword is computed based on the keyword frequency and the [...] of the keyword. The similarity between two nodes is based on the similarity between the corresponding keyword vectors [Sal89b]. It is defined as an inner product (dot product):
$$\mathrm{sim}(D_i, D_j) = \sum_{k=1}^{n} d_{ki} \, d_{kj}$$
- Finally, the keyword used most frequently across all nodes of a particular hyperdrawer is selected as the hyperdrawer label. This process is described in full detail in the next section.
- Find the relations between the hyperdrawers in the CYBERMAP:
Drawing these structural links is not mandatory, but it helps the reader to get a better overview of the structure of the document. All possible links between hyperdrawers are calculated first, using the similarity measure defined in step one. Afterwards the most related hyperdrawers are connected by links. The number of links is determined using a dynamically adjustable threshold value.
Once all structure links are calculated, the hyperdrawers for the nodes and the links between the hyperdrawers are drawn. To achieve a better hyperdrawer-link layout, the most frequently linked hyperdrawers are drawn as roots of a tree at the top of the screen. Less frequently linked hyperdrawers are arranged farther down.
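The two-step process above can be illustrated with a minimal Python sketch. It is not the CYBERMAP implementation. It assumes that the keyword weights and the assignment of nodes to hyperdrawers are already given (the clustering itself is covered in the next section), and that a hyperdrawer's vector is the sum of its nodes' keyword vectors, which the text does not specify. The label rule is read here as "the keyword occurring in the most nodes", with ties broken alphabetically. All function names, the sample data, and the threshold value are hypothetical.

```python
from collections import Counter
from itertools import combinations


def dot(u, v):
    """Inner product of two sparse keyword vectors (dict: keyword -> weight)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(k, 0.0) for k, w in u.items())


def drawer_vector(nodes, node_vectors):
    """Assumption: a hyperdrawer's vector is the sum of its nodes' vectors."""
    total = Counter()
    for node in nodes:
        total.update(node_vectors[node])  # adds weights keyword by keyword
    return dict(total)


def drawer_label(nodes, node_vectors):
    """Keyword occurring in the most nodes; ties are broken alphabetically."""
    counts = Counter(k for node in nodes for k in node_vectors[node])
    return min(counts, key=lambda k: (-counts[k], k))


def structure_links(drawers, node_vectors, threshold):
    """Compute all pairwise similarities, keep those at or above the threshold."""
    vectors = {name: drawer_vector(nodes, node_vectors)
               for name, nodes in drawers.items()}
    links = []
    for a, b in combinations(sorted(vectors), 2):
        similarity = dot(vectors[a], vectors[b])
        if similarity >= threshold:
            links.append((a, b, similarity))
    return links


def layout_order(drawers, links):
    """Most frequently linked hyperdrawers first (drawn nearest the top)."""
    degree = Counter({name: 0 for name in drawers})
    for a, b, _ in links:
        degree[a] += 1
        degree[b] += 1
    return sorted(drawers, key=lambda name: (-degree[name], name))


# Hypothetical sample data: four nodes with weighted keywords, three hyperdrawers.
node_vectors = {
    "n1": {"index": 2.0, "map": 1.0},
    "n2": {"index": 1.0, "link": 1.0},
    "n3": {"map": 2.0, "layout": 1.0},
    "n4": {"layout": 1.0, "link": 3.0},
}
drawers = {"A": ["n1", "n2"], "B": ["n3"], "C": ["n4"]}

links = structure_links(drawers, node_vectors, threshold=2.0)
order = layout_order(drawers, links)
labels = {name: drawer_label(nodes, node_vectors)
          for name, nodes in drawers.items()}
```

Raising the threshold removes weaker links and lowering it adds them, which is one way to read the "dynamically adjustable threshold value" mentioned above. Sorting by link count only yields a top-to-bottom ordering. An actual tree layout would need an additional drawing step.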