1.1 Classification of Retrieval Techniques
This section briefly reviews a classification of text retrieval techniques that has been proposed by Nicholas Belkin and Bruce Croft [Bel87] (Fig. I.1).
Figure I.1 Classification of retrieval techniques

On the highest level, Belkin and Croft distinguish between exact and partial match techniques. Exact match techniques are currently in use in most conventional IR systems. Queries are usually formulated as Boolean expressions, and the search pattern within the query has to match the text representation of the document to be retrieved exactly.
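The exact match behavior described above can be illustrated with a minimal sketch. It assumes a purely conjunctive (AND) query and whitespace tokenization; the function name `boolean_and_match` and the sample documents are hypothetical and serve only as illustration.

```python
def boolean_and_match(query_terms, documents):
    """Return the ids of documents that contain every query term (Boolean AND).

    Assumption: documents are plain strings tokenized on whitespace;
    a document either matches the query completely or is not retrieved.
    """
    wanted = {term.lower() for term in query_terms}
    hits = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())
        if wanted <= tokens:  # every query term occurs in the document
            hits.append(doc_id)
    return hits


# Hypothetical mini-collection.
docs = {
    "d1": "boolean retrieval of text documents",
    "d2": "vector space retrieval",
    "d3": "text clustering",
}
print(boolean_and_match(["retrieval", "text"], docs))
```

Note that document d2 is not retrieved although it shares one term with the query; an exact match technique produces no ranking and no partial hits.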
Within the partial match techniques there are many different variants. Individual techniques search single document nodes without considering the document collection as a whole. For the feature-based techniques, documents are represented by sets of features or index terms. The index can either be defined manually or be computed automatically. The most prominent representative of this category is the vector space model, which is based on a formal model of document retrieval and indexing (see the section about the vector space model later in this chapter). In the vector space model, each document is represented by an index vector containing a set of weighted terms. For each query, documents are ranked in decreasing order of similarity to the query. The probabilistic approach is similar to the vector space model: the basic goal is to retrieve documents in order of their probability of relevance to the query. The probabilistic retrieval model was first developed by Steve Robertson and Karen Sparck Jones in the 1970s. In the probabilistic model, in contrast to a Boolean retrieval system, a query typed by a user is treated as an unstructured list of words or phrases. These terms are then matched against the documents in the database. Some authors have also suggested a fuzzy set approach for feature-based formal IR systems. In contrast to these formal feature-based approaches, a number of ad hoc similarity measures have also been proposed. Most of the feature-based approaches share the problem that small differences in weights can lead to significant differences in results.
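The ranking step of the vector space model mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: raw term frequencies serve as term weights and the cosine of the angle between vectors serves as the similarity measure. Neither choice is prescribed by the text above, and the names `term_vector`, `cosine`, and `rank` are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a sparse index vector: term -> weight (assumed: raw term frequency)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, doc_id) pairs in decreasing order of similarity to the query."""
    q = term_vector(query)
    scored = [(cosine(q, term_vector(text)), doc_id)
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical mini-collection.
docs = {
    "d1": "text retrieval text",
    "d2": "image retrieval",
    "d3": "graph theory",
}
for score, doc_id in rank("text retrieval", docs):
    print(doc_id, round(score, 3))
```

Unlike an exact match, every document receives a score, so a document that shares only some terms with the query still appears in the ranked list, only further down.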
For structure-based techniques, documents are represented by a more complex structure than the set of index terms used for the feature-based techniques. It is theoretically possible to represent the contents of the document collections in formal logic. Systems in this category could, e.g., use rules to describe how relevant fragments of a document are related to the query (see the RUBRIC system). Instead of using logic to represent the contents of a document, the document's contents can also be described as a graph where the nodes and edges of the graph stand for the ideas and relationships contained in the document.
With network-based methods, the set of all documents and their relationships is used to find the most relevant documents with respect to a query. The most prominent method is clustering, where the most similar documents are clustered together and all documents are grouped into a cluster hierarchy until a ranked list of lowest-level clusters is produced. If the documents are represented as a network of nodes, the user can also browse through the network with system assistance. Through dialog with the user, the system can use the network to build a model of the user. Based on the user model, a model of the user's information needs can be constructed, including relevant documents found during the browsing session. Spreading activation is similar to browsing in that, starting from a start node, other nodes connected to that node are activated. Activated nodes then propagate their activation further through the network.
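Spreading activation as described above can be sketched as follows. This is a minimal illustration, not a specific published algorithm: the decay factor, the cut-off threshold, and the rule that a node keeps the strongest activation it receives are all assumptions introduced here, and the function name `spread_activation` is hypothetical.

```python
def spread_activation(graph, start, decay=0.5, threshold=0.1):
    """Propagate activation from a start node through a document network.

    graph:     dict mapping each node to a list of connected nodes
    decay:     assumed fraction of activation passed on per link
    threshold: assumed minimum activation that is still propagated
    """
    activation = {start: 1.0}
    frontier = [start]
    while frontier:
        next_frontier = []
        for node in frontier:
            passed = activation[node] * decay
            if passed < threshold:
                continue  # too weak to spread any further
            for neighbor in graph.get(node, []):
                # Assumption: a node keeps the strongest activation it receives.
                if passed > activation.get(neighbor, 0.0):
                    activation[neighbor] = passed
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return activation


# Hypothetical document network.
network = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d"],
    "d": ["e"],
    "e": [],
}
print(spread_activation(network, "a"))
```

Nodes closer to the start node end up with higher activation, which can then be read as a rough indication of relatedness to the starting document.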
After this brief tour d'horizon of existing text retrieval methods, we will now give a more fundamental overview of conventional methods for automatic document and text retrieval. Sections 1.2 to 1.5 will give a systematic introduction to feature-based IR. Similar mechanisms are used by our own system, CYBERMAP, to be presented later in part I. Section 1.6 will present a typical example of a structure-based IR system. Finally, section 1.7 will bridge the gap to hypermedia.