1.3 Automatic Indexing

As we have seen in the previous section, the careful selection of index terms is of utmost importance for successful retrieval. Indexing can be done either manually, by choosing appropriate index terms for each document by hand, or automatically. The goal is to find identifiers that describe the contents of a document as closely as possible. If manual indexing is done by a human expert using a limited vocabulary, a "...considerable degree of indexing uniformity can be achieved, and high-quality is sometimes obtained..." [Sal89, p. 276]. Unfortunately, manual indexing is an extremely tedious, time-consuming, and error-prone process. It is thus highly desirable to use automatic indexing. Because of the complexity of indexing, any automatic indexing operation will necessarily perform imperfectly. Nevertheless, retrieval results obtained with advanced automatic indexing techniques are generally superior to those obtained in systems based on manually constructed indices, due to the uniformity of the automatically generated index.

The most important parameter for measuring the success of a query is the precision (P), i.e., the proportion of retrieved material that is relevant [Sal89, p. 278].
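
As a small illustration of this definition, precision can be computed from the set of retrieved documents and the set of relevant documents. The function name `precision` and the document identifiers below are assumptions made for this sketch; they do not come from [Sal89].

```python
def precision(retrieved, relevant):
    """Proportion of retrieved items that are relevant (0.0 if nothing was retrieved)."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical query result: four documents retrieved, two of them relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d7"]
p = precision(retrieved_docs, relevant_docs)
print(p)  # 0.5
```

Note that the relevant document "d7", which was not retrieved, does not affect the precision value.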

Indexing can be divided into single-term indexing, where each index entry consists of a single term, and indexing with terms in context, where each index entry is composed of a group of terms. The latter is much more complex to handle than single-term indexing vocabularies. In the rest of this introduction we will limit the discussion to single-term indexing.

[Formula: definition of term frequency]

To find appropriate weights for a term, the term frequency, i.e., the number of occurrences of a term in the document, is commonly used. Unfortunately, it is not sufficient to count only the number of occurrences of a term in a document and, e.g., simply group together the documents with the most occurrences of the word "apple", because the precision requirement also demands terms with a high discrimination value. In fact, precision is better served by terms that occur rarely in the document collection as a whole. The best discrimination is achieved by terms that occur frequently in one document but rarely in the rest of the collection. A useful term weight for a term Tk in document Di is, e.g., defined as the term frequency multiplied by the inverse collection frequency, i.e., a factor that becomes smaller the more documents of the collection contain the term [Jon72]. The two sources of weighting data are therefore:

Collection frequency
Terms which occur in only a few documents are likely to be more useful than ones occurring in many.
Term frequency
The more frequently a term appears in a document, the more important it is likely to be for that document.

[Formula: definition of collection frequency]

This means that the more often a term occurs in one document, and the less frequent it is in the whole document collection, the better its discrimination value [1].
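
One common way to write such a weight is sketched below. The exact form is an assumption chosen for illustration, not necessarily the formula used in [Sal89] or [Jon72]: tf_ik denotes the number of occurrences of term Tk in document Di, N the number of documents in the collection, and n_k the number of documents that contain Tk.

```latex
% Assumed illustrative form: term frequency times a logarithmic
% inverse collection frequency factor.
w_{ik} = \mathit{tf}_{ik} \cdot \log_{10} \frac{N}{n_k}

% Worked example with N = 1000 documents and tf_{ik} = 5:
%   term contained in n_k = 10 documents:
%     w_{ik} = 5 \cdot \log_{10}(1000 / 10) = 5 \cdot 2 = 10
%   term contained in all n_k = 1000 documents:
%     w_{ik} = 5 \cdot \log_{10}(1000 / 1000) = 5 \cdot 0 = 0
```

Under this assumed form, a term that appears in every document receives a weight of zero no matter how often it occurs, which matches the intuition that such a term cannot discriminate between documents.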

In [Sal89], Salton defines a blueprint for automatic indexing:

  1. After having identified the individual words in a document collection, use a stop word list to eliminate the words that produce only noise and have no discrimination value (and, but, or, the, etc.).
  2. Use automatic suffix stripping to reduce the words to their stems (use only "philosoph*" instead of philosopher, philosophic, philosophical, philosophically, etc.).
  3. For each remaining word stem Tk (= term) in each document, compute its term weight wik using the formula discussed above, i.e., the term frequency multiplied by the inverse collection frequency.
  4. Represent each document by the vector containing the set of terms together with the corresponding weights: Di = (T1, wi1; T2, wi2; T3, wi3; ...).
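
The four steps above can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not taken from [Sal89]: the stop word list and the suffix list are toy examples (a real system would use a full stop list and a proper stemming algorithm), the weight is taken to be the term frequency times log(N / n_k) with the natural logarithm, and the names `STOP_WORDS`, `SUFFIXES`, `strip_suffix`, and `index_collection` are hypothetical.

```python
import math
import re
from collections import Counter

# Toy stop word list (assumption; real lists contain hundreds of words).
STOP_WORDS = {"and", "but", "or", "the", "a", "of", "is", "in"}

# Toy suffix list, longest suffixes first (assumption; not a real stemmer).
SUFFIXES = ("ically", "ical", "ic", "er", "y", "s")


def strip_suffix(word):
    """Step 2: remove the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_collection(docs):
    """Return {doc_id: {stem: weight}} for a {doc_id: text} collection."""
    # Steps 1 and 2: split into words, drop stop words, reduce to stems.
    stems_per_doc = {}
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        stems_per_doc[doc_id] = [strip_suffix(w) for w in words if w not in STOP_WORDS]

    # Count, for each stem, the number of documents that contain it.
    doc_freq = Counter()
    for stems in stems_per_doc.values():
        doc_freq.update(set(stems))

    # Steps 3 and 4: weight each stem and collect the weights per document.
    n_docs = len(docs)
    vectors = {}
    for doc_id, stems in stems_per_doc.items():
        term_freq = Counter(stems)
        vectors[doc_id] = {
            term: tf * math.log(n_docs / doc_freq[term])
            for term, tf in term_freq.items()
        }
    return vectors


# Hypothetical three-document collection.
docs = {
    "D1": "The philosopher and the philosophical argument",
    "D2": "The argument of the apple",
    "D3": "Apples and apples",
}
vectors = index_collection(docs)
```

In this toy collection, "philosopher" and "philosophical" collapse into the single stem "philosoph", which occurs twice in D1 and in no other document. It therefore receives the highest weight in D1, while "argument", which also appears in D2, is weighted lower.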

Salton [Sal91] describes some experiments that use content structure to improve text retrieval. He also stresses the problem that a word can have different meanings depending on its context, and that simple retrieval based on pattern matching therefore works only in very limited cases. He suggests including the node length in the similarity computations to give each node an equal chance of being retrieved.
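
One widely used way to take length into account is to divide the similarity by the lengths of the term vectors involved (cosine normalization). Whether this is the exact scheme used in [Sal91] is not stated here, so the following is only a sketch under that assumption. The function name `cosine_similarity` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Dot product of two {term: weight} vectors divided by the product of their lengths."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


query = {"apple": 1.0}
short_node = {"apple": 1.0, "pear": 1.0}
long_node = {"apple": 10.0, "pear": 10.0}  # same mix of terms, ten times the weights

sim_short = cosine_similarity(query, short_node)
sim_long = cosine_similarity(query, long_node)
```

Without the division by the vector lengths, the long node would score ten times higher than the short one purely because of its size. With the normalization, both nodes obtain the same similarity to the query.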