1.2 Text Retrieval Using Inverted Indexing

As noted by Salton in [Sal89], the retrieval of stored documents in answer to an information request is based on determining the similarity between the query and the stored documents (Fig. I.2).


Figure I.2 Concept of text retrieval

Each stored document is represented by a set of index terms, sometimes called a term vector. Terms can either be unweighted, or each term can have a weight attached to reflect its relative importance. To speed up access to the documents, an inverted index can be created for each term in the index set; it lists the addresses of all documents containing that particular term.
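The inverted index described above can be sketched as follows. This is a minimal illustration, not an implementation taken from [Sal89]: the function name build_inverted_index, the sample documents, and the use of integer document ids in place of document addresses are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each index term to the sorted list of ids of documents containing it.

    `documents` maps a document id (standing in for a document address)
    to the list of index terms assigned to that document.
    """
    postings = defaultdict(set)
    for doc_id, terms in documents.items():
        for term in terms:
            postings[term].add(doc_id)
    # Sorted lists make the postings deterministic and easy to merge.
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample collection of three documents.
documents = {
    1: ["text", "retrieval", "index"],
    2: ["inverted", "index"],
    3: ["text", "similarity"],
}

index = build_inverted_index(documents)
print(index["index"])  # [1, 2]
print(index["text"])   # [1, 3]
```

Looking up a term is then a single dictionary access rather than a scan over all stored documents, which is the speed-up the inverted index is meant to provide.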

The addition of weights to document and query terms makes it possible to distinguish between terms that are more important for retrieval purposes and terms that are less important. A particular document would then be represented, e.g., as Di = (T1i, 0.2; T2i, 0.6; T3i, 0.1), meaning that term 2 of document i (T2i) has an importance weight of 0.6 while term 3 has a much smaller importance weight of 0.1. A possible weighting strategy is described in the next section.
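To make the weighted representation concrete, the sketch below stores the example document Di as a mapping from term to weight and scores it against a weighted query. The inner product used as the similarity measure is only one simple choice and is an assumption of this example; the text does not specify which measure is used. The query weights and the names d_i, query, and inner_product are likewise hypothetical.

```python
def inner_product(query, document):
    """Sum the products of the weights of terms shared by query and document.

    Terms absent from the document contribute nothing to the score.
    """
    return sum(weight * document.get(term, 0.0) for term, weight in query.items())


# The example document Di from the text, as term -> importance weight.
d_i = {"T1": 0.2, "T2": 0.6, "T3": 0.1}

# Hypothetical weighted query; "T4" does not occur in the document.
query = {"T2": 1.0, "T3": 0.5, "T4": 0.9}

score = inner_product(query, d_i)  # 1.0 * 0.6 + 0.5 * 0.1 + 0.9 * 0.0
print(round(score, 2))  # 0.65
```

With unweighted terms, every weight would be 1 and the same function would simply count the terms shared by the query and the document.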

An obvious way to get more broadly applicable terms is to use term truncation. For example, the expression "philosoph*" represents the whole list of philosopher, philosophic, philosophical, philosophically, philosophize, philosophizer, philosophizing, philosophy, etc. Various term truncation methods can be used, including suffix and prefix truncation. For a more in-depth treatment of this subject see [Sal89: 240ff].
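A truncated expression can be expanded against the vocabulary of index terms, as in the following sketch. It assumes that "*" marks the truncated end of the expression, so that "philosoph*" is a suffix truncation and a leading "*" is a prefix truncation; the function name expand_truncated and the sample vocabulary are hypothetical.

```python
def expand_truncated(pattern, vocabulary):
    """Return the sorted vocabulary terms matched by a truncated expression.

    A trailing "*" (suffix truncation) matches any term starting with the stem;
    a leading "*" (prefix truncation) matches any term ending with the rest.
    Without "*", only an exact match is returned.
    """
    if pattern.endswith("*"):
        stem = pattern[:-1]
        return sorted(term for term in vocabulary if term.startswith(stem))
    if pattern.startswith("*"):
        ending = pattern[1:]
        return sorted(term for term in vocabulary if term.endswith(ending))
    return [pattern] if pattern in vocabulary else []


# Hypothetical vocabulary of index terms.
vocabulary = {"philosopher", "philosophy", "philosophize", "physics", "biology"}

print(expand_truncated("philosoph*", vocabulary))
# ['philosopher', 'philosophize', 'philosophy']
print(expand_truncated("*logy", vocabulary))
# ['biology']
```

Each expanded term can then be looked up in the inverted index, and the resulting document lists merged, so a single truncated expression retrieves documents indexed under any of its variants.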