1.6 Rule-based Expert Systems for Information Retrieval

In the classification of IR techniques introduced in section 1.1, we distinguished between structure-based and feature-based (or index-based) IR systems. The previous sections discussed only feature-based systems. We now present a classical representative of a structure-based system. In contrast to feature-based systems, structure-based IR systems demand structured knowledge about the document collection to be searched. The basic concepts described in this section are very similar to those of the CYC system presented later in chapter XX. Although the example outlined here is somewhat dated, it nicely illustrates the strengths and weaknesses of this approach.

The RUBRIC system [McC83][Ton85] performs rule-based reasoning about the knowledge base to be searched. This means that each document needs to have some rules attached that describe its contents. Each rule in RUBRIC also carries a probability value that states the degree to which the rule applies to a particular subject area. A sample rule for terrorism would look like:

if (the story contains the literal string "bomb"),
    then (it is about an explosive_device)
         to degree 0.6;
but if also (it mentions a boxing_match),
    then (reduce the strength of the conclusion)
         to degree 0.3;
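
The sample rule can be paraphrased in a few lines of Python. This is only a minimal sketch of the rule's logic, not RUBRIC's actual rule language or implementation. The function names are hypothetical, and the test for a boxing match is reduced to a simple keyword check, whereas in RUBRIC boxing_match would itself be a concept defined by further rules.

```python
def mentions_boxing_match(text: str) -> bool:
    # Hypothetical stand-in: a real system would evaluate further rules
    # for the concept boxing_match instead of a single keyword test.
    return "boxing" in text


def explosive_device_degree(story: str) -> float:
    """Degree to which the story is about an explosive_device."""
    text = story.lower()
    if "bomb" not in text:
        return 0.0  # the rule does not fire at all
    if mentions_boxing_match(text):
        return 0.3  # modifier clause: reduced strength of the conclusion
    return 0.6      # main clause of the rule


if __name__ == "__main__":
    print(explosive_device_degree("A bomb exploded near the embassy."))
    print(explosive_device_degree("The boxing champion was called the Brown Bomber."))
```

The second story contains the string "bomb" inside "Bomber" but also mentions boxing, so the rule yields the reduced degree 0.3 rather than 0.6.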

To find a story about "Violent Acts of Terrorism", RUBRIC tries to build an internal representation of what a story about this subject could look like. Four elements should be included in such a prototype, as reflected in Fig. I.4: an actual violent event, a terrorist actor, the effect of the event having occurred, and the reason for the event.


Figure I.4 Rule-evaluation tree for "Terrorist Query"

Each edge in the tree has an attached relevance value, such that the intermediate topics and keyword expressions contribute according to their relevance value to the actual concept at the root of the tree. Unlabeled edges have an implicit relevance value of 1.
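
The propagation of relevance values through such a tree can be sketched as follows. The text above does not specify how the contributions of several children are combined, so this sketch makes two assumptions: a child's value is multiplied by the relevance value of its edge, and a node takes the maximum over its weighted children. The class Node, the function evaluate, the small example tree, and all of its weights are invented for illustration and are not taken from Fig. I.4.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    keyword: Optional[str] = None  # set only for keyword leaves
    children: List[Tuple["Node", float]] = field(default_factory=list)

    def add(self, child: "Node", weight: float = 1.0) -> "Node":
        # Unlabeled edges get the implicit relevance value 1.
        self.children.append((child, weight))
        return self


def evaluate(node: Node, text: str) -> float:
    """Relevance of the text to the concept at this node, in [0, 1]."""
    if node.keyword is not None:
        return 1.0 if node.keyword in text.lower() else 0.0
    # Assumption: scale each child by its edge weight, combine with max.
    return max((w * evaluate(c, text) for c, w in node.children), default=0.0)


# Illustrative tree with made-up weights.
event = Node("violent_event")
event.add(Node("bombing", keyword="bomb"), 0.8)
event.add(Node("shooting", keyword="shooting"), 0.6)
actor = Node("terrorist_actor")
actor.add(Node("terrorist", keyword="terrorist"))  # unlabeled edge
terrorism = Node("terrorism")
terrorism.add(event, 0.7)
terrorism.add(actor, 0.9)

if __name__ == "__main__":
    print(evaluate(terrorism, "A bomb went off in the station."))
    print(evaluate(terrorism, "The terrorist escaped."))
```

Other combination functions, for example a minimum over children that must all be present, could be substituted in evaluate without changing the tree structure.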

Systems like RUBRIC allow one to build a close model of the real world and thus support high-precision queries. The big drawback of these systems is the need for a rule-based description of the documents to be searched. In the prototypical system described here, the rule base was constructed manually. Until accurate rule-based descriptions of documents can be computed automatically, the practical use of such systems will remain very restricted. Although the concepts of rule-based IR look very interesting, there are currently no large-scale commercial systems based on this approach. This might eventually change when the huge rule base of the CYC system, described later in this book, becomes more widely available. In the meantime, most commercial IR systems are based on combinations of the vector space and probabilistic approaches.