7.4 Harvest
http://harvest.cs.colorado.edu/
Harvest is an integrated set of tools for gathering, extracting, organizing, searching, caching, and replicating relevant information across the Internet. It is currently being developed at the University of Colorado at Boulder, and the software is placed in the public domain. It combines a robot that gathers information, an indexer that builds a searchable index from what was gathered, and a search engine that queries the index and returns the results to the Internet user. Harvest consists of the following subsystems:
- Gatherer
- The Harvest Gatherer is a robot that collects information in a form optimized for indexing. The Gatherer scans objects periodically, maintains a cache of indexing information, and allows a provider's indexing information to be retrieved in a single stream (rather than through a separate request for each object).
- Broker
- The Broker provides an indexed query interface to gathered information. Brokers retrieve information from one or more Gatherers or other Brokers, and incrementally update their indexes. The Broker records a unique identifier and an expiration date for each indexed object, garbage-collects old information, and invokes the Index/Search Subsystem when it receives an update or a query. Harvest provides a distinguished Broker called the Harvest Server Registry (HSR), which registers information about each Harvest Gatherer, Broker, Cache, and Replicator on the Internet.
- Index/Search Subsystem
- Harvest defines a general Broker-Indexer interface that can accommodate a variety of search engines. The principal requirements are that the backend supports Boolean combinations of attribute-based queries and that it supports incremental updates. Different backends can therefore be used inside a Broker. Harvest currently supports both commercial and free WAIS, as well as its own search engines, Glimpse and Nebula.
- Replicator
- Harvest provides a weakly consistent, replicated wide-area file system called mirror-d, on top of which Brokers are replicated. Mirror-d itself is layered on top of a hierarchical group communication subsystem.
- Object Cache
- To speed up network access, Harvest includes its own hierarchical Object Cache. The Cache sends "query" datagrams to each neighbor and parent, plus an ICMP echo to the object's home site, and chooses the fastest responding server from which to retrieve the data.
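The Object Cache's choice of source can be illustrated with a small sketch. It is not Harvest code: it assumes the probes (query datagrams to neighbors and parents, and the ICMP echo to the home site) have already been sent, and that each one either yielded a round-trip time or timed out. The `Probe` type, the `choose_source` function, the host names, and the timeout value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Probe:
    """Result of one probe (hypothetical record, not a Harvest structure)."""
    server: str               # host that was probed
    kind: str                 # "neighbor", "parent", or "home"
    rtt_ms: Optional[float]   # round-trip time; None if no reply arrived


def choose_source(probes: List[Probe], timeout_ms: float = 2000.0) -> Optional[Probe]:
    """Return the fastest responder, or None if nobody answered in time."""
    answered = [p for p in probes if p.rtt_ms is not None and p.rtt_ms <= timeout_ms]
    if not answered:
        return None
    return min(answered, key=lambda p: p.rtt_ms)


# Simulated probe results for one object (assumed values).
probes = [
    Probe("neighbor-a.example.net", "neighbor", 48.0),
    Probe("neighbor-b.example.net", "neighbor", None),   # timed out
    Probe("parent.example.net", "parent", 31.5),
    Probe("origin.example.org", "home", 120.0),          # ICMP echo to home site
]

best = choose_source(probes)
print(f"{best.server} ({best.kind})")
```

In this sketch the parent cache wins because it answered first. A real cache would also have to consider whether the responder actually holds the object, which this simplified model leaves out.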
Harvest therefore offers a complete public-domain solution for building a web-wide searchable information base. Its only disadvantage is its complexity compared with simpler, albeit much more limited, systems such as SWISH.
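To make the two backend requirements of the Broker-Indexer interface more concrete (Boolean combinations of attribute-based queries, and incremental updates), the following is a minimal in-memory sketch. It does not reproduce Harvest's actual interface or query syntax: the `ToyBroker` class, its method names, and the tuple-based query form are assumptions made purely for illustration.

```python
class ToyBroker:
    """Toy stand-in for a Broker with an in-memory index (hypothetical)."""

    def __init__(self):
        # object id -> (attribute dict, expiration timestamp)
        self._objects = {}

    def update(self, object_id, attributes, expires_at):
        """Incremental update: add or replace one object, no full rebuild."""
        self._objects[object_id] = (dict(attributes), expires_at)

    def garbage_collect(self, now):
        """Drop objects whose expiration date has passed; return their ids."""
        expired = sorted(oid for oid, (_, exp) in self._objects.items() if exp <= now)
        for oid in expired:
            del self._objects[oid]
        return expired

    def query(self, expr):
        """Return the ids of all objects matching a Boolean attribute query."""
        return sorted(
            oid for oid, (attrs, _) in self._objects.items() if _matches(expr, attrs)
        )


def _matches(expr, attrs):
    """Evaluate ("attr", name, word), ("and", ...), ("or", ...), ("not", q)."""
    op = expr[0]
    if op == "attr":
        _, name, word = expr
        return word.lower() in attrs.get(name, "").lower().split()
    if op == "and":
        return all(_matches(e, attrs) for e in expr[1:])
    if op == "or":
        return any(_matches(e, attrs) for e in expr[1:])
    if op == "not":
        return not _matches(expr[1], attrs)
    raise ValueError(f"unknown operator: {op!r}")


broker = ToyBroker()
broker.update("url:a", {"title": "Harvest user manual", "type": "HTML"}, expires_at=100)
broker.update("url:b", {"title": "Glimpse manual", "type": "PostScript"}, expires_at=50)

# title contains "manual" AND NOT type is "PostScript"
hits = broker.query(("and", ("attr", "title", "manual"),
                     ("not", ("attr", "type", "PostScript"))))
removed = broker.garbage_collect(now=60)
print(hits, removed)
```

Any backend that can answer queries of this shape and accept single-object updates could, under these assumptions, be swapped in behind the same three methods.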