7.3 Robots

http://info.webcrawler.com/mak/projects/robots/robots.html

http://altavista.digital.com/

http://www.lycos.com

http://inktomi.berkeley.edu/query.html

Robots are software programs that traverse the Web automatically. Robots are sometimes also called Web Wanderers, Web Crawlers, or Spiders. These names are somewhat misleading, as they give the impression that the software itself moves between sites like a virus. In reality, a robot simply visits sites by requesting documents from them. Search engines such as WAIS, Lycos, AltaVista, or Inktomi are not robots themselves, but programs that search through collections of information gathered, for example, by a robot. Robots are used on the Web for different purposes, such as indexing, HTML validation, link validation, "What's New" monitoring, and mirroring.

Indexing robots

The most popular application of robots is gathering web pages to be indexed for search engines. Indexing robots employ various search strategies to decide which web sites to visit. In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web. Most indexing services also allow URLs to be submitted manually; these are then added to the historical list. Using those starting points, a robot selects URLs to visit and index, and to parse and use as a source for new URLs. If an indexing robot knows about a document, it may decide to parse it and insert it into its database. How this is done depends on the robot: some robots index only the HTML titles or the first few paragraphs; others parse the entire HTML text and index all words, with weighting depending on HTML constructs.
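The core of such a robot is parsing a fetched document for new URLs to follow. As a minimal sketch (not from the original text; the class name LinkExtractor and the sample page are illustrative), the standard Python HTML parser can be used to collect the targets of anchor tags:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets of anchor tags, as an indexing robot
    would when harvesting new URLs from a parsed document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Anchor tags carry the outgoing links of the page.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# A hypothetical fetched document.
page = ('<html><head><title>Demo</title></head>'
        '<body><a href="http://www.lycos.com">Lycos</a>'
        '<a href="/local/page.html">Local</a></body></html>')

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['http://www.lycos.com', '/local/page.html']
```

A real robot would resolve relative links against the page's base URL, queue the results for later visits, and avoid revisiting URLs it has already seen.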

Risks of using robots

http://www.w3.org/pub/WWW/Robot/

Robots should only be used on the Web by experienced web programmers, as there are risks involved when robots are let loose without consideration of the consequences:

One potential problem with indexing robots is that centralized search databases built by web-wide indexing robots do not scale easily to millions of documents on millions of sites.

Keeping robots out

Most robots provide valuable services to the web community. Nevertheless, a standard has been established for keeping robots away from certain pages, or even blocking them from visiting a server entirely. To block all robots from visiting a server, the following two lines need to be placed in the /robots.txt file at the root of the server's document tree:

User-agent: *
Disallow: /

It is also possible to specify selectively in the /robots.txt file whether certain robots should be prohibited from visiting particular files. Of course, this procedure works only if the robot itself obeys the protocol.
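On the robot's side, obeying the protocol means checking each URL against the server's rules before requesting it. As a sketch (the robot names and URLs are hypothetical), Python's standard urllib.robotparser module evaluates both the blanket rule shown above and selective, per-robot rules:

```python
from urllib.robotparser import RobotFileParser

# The two lines shown above: block every robot from the entire server.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])
# A well-behaved robot checks before fetching; here the answer is no.
print(rp.can_fetch("AnyRobot", "http://example.com/index.html"))  # False

# Selective rules: keep only one robot out of a particular directory.
rp2 = RobotFileParser()
rp2.parse([
    "User-agent: BadBot",
    "Disallow: /private/",
])
print(rp2.can_fetch("BadBot", "http://example.com/private/data.html"))    # False
print(rp2.can_fetch("OtherBot", "http://example.com/private/data.html"))  # True
```

In practice a robot would load the rules from http://the-server/robots.txt before crawling that server; the check itself is purely advisory, which is why the text stresses that the protocol only helps against robots that honor it.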