Spidering BOF Report

[Note: This is an HTML version of the original notes from the Distributed Indexing/Searching Workshop]

Report by Michael Mauldin (Lycos)
(later edited by Michael Schwartz)

While the overall workshop goal was to determine areas where standards could be pursued, the Spidering BOF attempted to reach actual standards agreements on some immediate-term issues facing robot-based search services, at least among the spider-based search service representatives in attendance at the workshop (Excite, InfoSeek, and Lycos). The agreements fell into four areas, but we report only three of them here. The fourth concerned a KEYWORDS tag, which many workshop participants felt this BOF should not specify without the participation of other groups that have been working on that issue.

The remaining three areas were:

ROBOTS meta-tag

<META NAME="ROBOTS"
      CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">

default = empty = "ALL"
"NONE" = "NOINDEX, NOFOLLOW"

The filler is a comma-separated list of terms: ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.

Discussion: This tag is meant for users who cannot control the robots.txt file at their sites. It provides a last chance to keep their content out of search services. It was decided not to add syntax to allow robot-specific permissions within the meta-tag.

INDEX means that robots are welcome to include this page in search services.

FOLLOW means that robots are welcome to follow links from this page to find other pages.

So a value of "NOINDEX" allows the subsidiary links to be explored, even though the page is not indexed. A value of "NOFOLLOW" allows the page to be indexed, but no links from the page are explored (this may be useful if the page is a free entry point into pay-per-view content, for example). A value of "NONE" tells the robot to ignore the page.
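
As an illustration only, a robot might reduce the tag's value to two permissions, roughly as sketched below. The function name parse_robots_content is hypothetical. The report does not say how to treat unknown or conflicting terms or letter case, so this sketch assumes that unknown terms are ignored, that a later term overrides an earlier one, and that matching is case-insensitive.

```python
def parse_robots_content(content):
    """Return (index, follow) permissions for a ROBOTS meta-tag value.

    Hypothetical helper, not part of the BOF agreement.
    """
    # default = empty = "ALL": the page may be indexed and its links followed.
    index, follow = True, True
    for term in content.split(","):
        term = term.strip().upper()  # assumption: terms are case-insensitive
        if term == "ALL":
            index, follow = True, True
        elif term == "NONE":  # "NONE" = "NOINDEX, NOFOLLOW"
            index, follow = False, False
        elif term == "INDEX":
            index = True
        elif term == "NOINDEX":
            index = False
        elif term == "FOLLOW":
            follow = True
        elif term == "NOFOLLOW":
            follow = False
        # Assumption: anything else (including an empty term) is ignored.
    return index, follow
```

Under these assumptions, parse_robots_content("NOINDEX") yields (False, True): the page is not indexed but its links may still be explored.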

DESCRIPTION meta-tag

<META NAME="DESCRIPTION" CONTENT="...text...">

The intent is that the text can be used by a search service when printing a summary of the document. The text should not contain any formatting information.
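
A search service could pick up this text with an ordinary HTML parser. The sketch below uses Python's standard html.parser module. The names MetaCollector and description_of are hypothetical, and collapsing runs of whitespace is an assumption made here so that the summary prints on one line; the report itself only says that the text should carry no formatting.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect META NAME/CONTENT pairs from a document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name")
        content = attrs.get("content")
        if name and content is not None:
            self.meta[name.upper()] = content


def description_of(html_text):
    """Return the DESCRIPTION text, or None if the page has no such tag."""
    collector = MetaCollector()
    collector.feed(html_text)
    text = collector.meta.get("DESCRIPTION")
    if text is None:
        return None
    # Assumption: collapse whitespace so the summary is plain one-line text.
    return " ".join(text.split())
```

A page without the tag returns None, leaving the search service free to fall back on whatever summary it would otherwise produce.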

Other issues with ROBOTS.TXT

These are issues recommended for future standards discussion that could not be resolved within the scope of this workshop.

Ambiguities in the current specification
A means of canonicalizing sites, using: HTTP-EQUIV HOST ROBOTS.TXT ALIAS
Ways of supporting multiple robots.txt files per site ("robotsN.txt")
Ways of advertising content that should be indexed (rather than just restricting content that should not be indexed)
Flow control information: retrieval interval or maximum connections open to server
