The Web Robots Pages

Evaluation of the Standard for Robots Exclusion

Martijn Koster, 1996

Abstract

This paper evaluates the Standard for Robots Exclusion, identifies some of its problems and feature requests, and recommends future work.

Introduction

The Standard for Robots Exclusion (SRE) was first proposed in 1994, as a mechanism for keeping robots out of unwanted areas of the Web. Such unwanted areas included:

The Architecture

The main design considerations to achieve this goal were:

This specifically ruled out special network-level protocols, platform-specific solutions, or changes to clients or servers.

Instead, the mechanism uses a specially formatted resource, at a known location in the server's URL space. In its simplest form the resource could be a text file produced with a text editor, placed in the root-level server directory.
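For illustration, a minimal '/robots.txt' in the format the proposal defines could look like this (the excluded paths are of course site-specific):

    # Keep all robots out of the cgi-bin and tmp areas
    User-agent: *
    Disallow: /cgi-bin/
    Disallow: /tmp/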

This formatted-file approach satisfied the design considerations: The administration was simple, because the format of the file was easy to understand, and required no special software to produce. The implementation was simple, because the format was simple to parse and apply. The deployment was simple, because no client or server changes were required.

Indeed the majority of robot authors rapidly embraced this proposal, and it has received a great deal of attention in both Web-based documentation and the printed press. This in turn has promoted awareness and acceptance amongst users.

Problems and Feature Requests

In the years since the initial proposal, a lot of practical experience with the SRE has been gained, and a considerable number of suggestions for improvement or extensions have been made. They broadly fall into the following categories:

  1. operational problems
  2. general Web problems
  3. further directives for exclusion
  4. extensions beyond exclusion

I will discuss some of the most frequent suggestions in that order, and give some arguments in favour of or against them.

One main point to keep in mind is that it is difficult to gauge how much of an issue these problems are in practice, and how widespread support for extensions would be. When considering further development of the SRE it is important to prevent second-system syndrome.

Operational problems

These relate to the administration of the SRE, and therefore to the effectiveness of the approach for its purpose.

Administrative access to the /robots.txt resource

The SRE specifies a location for the resource, in the root level of a server's URL space. Modifying this file generally requires administrative access to the server, which may not be granted to a user who would like to add exclusion directives to the file. This is especially common in large multi-user systems.

It can be argued this is not a problem with the SRE, which after all does not specify how the resource is administered. It is for example possible to programmatically collect individual users' '~/robots.txt' files, combining them into a single '/robots.txt' file on a regular basis. How this could be implemented depends on the operating system, server software, and publishing process. In practice users find their administrators unwilling or unable to provide such a solution. This indicates again how important it is to stress simplicity; even if the extra effort required is minuscule, requiring changes in practices, procedures, or software is a major barrier to deployment.
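As an illustration of such a collection scheme, here is a minimal sketch in Python; the file locations, and the assumption that each user's file contains only site-relative 'Disallow:' lines, are hypothetical, and a real installation would depend on the local setup:

    # Sketch: merge per-user robots files into one site-wide '/robots.txt'.
    # Assumptions (hypothetical): each user keeps a 'robots.txt' in their home
    # directory containing only 'Disallow:' lines with site-relative paths,
    # and the combined rules apply to all robots ('User-agent: *').
    import glob

    OUTPUT = "/usr/local/httpd/htdocs/robots.txt"   # assumed document root

    disallows = []
    for path in sorted(glob.glob("/home/*/robots.txt")):
        with open(path) as f:
            disallows += [line.strip() for line in f
                          if line.lower().startswith("disallow:")]

    with open(OUTPUT, "w") as out:
        out.write("User-agent: *\n")
        out.write("\n".join(disallows) + "\n")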

Suggested ways to alleviate the problem include a CGI script which combines multiple individual files on the fly, or listing multiple referral files in the '/robots.txt' which the robot can retrieve and combine. Both these options suffer from the same problem: some administrative access is still required.

This is the most painful operational problem, and cannot be sufficiently addressed in the current design. It seems that the only solution is to move the robot policy closer to the user, in the URL space they do control.

File specification

The SRE allows only a single method for specifying parts of the URL space: by substring anchored at the front. People have asked for substrings anchored at the end, as in "Disallow: *.shtml", as well as generalised regular-expression matching, as in "Disallow: *sex*".

The issue with this extension is that it increases the complexity of both administration and implementation. In this case I feel this may be justified.
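To make the three matching styles concrete, here is a small illustrative sketch in Python; only the first, prefix form is part of the SRE, the other two are the requested extensions:

    import re

    def prefix_match(rule, url_path):
        # Current SRE behaviour, e.g. "Disallow: /users"
        return url_path.startswith(rule)

    def suffix_match(rule, url_path):
        # Proposed suffix form, e.g. "Disallow: *.shtml"
        return url_path.endswith(rule.lstrip("*"))

    def pattern_match(rule, url_path):
        # Proposed generalised form, e.g. "Disallow: *sex*"
        regex = re.escape(rule).replace(r"\*", ".*")
        return re.search(regex, url_path) is not None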

Redundancy for specific robots

The SRE allows for specific directives for individual robots. This may result in considerable repetition of rules common to all robots. It has been suggested that an OO inheritance scheme could address this.
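For example, a site that wants the same rules for two individually named robots currently has to spell them out twice (the robot names here are purely illustrative):

    User-agent: RobotA
    Disallow: /tmp/
    Disallow: /private/

    User-agent: RobotB
    Disallow: /tmp/
    Disallow: /private/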

In practice the per-robot distinction is not that widely used, and the need seems to be sporadic. The increased complexity of both administration and implementation seems prohibitive in this case.

Scalability

The SRE groups all rules for the server into a single file. This doesn't scale well to thousands or millions of individually specified URL's.

This is a fundamental problem, and one that can only be solved by moving beyond a single file, and bringing the policy closer to the individual resources.

Web problems

These are problems faced by the Web at large, which could be addressed (at least for robots) separately using extensions to the SRE. I am against following that route, as it fixes the problem in the wrong place. These issues should be addressed by proper general solutions separate from the SRE.

"Wrong" domain names

The use of multiple domain names sharing a logical network interface is a common practice (even without vanity domains), which often leads to problems with indexing robots, which may end up using an undesired domain name for a given URL.

This could be addressed by adding a "preferred" address, or even encoding "preferred" domain names for certain parts of a URL space. This again increases complexity, and doesn't solve the problem for non-robot clients, which can suffer the same fate.

The issue here is that deployed HTTP software doesn't have a facility to indicate the host part of the HTTP URL, and a server therefore cannot use that to decide the availability of a URL. HTTP 1.1 and later address this using a Host header and full URI's in the request line. This will address the problem across the board, but will take time to be deployed and used.
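For example, an HTTP 1.1 request names the intended host explicitly, so the server can tell which domain name the client used:

    GET /index.html HTTP/1.1
    Host: www.example.com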

Mirrors

Some servers, such as "webcrawler.com", run identical URL spaces on several different machines, for load balancing or redundancy purposes. This can lead to problems when a robot uses only the IP address to uniquely identify a server; the robot would traverse and list each instance of the server separately.

It is possible to list alternative IP addresses in the /robots.txt file, indicating equivalency. However, in the common case where a single domain name is used for these separate IP addresses this information is already obtainable from the DNS.
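For instance, a robot can already ask the DNS for all addresses behind a single name and treat them as one logical server; a minimal sketch in Python:

    import socket

    # 'webcrawler.com' is the example host mentioned above.
    name, aliases, addresses = socket.gethostbyname_ex("webcrawler.com")
    print(name, addresses)   # all returned addresses belong to one logical server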

Updates

Currently robots can only track updates by frequent revisits. There seem to be a few alternatives: the robot could request a notification when a page changes, the robot could ask for modification information in bulk, or the SRE could be extended to suggest expirations on URL's.

This is a more general problem, and ties in to caching issues and link consistency. I will not go into the first two options as they do not concern the SRE. The last option would duplicate existing HTTP-level mechanisms such as Expires, only because they are currently difficult to configure in servers. It seems to me this is the wrong place to solve that problem.
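For comparison, HTTP already allows a server to attach expiry information to any response, which a robot could use to schedule revisits:

    HTTP/1.0 200 OK
    Content-Type: text/html
    Expires: Thu, 01 Dec 1996 16:00:00 GMT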

Further directives for exclusion

These concern further suggestions to reduce robot-generated problems for a server. All of these are easy to add, at the cost of more complex administration and implementation. They also bring up the issue of partial compliance: not all robots may be willing or able to support all of these. Given that the importance of these extensions is secondary to the SRE's purpose, I suggest they be listed as MAY or SHOULD options, not MUST.

Multiple prefixes per line

The SRE doesn't allow multiple URL prefixes on a single line, as in "Disallow: /users /tmp". In practice people do this, so the implementation (if not the SRE) could be changed to condone this practice.

Hit rate

This directive could indicate to a robot how long to wait between requests to the server. Currently it is accepted practice to wait at least 30 seconds between requests, but this is too fast for some sites and too slow for others.

A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space.

ReVisit frequency

This directive could indicate how long a robot should wait before revisiting pages on the server.

A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space.

This appears to duplicate some of the existing (and future) cache-consistency measures such as Expires.

Visit frequency for '/robots.txt'

This is a special version of the directive above, specifying how often the '/robots.txt' file itself should be refreshed.

Again Expires could be used to do this.

Visiting hours

It has often been suggested to list certain hours as "preferred hours" for robot accesses. These would be given in GMT, and would probably list local low-usage time.

A limitation is that this would specify a value for the entire site, whereas the value may depend on specific parts of the URL space.
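Purely as an illustration, the directives suggested in this and the preceding sections might be written as follows; the directive names and values are hypothetical, not part of the SRE:

    User-agent: *
    Disallow: /tmp/
    # Hypothetical extensions:
    Hit-rate: 60                  # seconds to wait between requests
    Revisit-after: 7 days         # how long before pages are re-fetched
    Visiting-hours: 02:00-06:00   # preferred access window, in GMT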

Visiting vs indexing

The SRE specifies URL prefixes that are not to be retrieved. In practice we find it is used both for URL's that are not to be retrieved and for ones that are not to be indexed, and that the distinction is not explicit.

For example, a page with links to a company's employee pages may not be all that desirable in an index, whereas the employee pages themselves are; the robot should be allowed to recurse on the parent page to get to the child pages and index them, without indexing the parent.

This could be addressed by adding a "DontIndex" directive.
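Using such a hypothetical directive, the employee-listing example might read (the path is illustrative):

    User-agent: *
    DontIndex: /staff/index.html   # follow links on this page, but don't index it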

Extensions beyond exclusion

The SRE's aim was to reduce abuses by robots, by specifying what is off-limits. It has often been suggested to add more constructive information. I strongly believe such constructive information would be of immense value, but I contest that the '/robots.txt' file is the best place for this. In the first place, there may be a number of different schemes for providing such information; keeping exclusion and "inclusion" separate allows multiple inclusion schemes to be used, or the inclusion scheme to be changed without affecting the exclusion parts. Given the broad debates on meta information this seems prudent.

Some of you may actually not be aware of ALIWEB, a separate pilot project I set up in 1994 which used a '/site.idx' file in IAFA format, as one way of making such inclusive information available. A full analysis of ALIWEB is beyond the scope of this document, but as it used the same concept as the '/robots.txt' (single resource on a known URL), it shares many of the problems outlined in this document. In addition there were issues with the exact nature of the meta data, the complexity of administration, the restrictiveness of the RFC822-like format, and internationalisation issues. That experience suggests to me that this does not belong in the '/robots.txt' file, except possibly in its most basic form: a list of URL's to visit.

For the record, people's suggestions for inclusive information included:

Recommendations

I have outlined most of the problems and missing features of the SRE. I have also indicated that I am against most of the extensions to the current scheme, because of increased complexity, or because the '/robots.txt' file is the wrong place to solve the problem. Here is what I believe we can do to address these issues.

Moving policy closer to the resources

To address the issues of scaling and administrative access, it is clear we must move beyond the single resource per server. There is currently no effective way in the Web for clients to consider collections (subtrees) of documents together. Therefore the only option is to associate policy with the resources themselves, i.e. the pages identified by a URL.

This association can be done in a few ways:

Embedding the policy in the resource itself
This could done using the META tag, e.g. <META NAME="robotpolicy" CONTENT="dontindex">. While this would only work for HTML, it would be extremely easy for a user to add this information to their documents. No software or administrative access is required for the user, and it is really easy to support in the robot.
Embedding a reference to the policy in the resource
This could be done using the LINK tag, e.g. <LINK REL="robotpolicy" HREF="public.pol"> This would give the extra flexibility of sharing a policy among documents, and supporting different policy encodings which could move beyond RFC822-like syntax. The drawback is increased traffic (using regular caching) and complexity.
Using an explicit protocol for the association
This could be done using PEP, in a similar fashion to PICS. It may even be possible or beneficial to use the PICS framework as the infrastructure, and express the policy as a rating.
Note that this can be deployed independently of, and used together with, a site's '/robots.txt'.

I suggest the first option should be an immediate first step, with the other options possibly following later.

Meta information

The same three approaches can be used for descriptive META information:
Embedding the meta information in the resource itself
This could done using the META tag, e.g. <META NAME="description" CONTENT="...">. The nature of the META information could be the Dublin core set, or even just "description" and "keywords". While this would only work for HTML, it would be extremely easy for a user to add this information to their documents. No software or administrative access is required for the user, and it is really easy to support in the robot.
Embedding a reference to the meta information in the resource
This could be done using the LINK tag, e.g. <LINK REL="meta" HREF="doc.meta"> This would give the extra flexibility of sharing meta information among documents, and supporting different meta encodings which could move beyond RFC822-like syntax (which can even be negotiated using HTTP content type negotiation!) The drawback is increased traffic (using regular caching) and complexity.
Using an explicit protocol for the association
This could be done using PEP, in a similar fashion to PICS. It may even be possible or beneficial to use the PICS framework as the infrastructure, and express the meta information as a rating.
I suggest the first option should be an immediate first step, with the other options possibly following later.
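Taken together, the two immediate first steps amount to no more than a few extra lines in a document's head; for example, using the names suggested above:

    <HTML>
    <HEAD>
    <TITLE>Staff directory</TITLE>
    <META NAME="robotpolicy" CONTENT="dontindex">
    <META NAME="description" CONTENT="Links to the home pages of our staff">
    <META NAME="keywords" CONTENT="staff, people, home pages">
    </HEAD>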

Extending the SRE

The measures above address some of the problems with the SRE in a more scalable and flexible way than adding a multitude of directives to the '/robots.txt' file.

I believe that of the suggested additions, this one will have the most benefit, without adding complexity:

PleaseVisit
To suggest relative URL's to visit on the site
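In a '/robots.txt' file this might look as follows (the directive name comes from the suggestion above; the path is illustrative):

    User-agent: *
    Disallow: /cgi-bin/
    PleaseVisit: /whats-new.html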

Standards...

I believe any future version of the SRE should be documented either as an RFC or a W3C-backed standard.