The Web Robots Pages

Guidelines for Robot Writers

Martijn Koster, 1993

This document contains some suggestions for people who are thinking about developing Web Wanderers (Robots), programs that traverse the Web.

Reconsider

Are you sure you really need a robot? They put a strain on network and processing resources all over the world, so consider whether your purpose is really worth it. Also, the purpose for which you want to run your robot is probably not as novel as you think; there are already many other spiders out there. Perhaps you can make use of the data collected by one of the other spiders (check the list of robots and the mailing list). Finally, are you sure you can cope with the results? Retrieving the entire Web is not a scalable solution; it is simply too big. If you do decide to do it, don't aim to traverse the entire Web, only go a few levels deep.

Be Accountable

If you do decide you want to write and/or run one, make sure that if your actions do cause problems, people can easily contact you and start a dialogue. Specifically:
Identify your Web Wanderer
HTTP supports a User-agent field to identify a WWW browser. As your robot is a kind of WWW browser, use this field to name your robot e.g. "NottinghamRobot/1.0". This will allow server maintainers to set your robot apart from human users using interactive browsers. It is also recommended to run it from a machine registered in the DNS, which will make it easier to recognise, and will indicate to people where you are.
Identify yourself
HTTP supports a From field to identify the user who runs the WWW browser. Use this to advertise your email address e.g. "j.smith@somewhere.edu". This will allow server maintainers to contact you in case of problems, so that you can start a dialogue on better terms than if you were hard to track down.
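By way of illustration, here is a minimal sketch (in Python, chosen purely for convenience here; the URL is a placeholder, and the name and address are the example values above) of a request that carries both identifying fields:

    import urllib.request

    # Identify both the program (User-Agent) and the person running it (From).
    # The URL is a placeholder; the name and address are the example values above.
    request = urllib.request.Request(
        "http://www.example.com/",
        headers={
            "User-Agent": "NottinghamRobot/1.0",   # names the robot itself
            "From": "j.smith@somewhere.edu",       # names its operator
        },
    )
    with urllib.request.urlopen(request) as response:
        page = response.read()
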
Announce It
Post a message to comp.infosystems.www.providers before running your robot. If people know in advance they can keep an eye out. I maintain a list of active Web Wanderers, so that people who wonder about access from a certain site can quickly check if it is a known robot -- please help me keep it up-to-date by informing me of any missing ones.
Announce it to the target
If you are only targeting a single site, or a few, contact its administrator and inform him/her.
Be informative
Server maintainers often wonder why their server is hit. If you use the HTTP Referer field you can tell them. This costs no effort on your part, and may be informative.
Be there
Don't set your Web Wanderer going and then go on holiday for a couple of days. If in your absence it does things that upset people, you are the only one who can fix it. It is best to remain logged in to the machine that is running your robot, so people can use "finger" and "talk" to contact you.

Suspend the robot when you're going to be away for a number of days (e.g. over the weekend); only run it in your presence. Yes, it may be better for the performance of your machine to run it overnight, but that ignores the performance overhead it imposes on other people's machines. Yes, the robot will take longer to run, but that is more an indication that robots are not the way to do things anyway than an argument for running it continually; after all, what's the rush?

Notify your authorities
It is advisable to tell your system administrator / network provider what you are planning to do. You will be asking a lot of the services they offer, and if something goes wrong they would rather hear it from you first than from people outside.

Test Locally

Don't run repeated tests on remote servers; instead, run a number of servers locally and use them to test your robot first. When going off-site for the first time, stay close to home (e.g. start from a page with local servers). After doing a small run, analyse your performance and your results, and estimate how they scale up to thousands of documents. It may soon become obvious that you can't cope.

Don't hog resources

Robots consume a lot of resources. To minimise the impact, keep the following in mind:
Walk, don't run
Make sure your robot runs slowly: although robots can handle hundreds of documents per minute, this puts a large strain on a server and is guaranteed to infuriate the server maintainer. Instead, put a sleep in, or, if you're clever, rotate queries between different servers in round-robin fashion. Retrieving one document per minute is a lot better than one per second; one per five minutes is better still. Yes, your robot will take longer, but what's the rush? It's only a program.
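A sketch of what such a paced, round-robin loop might look like (Python, used purely for illustration; the fetch routine and the per-host queues are assumptions, not part of any particular robot):

    import time
    from collections import deque

    DELAY = 60  # seconds between any two requests: one document per minute, or slower

    def crawl(queues_by_host, fetch):
        """queues_by_host maps a host name to a deque of URLs still to visit;
        fetch is whatever retrieval routine the robot uses."""
        hosts = deque(queues_by_host)
        while hosts:
            host = hosts.popleft()
            urls = queues_by_host[host]
            if urls:
                fetch(urls.popleft())   # one document from this host...
                hosts.append(host)      # ...then move on to the next host
                time.sleep(DELAY)       # walk, don't run
            # hosts whose queues are empty simply drop out of the rotation
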
Use If-modified-since or HEAD where possible
If your application can use the HTTP If-Modified-Since header, or the HEAD method, for its purposes, that gives much less overhead than full GETs.
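A sketch of both alternatives (Python, purely illustrative; the URL and date are placeholders):

    import urllib.request
    import urllib.error

    # HEAD: headers only, no document body.
    head = urllib.request.Request("http://www.example.com/doc.html", method="HEAD")
    with urllib.request.urlopen(head) as response:
        last_modified = response.headers.get("Last-Modified")

    # Conditional GET: the server answers 304 Not Modified if nothing has changed.
    conditional = urllib.request.Request(
        "http://www.example.com/doc.html",
        headers={"If-Modified-Since": "Sat, 01 Jan 1994 00:00:00 GMT"},
    )
    try:
        body = urllib.request.urlopen(conditional).read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            body = None   # the copy already held is still current
        else:
            raise
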
Ask for what you want
HTTP has an Accept field in which a browser (or your robot) can specify the kinds of data it can handle. Use it: if you only analyse text, say so. This will allow clever servers to not bother sending you data you can't handle and would have to throw away anyway. Also, make use of URL suffixes if they're there.
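For example, a sketch (Python, illustrative only) of a request that asks only for text, and double-checks what actually came back:

    import urllib.request

    request = urllib.request.Request(
        "http://www.example.com/",
        headers={"Accept": "text/plain, text/html"},   # only the types the robot can parse
    )
    with urllib.request.urlopen(request) as response:
        if response.headers.get_content_type() in ("text/plain", "text/html"):
            text = response.read()
        else:
            text = None   # the server sent something the robot cannot handle anyway
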
Ask only for what you want
You can build in some logic yourself: if a link refers to a ".ps", ".zip", ".Z", ".gif" etc., and you only handle text, then don't ask for it. Although suffixes are not the modern way to do things (Accept is), there is an enormous installed base out there that uses them (especially FTP sites). Also look out for gateways (e.g. URLs starting with finger), News gateways, WAIS gateways, etc., and think about other protocols ("news:", "wais:", etc.). Don't forget the sub-page references (<A HREF="#abstract">) -- don't retrieve the same page more than once. It is imperative to make a list of places not to visit before you start.
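A sketch of such a pre-fetch filter (Python for illustration; the suffix and scheme checks are not exhaustive):

    from urllib.parse import urldefrag, urlparse

    SKIP_SUFFIXES = (".ps", ".zip", ".Z", ".gif")
    visited = set()

    def worth_fetching(url):
        url, _fragment = urldefrag(url)          # drop "#abstract"-style sub-page references
        parts = urlparse(url)
        if parts.scheme != "http":               # skip news:, wais:, gopher:, mailto:, ...
            return None
        if parts.path.endswith(SKIP_SUFFIXES):   # binary formats a text-only robot can't use
            return None
        if url in visited:                       # the "list of places not to visit"
            return None
        visited.add(url)
        return url                               # cleaned URL, safe to queue for retrieval
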
Check URL's
Don't assume the HTML documents you are going to get back are sensible. When scanning for URLs, be wary of things like <A HREF=" http://somehost.somedom/doc>. A lot of sites don't put the trailing / on URLs for directories, and a naive strategy of concatenating the names of sub-URLs can result in bad names.
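One defensive approach is to strip stray whitespace and resolve each link against the page it was found on, rather than pasting strings together by hand; a sketch (Python, with illustrative values):

    from urllib.parse import urljoin

    base = "http://somehost.somedom/dir/index.html"   # the page the link was found on
    href = " other.html "                             # as found in the HTML, stray spaces and all

    url = urljoin(base, href.strip())
    # -> "http://somehost.somedom/dir/other.html"
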
Check the results
Check what comes back. If a server refuses a number of documents in a row, check what it is saying. It may be that the server refuses to let you retrieve these things because you're a robot.
Don't Loop or Repeat
Remember all the places you have visited, so you can check that you're not looping. Check to see if the different machine addresses you have are not in fact the same box (e.g. web.nexor.co.uk is the same machine as "hercules.nexor.co.uk" and 128.243.219.1) so you don't have to go through it again. This is imperative.
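A sketch of the host comparison (Python; the host names are the examples just given, and whether they still resolve to the same address is of course not guaranteed):

    import socket

    visited_hosts = set()

    def already_seen(hostname):
        try:
            address = socket.gethostbyname(hostname)   # aliases resolve to one address
        except OSError:
            address = hostname                         # fall back to the name itself
        if address in visited_hosts:
            return True
        visited_hosts.add(address)
        return False

    already_seen("web.nexor.co.uk")        # False: first visit
    already_seen("hercules.nexor.co.uk")   # True, if it resolves to the same address
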
Run at opportune times
On some systems there are preferred times of access, when the machine is only lightly loaded. If you plan to do many automatic requests from one particular site, check with its administrator(s) when the preferred time of access is.
Don't run it often
How often people find acceptable differs, but I'd say once every two months is probably too often. Also, when you re-run it, make use of your previous data: you know which URLs to avoid. Make a list of volatile links (like the what's new page, and the meta-index). Use this to get pointers to other documents, and concentrate on new links -- this way you will get a high initial yield, and if you stop your robot for some reason, at least it has spent its time well.
Don't try queries
Some WWW documents are searchable (ISINDEX) or contain forms. Don't follow these. The Fish Search does this, for example, which may result in a search for "cars" being sent to databases of computer science PhDs, people in the X.500 directory, or botanical data. Not sensible.

Stay with it

It is vital that you know what your robot is doing, and that it remains under control.
Log
Make sure it provides ample logging, and it wouldn't hurt to keep certain statistics, such as the number of successes/failures, the hosts accessed recently, and the average size of recent files, and keep an eye on them. This ties in with the "Don't Loop" section -- you need to log where you have been to prevent looping. Again, estimate the required disk space; you may find you can't cope.
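A sketch of the sort of minimal bookkeeping this implies (Python; the log file name and the choice of statistics are illustrative):

    import logging

    logging.basicConfig(filename="robot.log", level=logging.INFO,
                        format="%(asctime)s %(message)s")

    stats = {"ok": 0, "failed": 0, "bytes": 0}

    def record(url, host, status, size):
        if status == 200:
            stats["ok"] += 1
            stats["bytes"] += size
        else:
            stats["failed"] += 1
        logging.info("%s %s %d %d", host, url, status, size)
        if (stats["ok"] + stats["failed"]) % 100 == 0:   # periodic summary
            average = stats["bytes"] / max(stats["ok"], 1)
            logging.info("ok=%d failed=%d average-size=%.0f",
                         stats["ok"], stats["failed"], average)
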
Be interactive
Arrange to be able to guide your robot while it runs. Commands that suspend or cancel the robot, or make it skip the current host, can be very useful. Checkpoint your robot frequently, so that you don't lose everything if it falls over.
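Checkpointing need not be elaborate; a sketch (Python, with an illustrative file name) that saves the visited set and the queue of outstanding URLs:

    import pickle

    CHECKPOINT = "robot.checkpoint"

    def save_checkpoint(visited, queue):
        with open(CHECKPOINT, "wb") as handle:
            pickle.dump({"visited": visited, "queue": queue}, handle)

    def load_checkpoint():
        try:
            with open(CHECKPOINT, "rb") as handle:
                state = pickle.load(handle)
                return state["visited"], state["queue"]
        except FileNotFoundError:
            return set(), []    # no checkpoint yet: start from scratch
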
Be prepared
Your robot will visit hundreds of sites. It will probably upset a number of people. Be prepared to respond quickly to their enquiries, and tell them what you're doing.
Be understanding
If your robot upsets someone, instruct it not to visit his/her site, or to visit only the home page. Don't lecture him/her about why your cause is worth it, because they probably aren't in the least interested. If you encounter barriers that people put up to stop your access, don't try to go around them just to show that on the Web it is difficult to limit access. I have actually had this happen to me, and although I'm not normally violent, I was ready to strangle this person, as he was deliberately wasting my time. I have written a standard practice proposal for a simple method of excluding servers. Please implement this practice, and respect the wishes of the server maintainers.
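That proposal became the robot exclusion standard (a "/robots.txt" file on each server). A minimal sketch of honouring it, using Python's standard robotparser module purely as an illustration:

    from urllib import robotparser

    parser = robotparser.RobotFileParser("http://www.example.com/robots.txt")
    parser.read()    # fetch and parse the server's exclusion file

    if parser.can_fetch("NottinghamRobot/1.0", "http://www.example.com/private/"):
        pass   # allowed: go ahead and retrieve it
    else:
        pass   # excluded: respect the maintainer's wishes and skip it
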

Share results

OK, so you are using the resources of a lot of people to do this. Do something back:
Keep results
This may sound obvious, but think about what you are going to do with the retrieved documents. Try to keep as much information as you can possibly store. This will make the results optimally useful.
Raw Result
Make your raw results available, via FTP, the Web, or whatever. This means other people can use them, and won't need to run robots of their own.
Polished Result
You are running a robot for a reason; probably to create a database, or gather statistics. If you make these results available on the Web people are more likely to think it worth it. And you might get in touch with people with similar interests.
Report Errors
Your robot might come across dangling links. You might as well publish them on the Web somewhere (after checking that they really are dangling). If you are convinced they are in error (as opposed to restricted), notify the administrator of the server.

Examples

This is not intended to be a public flaming forum or a "Best/Worst Robot" league table. But it shows the problems are real, and that the guidelines help alleviate them. Heh, maybe a league table isn't too bad an idea anyway.

Examples of how not to do it

The robot which retrieved the same sequence of about 100 documents on three occasions in four days. And the machine couldn't be fingered. The results were never published. Sigh.

The robot run from phoenix.doc.ic.ac.uk in Jan 94. It provided no User-agent or From field, one couldn't finger the host, and it was not part of a publicly known project. In addition, it was reported to retrieve documents it couldn't handle. It has since improved.

The Fish search capability added to Mosaic. One instance managed to retrieve 25 documents in under one minute.

Better examples

The RBSE-Spider, run in December 93. It had a User-agent field, and after a finger to the host it was possible to open a dialogue with the robot writers. Their web server explained the purpose of it.

Jumpstation: the results are presented in a searchable database, the author announced it, and he is considering making the raw results available. Unfortunately some people complained about the high rate at which documents were retrieved.

Why?

Why am I rambling on about this? Because it annoys me to see that people cause other people unnecessary hassle, and the whole discussion can be so much gentler. And because I run a server that is regularly visited by robots, and I am worried they could make the Web look bad.

This page has been contributed to by Jonathon Fletcher (JumpStation Robot author), Lee McLoughlin (L.McLoughlin@doc.ic.ac.uk), and others.

