Protocol Gives Sites Way To Keep Out The 'Bots

By Jeremy Carl

Some advanced Web practitioners at heavily trafficked sites -- and sites with sensitive data -- are turning to a public-domain technology to keep automatically dispatched robots from traversing their servers.

The technology, called the robot exclusion protocol, has been around for more than a year but is just starting to be widely used. The protocol serves as a sort of "Do Not Enter" sign for robots, or 'bots, which are automated software programs that patrol the Web, sending back information to central servers.

Robots are a key component of the Web search technology used in such Internet search engines as Lycos and InfoSeek. But not all sites want to be indexed, and more and more sites are using the protocol to keep certain pages away from the tireless bots.

Tim Bray, vice president of technology for Open Text, whose Web Index respects the protocol, said there are a variety of reasons why people might want to avoid being indexed.

"Some sites, such as newspaper sites, are so volatile that they change every 20 minutes or so," he said. "Any robot that happens to be searching on that site is likely to come up with a dead or useless link by the time it's entered in the site index. Also, some URLs aren't files at all, they are simply instructions to other robots."

Server load, which robots add to as they crawl a site, is another reason some sites have chosen to forbid them.

"We've had such a problem with the demand being so high, that we have had to restrict information to our regular users," said Andy Mitchell, a publicist at CNN. "In the future, when we expand capacity, we will allow robots into the site."

Of course, such banning is not a cure-all for server overload. Despite banning bots, CNN's site was still so overloaded in the minutes after the controversial verdict in the O.J. Simpson murder trial that it was virtually inaccessible.

Bray said another problem for search robots is synthetically generated URLs, which can be infinite in number. For example, when doing an Open Text search, one has the opportunity to search for pages similar to a page already found, generating an on-the-fly URL of a synthetic index page. Because there is an infinite variety of potential search queries, there is also an infinite variety of pages that can be generated on the fly. However, none of these URLs have any meaning beyond that local search, so any robot picking them up would be wasting time and clogging the search index.

So ironically, Open Text and many other sites that make their living using search robots must ban them from much of their own sites. "We will actually generate a whole series of synthetic URLs on the fly for page similarity and dynamic sampling," said Bray. "They are only meaningful so long as they are on your screen."

The protocol was devised in early 1994, when Martijn Koster, then of Nexor and now an employee of AOL's WebCrawler project, developed it on his own initiative. The protocol is simply a file on the server listing the names of user agents (robots) and the paths they may not visit. The file can exclude certain types of robots entirely, as well as excluding robots from certain areas of a particular site.
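For reference, here is a minimal sketch of such a file, which by convention is named robots.txt and placed at the top level of the server. The robot name and paths shown are invented for illustration; a real site would substitute its own.

    # Bar one particular robot from the entire site
    User-agent: BadBot
    Disallow: /

    # Keep all other robots out of on-the-fly search results
    # and pages still under development
    User-agent: *
    Disallow: /cgi-bin/search
    Disallow: /under-construction/

A robot that honors the protocol fetches this file before crawling and skips any path matching a Disallow line listed under its own name or under the wildcard entry.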

Koster said that in recent months many users of the robot exclusion protocol have asked him to help create a type of robot guidance protocol. This new protocol would combine robot exclusion with the basic ideas of Aliweb, a service pioneered by Koster in late 1993. Aliweb allowed robots to retrieve, once a day, site summaries written by the site administrators themselves. Because each robot fetched a summary only once per day, server load was kept down. And because the summaries were written by a human being, Aliweb typically provided a more useful description than the jumble of text cobbled together by a robot.

Koster is currently working on combining the code for robot exclusion and robot guidance, which has proved to be a rather thorny task. Generally, though, users such as Bray are already quite pleased with Koster's invention. "No one has come up to me and said 'Martijn, this is a really stupid idea,'" Koster said.

Another reason to use the robot exclusion protocol is to keep robots out of site areas that are under development. In at least one recent case, it was the penetration of an under-development area that made a site administrator aware of the protocol. That happened when some Webmasters at Netscape Communications discovered that a still-evolving portion of their site had turned up in Architext's excite index. A hurried call from Netscape to Architext, asking how its engine had found the data, ended with Netscape applying robot exclusion to the appropriate pages.

Search-engine sites may also use the protocol to block robots from meta-search engines (engines that search other search engines). Meta-search engines such as SavvySearch pull results from regular search engines without requiring users to visit the original site, depriving the creators of the underlying engine of valuable advertising revenue and linking ability.

And even though the protocol restricts them, search sites seem likely to continue to respect it. "The last thing in the world we want is to develop a reputation on the Net as being noncompliant with accepted protocols," said Andy Bensky, a software developer at InfoSeek.

While the robot exclusion protocol works well for now, Web site developers should not be fooled into believing that it will keep all of their data safe from prying eyes. Following the protocol is purely voluntary, and any search engine can choose to ignore it.
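To make that voluntary compliance concrete, here is a hedged sketch of the check a well-behaved robot performs before fetching a page, written in Python using the urllib.robotparser module from that language's standard library, which implements this same protocol. The host name, robot name, and page path are placeholders; nothing in the Web's architecture stops a rogue robot from skipping this step entirely.

    from urllib.robotparser import RobotFileParser

    # A polite robot fetches the site's robots.txt first...
    parser = RobotFileParser()
    parser.set_url("http://www.example.com/robots.txt")  # placeholder host
    parser.read()

    # ...and asks permission before requesting each page.
    page = "http://www.example.com/under-construction/index.html"
    if parser.can_fetch("ExampleBot", page):
        print("Allowed: the site does not exclude this robot from that path.")
    else:
        print("Excluded: a compliant robot simply skips this URL.")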

"While we won't bother you, if the FBI is trailing a serial killer, or the CIA wants some information on you, they're not going to worry too much about robot exclusion protocols," said Open Text's Bray.

Reprinted from Web Week, Volume 1, Issue 7, November 1995 © Mecklermedia Corp. All rights reserved.

http://www.iworld.com