The Web Spider Traps site hosts one of the most comprehensive studies of robot behavior patterns I have seen. Using craftily constructed robots.txt files, .htaccess files, detection scripts, and a variety of other techniques, the site has detected, documented, and found guilty numerous spiders that violate robots.txt and the robots meta element of your web pages.
For instance, Googlebot was shown to follow its orders in robots.txt except for files of type pdf, tar, and zip. The trap has also caught www.dir.com (Pompos), Gigabot, ia_archiver, and Yahoo! Slurp red-handed, to name a few.
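To make that kind of test concrete, here is a minimal robots.txt sketch of the bait rules such a trap relies on. The paths are hypothetical examples of mine, not the site's actual rules, and the pattern form with * and $ is a nonstandard extension honored by some crawlers (Google and MSN among them) rather than part of the original robots.txt protocol.

User-agent: *
# Hidden directory holding the bait pdf, tar and zip files
Disallow: /trap/
# Nonstandard wildcard syntax that only some engines accept
Disallow: /*.pdf$

A well-behaved spider never requests anything under /trap/, so any hit on those files goes straight into the evidence log.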
Spiders that do not follow robots.txt rules or do not limit bandwidth usage include WebCrawler, Ask Jeeves, MSNbot/0.1, msnbot/0.11, and several others.
Not all bots involved in this study are of the garden-variety search engine kind. Other traps aimed at "spam harvesters", "email grabbers", "email collectors", and "spambots" are easy to understand and quite easy to build, but since not every spider is up to no good, why block them all, even if they consume bandwidth and sometimes slow down or overload a site?
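A common .htaccess approach to that selective blocking looks something like the sketch below. This is my own illustration, not the site's exact rules; the user-agent names are examples of the sort that appear on widely circulated harvester blocklists, and legitimate crawlers are left alone.

RewriteEngine On
# Refuse known e-mail harvester user agents with a 403, case-insensitively
RewriteCond %{HTTP_USER_AGENT} (EmailCollector|EmailSiphon|EmailWolf|ExtractorPro) [NC]
RewriteRule .* - [F]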
The site provides a fairly comprehensive list of links to known lists and databases of robots, user-agent/browser strings, search engine robots, IP addresses, and e-mail collectors.
There is also plenty of information on how you too can build your own effective spider trap. Wow, this could be a fun hobby.
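For anyone curious what the detection side might look like, here is a minimal sketch of my own (not the site's actual code): a CGI script served from a URL that robots.txt forbids and that is linked only invisibly, so anything requesting it has already broken the rules. The log path and page text are hypothetical.

#!/usr/bin/env python3
import datetime
import os

LOG_FILE = "/var/log/spider-trap.log"   # hypothetical log location

def log_offender():
    # CGI servers expose the caller's address and user agent as
    # environment variables; record them with a UTC timestamp.
    entry = "%s\t%s\t%s\n" % (
        datetime.datetime.utcnow().isoformat(),
        os.environ.get("REMOTE_ADDR", "unknown"),
        os.environ.get("HTTP_USER_AGENT", "unknown"),
    )
    with open(LOG_FILE, "a") as log:
        log.write(entry)

if __name__ == "__main__":
    log_offender()
    # Serve a bland page so the intruder gets nothing worth indexing.
    print("Content-Type: text/plain")
    print()
    print("Nothing to see here.")

Review the log now and then and you have your own list of convicted spiders.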