Detecting 'stealth' web-crawlers
- by Jacco
What options are there to detect web-crawlers that do not want to be detected?
(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)
I'm not talking about the nice crawlers such as googlebot and Yahoo! Slurp.
I consider a bot nice if it:
identifies itself as a bot in the user agent string
reads robots.txt (and obeys it)
I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.
There are some trapdoors that can be constructed (updated list, thanks Chris and gs):
Adding a directory only listed (marked as Disallow) in robots.txt (see the sketch right after this list),
Adding invisible links (possibly marked as rel="nofollow"?),
style="display: none;" on the link or a parent container
placed underneath another element with a higher z-index
detect who doesn't understand CaPiTaLiSaTioN,
detect who tries to post replies but always fails the Captcha.
detect GET requests to POST-only resources
detect the interval between requests (see the timing sketch after this list)
detect the order in which pages are requested
detect who (consistently) requests https resources over http
detect who does not request image files (this, in combination with a list of user agents of known image-capable browsers, works surprisingly well)
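For the robots.txt trapdoor, a minimal sketch could look like the following (Flask is just what I reach for; the trap path, the hidden link and the in-memory set of flagged IPs are placeholders, not a finished implementation). The disallowed directory is only reachable through an invisible link, so anything that requests it is either ignoring robots.txt or following links no human can see:

    # Sketch of a robots.txt / hidden-link trap. The path name and the
    # in-memory 'flagged_ips' set are placeholders for illustration.
    from flask import Flask, request

    app = Flask(__name__)
    flagged_ips = set()  # in practice: a database or shared cache

    ROBOTS_TXT = "User-agent: *\nDisallow: /secret-trap/\n"

    @app.route("/robots.txt")
    def robots():
        # You could also record which IPs fetched robots.txt here,
        # for the whitelist check further down.
        return ROBOTS_TXT, 200, {"Content-Type": "text/plain"}

    @app.route("/secret-trap/")
    def trap():
        # Anyone here either ignored robots.txt or followed an invisible link.
        flagged_ips.add(request.remote_addr)
        app.logger.warning("Trap hit by %s (UA: %s)", request.remote_addr,
                           request.headers.get("User-Agent", "-"))
        return "Nothing to see here.", 404

    @app.route("/")
    def index():
        # The trap link is present in the markup but hidden from humans.
        return ('<html><body>Welcome!'
                '<a href="/secret-trap/" rel="nofollow" '
                'style="display: none;">do not follow</a>'
                '</body></html>')

    if __name__ == "__main__":
        app.run()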
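For the timing heuristic you would normally work from the access log rather than inside the request handler. A rough sketch follows; the (ip, timestamp) input format and the cut-off values are assumptions you would have to tune for your own traffic. The idea is to group requests per IP and flag clients that are either implausibly fast or implausibly regular:

    # Rough log-analysis sketch; the (ip, unix_timestamp) input format and
    # the thresholds are assumptions to be tuned for your own traffic.
    from collections import defaultdict
    from statistics import mean, pstdev

    def suspicious_ips(requests, min_requests=20,
                       fast_cutoff=0.5, regularity_cutoff=0.05):
        """requests: iterable of (ip, unix_timestamp) tuples, in any order."""
        by_ip = defaultdict(list)
        for ip, ts in requests:
            by_ip[ip].append(ts)

        flagged = set()
        for ip, times in by_ip.items():
            if len(times) < min_requests:
                continue  # not enough data to judge this client
            times.sort()
            gaps = [b - a for a, b in zip(times, times[1:])]
            avg = mean(gaps)
            # Humans rarely sustain sub-second page loads for dozens of
            # pages, and they never click with machine-like regularity.
            if avg < fast_cutoff or pstdev(gaps) < regularity_cutoff * avg:
                flagged.add(ip)
        return flagged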
Some traps would be triggered by both 'good' and 'bad' bots.
You could combine those with a whitelist (a rough code sketch of this check follows the list):
It triggers a trap
It requests robots.txt
It does not trigger another trap because it obeys robots.txt
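Written out as code, that whitelist check might look roughly like this (the Client fields are names I made up; adapt them to whatever per-client state you actually keep):

    # Rough sketch of the whitelist logic above; the Client fields are
    # invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class Client:
        triggered_trap: bool        # hit any of the trapdoors
        requested_robots_txt: bool  # fetched /robots.txt at least once
        hit_disallowed_trap: bool   # entered a path robots.txt disallows

    def classify(client: Client) -> str:
        if not client.triggered_trap:
            return "probably human"
        if client.requested_robots_txt and not client.hit_disallowed_trap:
            # It tripped a trap, but it read robots.txt and stayed out of
            # the disallowed areas: treat it as a well-behaved bot.
            return "nice bot"
        return "stealth crawler"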
One other important thing here is:
Please consider blind people using screen readers: give people a way to contact you, or to solve a (non-image) Captcha to continue browsing.
What methods are there to automatically detect web crawlers trying to mask themselves as normal human visitors?
Update
The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler.
Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc.
I have no illusions: I won't be able to beat them.
You would, however, be surprised how stupid some crawlers are, with the best example of stupidity (in my opinion) being: casting all URLs to lower case before requesting them.
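That particular mistake is easy to catch: if the requested path does not exist, but does match a real URL once you ignore case, you are almost certainly looking at a bot. A small sketch (the known_paths set stands in for your real routing table or sitemap):

    # Sketch: flag requests whose path only matches a real URL after
    # case-folding. known_paths stands in for your routing table / sitemap.
    known_paths = {"/Articles/StealthCrawlers", "/About/Contact"}
    lowered = {p.lower(): p for p in known_paths}

    def looks_case_mangled(requested_path):
        if requested_path in known_paths:
            return False  # exact match: nothing suspicious
        return requested_path.lower() in lowered  # only matches when lowercased

    # e.g. looks_case_mangled("/articles/stealthcrawlers") -> True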
And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.