Detecting 'stealth' web-crawlers
Posted by Jacco on Stack Overflow
Published on 2008-10-24T11:46:52Z
What options are there to detect web-crawlers that do not want to be detected?
(I know that listing detection techniques will allow the smart stealth-crawler programmer to make a better spider, but I do not think that we will ever be able to block smart stealth-crawlers anyway, only the ones that make mistakes.)
I'm not talking about the nice crawlers such as Googlebot and Yahoo! Slurp. I consider a bot nice if it:
- identifies itself as a bot in the user agent string
- reads robots.txt (and obeys it)
I'm talking about the bad crawlers, hiding behind common user agents, using my bandwidth and never giving me anything in return.
There are some trapdoors that can be constructed. Updated list (thanks Chris, gs):
- Adding a directory that is listed only in robots.txt (marked as disallow)
- Adding invisible links (possibly marked as rel="nofollow"?):
  - style="display: none;" on the link or a parent container
  - placed underneath another element with a higher z-index
- detect who doesn't understand CaPiTaLiSaTioN
- detect who tries to post replies but always fails the Captcha
- detect GET requests to POST-only resources
- detect interval between requests
- detect order of pages requested
- detect who (consistently) requests https resources over http
- detect who does not request image files (this, combined with a list of user agents of known image-capable browsers, works surprisingly well)
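To make a few of these trapdoors concrete, here is a minimal sketch in Python that checks three of them per request: a hit on the disallowed directory, a GET to a POST-only resource, and a very short interval between requests. The names `TRAP_PREFIX`, `POST_ONLY`, `MIN_INTERVAL` and `TrapLog` are made up for illustration, and the threshold is an assumption you would have to tune for your own site.

```python
from collections import defaultdict

# Hypothetical values: adjust to your own site layout.
TRAP_PREFIX = "/trap/"            # directory listed only as Disallow in robots.txt
POST_ONLY = {"/comment/submit"}   # resources that should never receive a GET
MIN_INTERVAL = 0.5                # seconds; an assumed threshold, tune it


class TrapLog:
    """Collects the names of the traps each client has triggered."""

    def __init__(self):
        self.hits = defaultdict(set)   # client id -> set of trap names
        self.last_seen = {}            # client id -> time of previous request

    def observe(self, client, method, path, timestamp):
        """Record one request and return the traps this client has hit so far."""
        if path.startswith(TRAP_PREFIX):
            self.hits[client].add("disallowed-directory")
        if method == "GET" and path in POST_ONLY:
            self.hits[client].add("get-on-post-only")
        previous = self.last_seen.get(client)
        if previous is not None and timestamp - previous < MIN_INTERVAL:
            self.hits[client].add("too-fast")
        self.last_seen[client] = timestamp
        return set(self.hits[client])
```

The client id could be an IP address, a session cookie, or a combination. A single hit proves little, so it is probably wiser to act on the accumulated set than on any one trap.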
Some traps would be triggered by both 'good' and 'bad' bots. You could combine those with a whitelist by checking whether a client:
- triggers a trap
- requests robots.txt
- does not trigger another trap because it obeyed robots.txt
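The whitelist idea could be sketched roughly like this in Python. The function name `classify`, its labels, and the trap names are hypothetical; the only logic taken from the list above is that a client which hit a trap, fetched robots.txt, and stayed out of every trap that robots.txt protects is probably a well-behaved bot.

```python
def classify(traps_hit, fetched_robots, robots_guarded_traps):
    """Classify one client from the traps it triggered.

    traps_hit            -- set of trap names the client has triggered
    fetched_robots       -- True if the client requested robots.txt
    robots_guarded_traps -- traps that a client obeying robots.txt cannot trigger
    """
    if not traps_hit:
        return "no-signal"          # nothing suspicious seen yet
    obeyed_robots = fetched_robots and not (traps_hit & robots_guarded_traps)
    if obeyed_robots:
        return "likely-good-bot"    # candidate for the whitelist
    return "suspect"
```

For example, a client that never loads images but did read robots.txt and never touched the disallowed directory would come out as `likely-good-bot`, while the same client without the robots.txt request would be a `suspect`.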
One other important thing here is:
Please consider blind people using screen readers: give people a way to contact you, or let them solve a (non-image) Captcha to continue browsing.
What methods are there to automatically detect web crawlers that try to mask themselves as normal human visitors?
Update
The question is not: How do I catch every crawler. The question is: How can I maximize the chance of detecting a crawler.
Some spiders are really good, and actually parse and understand HTML, XHTML, CSS, JavaScript, VBScript, etc.
I have no illusions: I won't be able to beat them.
You would, however, be surprised how stupid some crawlers are. The best example of stupidity (in my opinion) is casting all URLs to lower case before requesting them.
And then there is a whole bunch of crawlers that are just 'not good enough' to avoid the various trapdoors.
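The lower-casing mistake is one of the easier ones to check for. A minimal sketch in Python, assuming the site serves case-sensitive paths and that you have a set of the real paths at hand (`lowercased_url_trap` and `known_paths` are names made up for this example):

```python
def lowercased_url_trap(requested_path, known_paths):
    """Return True if the request looks like a lower-cased copy of a real URL.

    Fires only when the requested path does not exist itself but equals
    the lower-case form of a known path that contains upper-case letters.
    """
    if requested_path in known_paths:
        return False
    return any(
        path != path.lower() and path.lower() == requested_path
        for path in known_paths
    )
```

With `known_paths = {"/Articles/HowTo.html"}`, a request for `/articles/howto.html` would trip the trap, while the correctly cased URL would not. This only works if at least some of your URLs contain upper-case letters.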
© Stack Overflow or respective owner