What are the best measures to protect content from being crawled?
- by Moak
I've been crawling a lot of websites for content recently, and I'm surprised that no site so far has been able to put up much resistance. Ideally, the site I'm working on should not be so easy to harvest. So I'm wondering: what are the best methods to stop bots from harvesting your web content?
Obvious solutions:
robots.txt (yeah, right)
IP blacklists
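For context on the IP-blacklist angle: static blacklists haven't slowed me down, but something dynamic might. As a strawman to react to, here's a naive sliding-window rate check I could imagine a site running per IP (all names and thresholds here are made up for illustration):

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: flag an IP that makes more than
# MAX_REQUESTS requests within WINDOW_SECONDS.
MAX_REQUESTS = 30
WINDOW_SECONDS = 10

_hits = defaultdict(deque)  # ip -> timestamps of recent requests


def looks_like_bot(ip, now=None):
    """Record a request from `ip` and return True if it exceeds the rate limit."""
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have fallen outside the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS
```

Of course, a crawler that sleeps between requests or rotates IPs sails right past this, which is partly why I'm asking.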
What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them crap data?
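One idea along the "catch them and feed them crap" line that I'd like opinions on is a honeypot: serve a link that humans never see (hidden via CSS, disallowed in robots.txt), and flag whatever fetches it. A rough sketch of what I mean, with hypothetical names and responses:

```python
# Hypothetical honeypot trap: TRAP_PATH is linked from every page
# but hidden from humans via CSS and disallowed in robots.txt.
# Anything that requests it is almost certainly a bot.
TRAP_PATH = "/do-not-follow"

flagged_ips = set()


def handle_request(path, ip):
    """Toy request handler: flag IPs that hit the trap, poison their content."""
    if path == TRAP_PATH:
        flagged_ips.add(ip)
        return "418 I'm a teapot"
    if ip in flagged_ips:
        # Serve garbage data instead of the real content.
        return "200 OK (poisoned content)"
    return "200 OK"
```

Poisoning flagged clients instead of blocking them seems appealing, since the bot operator gets no clear signal that they've been caught. But does this actually work in practice?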
Just looking for ideas; there's no single right or wrong answer.