How to protect SHTML pages from crawlers/spiders/scrapers?

Posted by Adam Lynch on Pro Webmasters
Published on 2011-05-13T19:40:40Z Indexed on 2012/04/10 17:46 UTC

Filed under: security | scraper-sites

I have A LOT of SHTML pages I want to protect from crawlers, spiders & scrapers.

I understand the limitations of SSIs, so feel free to suggest an implementation of the following idea using whatever additional technology or technologies you like:

If a client requests too many pages too quickly, it is added to a blacklist for 24 hours and shown a CAPTCHA instead of content on every page it requests. Entering the CAPTCHA correctly removes the client from the blacklist.
There is a whitelist so that Googlebot, etc. will never get blocked.
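
To make the idea concrete, here is a minimal in-memory sketch of the logic in Python. It is only an illustration, not an IIS-ready implementation: the class name `ScraperGuard`, the thresholds `MAX_REQUESTS` and `WINDOW_SECONDS`, and the idea of keying clients by a string such as an IP address are all assumptions of mine. A real deployment would keep this state in a DB and hook the check into whatever handles requests in front of the SHTML pages.

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds; "too many pages too fast" is not quantified above.
MAX_REQUESTS = 30                   # requests allowed per window
WINDOW_SECONDS = 10                 # length of the sliding window
BLACKLIST_SECONDS = 24 * 60 * 60    # 24-hour blacklist, as described


class ScraperGuard:
    """Decides per request whether to serve content or a CAPTCHA."""

    def __init__(self, whitelist=(), clock=time.time):
        self.whitelist = set(whitelist)   # clients that are never blocked
        self.clock = clock                # injectable clock, useful for testing
        self.hits = defaultdict(deque)    # client -> recent request timestamps
        self.blacklist = {}               # client -> blacklist expiry timestamp

    def check(self, client):
        """Return "content" or "captcha" for one page request."""
        if client in self.whitelist:
            return "content"

        now = self.clock()
        expiry = self.blacklist.get(client)
        if expiry is not None:
            if now < expiry:
                return "captcha"
            del self.blacklist[client]    # 24 hours have passed

        # Record this request and drop timestamps outside the window.
        window = self.hits[client]
        window.append(now)
        while window and window[0] <= now - WINDOW_SECONDS:
            window.popleft()

        if len(window) > MAX_REQUESTS:
            self.blacklist[client] = now + BLACKLIST_SECONDS
            window.clear()
            return "captcha"
        return "content"

    def captcha_solved(self, client):
        """Call after a correct CAPTCHA answer to unblock the client."""
        self.blacklist.pop(client, None)
        self.hits.pop(client, None)
```

Note that the whitelist here is just a set of client identifiers. Matching on the User-Agent header alone would let any scraper claim to be Googlebot, so the whitelist check would need something stronger, such as the reverse-plus-forward DNS lookup that Google documents for verifying its crawler.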

What is the best/easiest way to implement this idea?

Server = IIS

Cleaning the old tuples out of a DB every 24 hours is easily done, so there is no need to explain that part.

© Pro Webmasters or respective owner
