What are the best measures to protect content from being crawled?
Posted by Moak on Stack Overflow, 2011-02-08T07:13:34Z
I've been crawling a lot of websites for content recently, and I'm surprised that no site so far has put up much resistance. Ideally, the site I'm working on shouldn't be this easy to harvest. So I'm wondering: what are the best methods to stop bots from harvesting your web content? Obvious solutions:
- robots.txt (yeah, right)
- IP blacklists
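For completeness, here is a minimal robots.txt sketch. It is purely advisory: well-behaved crawlers honor it, while harvesters typically ignore it, which is the reason for the skepticism above. The paths are illustrative only.

```text
# Advisory only: polite crawlers obey this file, scrapers usually don't.
User-agent: *
Disallow: /private/
# Crawl-delay is nonstandard but honored by some crawlers.
Crawl-delay: 10
```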
What can be done to catch bot activity? What can be done to make data extraction difficult? What can be done to give them crap data?
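One common way to catch bot activity is a per-IP sliding-window rate check: humans rarely fetch dozens of pages per second from one address, so a client that exceeds a request threshold within a short window is a candidate for blacklisting or for being served decoy data. The sketch below is a hypothetical, minimal in-memory version; the class name, thresholds, and usage are all illustrative assumptions, not a specific site's implementation.

```python
import time
from collections import defaultdict, deque

class RateDetector:
    """Flag any client IP exceeding max_requests within window_seconds.

    Illustrative sketch: a production setup would persist counters
    (e.g. in Redis) and pair this with blacklisting or decoy responses.
    """

    def __init__(self, max_requests=30, window_seconds=10.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def is_bot(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window_seconds:
            q.popleft()
        return len(q) > self.max_requests

# Usage: six requests at the same instant against a 5-per-second limit;
# only the sixth crosses the threshold.
detector = RateDetector(max_requests=5, window_seconds=1.0)
flags = [detector.is_bot("10.0.0.1", now=100.0) for _ in range(6)]
print(flags)  # [False, False, False, False, False, True]
```

The same hook can serve the "crap data" idea: once an IP is flagged, return plausible-looking but corrupted content instead of a hard block, so the scraper doesn't immediately notice it has been caught.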
Just looking for ideas; there's no right or wrong answer.
© Stack Overflow or respective owner