Web scraping etiquette

Posted by Ash on Stack Overflow
Published on 2010-01-07T16:56:37Z

Filed under: html | best-practices | web-development | screen-scraping | etiquette

I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this.

I've checked for alternative ways to get this information (e.g. RSS, a web service), but none are available at this stage.

That said, I've developed and maintained a few websites myself, so I realize that if web scraping is done naively or greedily it can slow things down for other users and generally become a nuisance.

So, what etiquette is involved in terms of:

Number of requests per second/minute/hour.
HTTP User Agent content.
HTTP Referer content.
HTTP Cache settings.
Buffer size for larger files/resources.
Legalities and licensing issues.
Good tools or design approaches to use.
Robots.txt: is this relevant for web scraping, or just for crawlers/spiders?
Compression such as GZip in requests.
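
To make several of the points above concrete, here is a rough sketch of the kind of "polite" client I have in mind, using only the Python standard library. It is not a recommendation of specific values: the user agent string, the contact URL, the class and function names, and the delay are all hypothetical placeholders I made up for illustration. It covers three items from the list: spacing out requests, sending an identifying User-Agent (and asking for gzip), and checking robots.txt rules before fetching.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identifying string; a real scraper would name itself and
# give the site owner a way to get in touch.
USER_AGENT = "ExampleScraper/0.1 (+http://example.com/contact)"


class PoliteThrottle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def allowed_by_robots(robots_lines, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt lines permit fetching url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


def build_request(url):
    """Build a request that identifies the scraper and accepts gzip.

    Note: urllib does not decompress gzip responses automatically, so the
    caller would need to check Content-Encoding and decompress the body.
    """
    return urllib.request.Request(
        url,
        headers={"User-Agent": USER_AGENT, "Accept-Encoding": "gzip"},
    )
```

In use, the idea would be to call `allowed_by_robots(...)` once per URL, then `throttle.wait()` before each `urllib.request.urlopen(build_request(url))`. What a reasonable `min_interval` is for a given site is exactly the part I'm unsure about.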

Update

Found this relevant question on Meta: Etiquette of Screen Scraping StackOverflow. Jeff Atwood's answer has some helpful recommendations.

Other related StackOverflow questions:

Options for html scraping

Legalities of screen scraping

© Stack Overflow or respective owner
