Web scraping etiquette

Posted by Ash on Stack Overflow
Published on 2010-01-07T16:56:37Z Indexed on 2010/05/25 12:51 UTC

I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this.

I've checked for alternatives (e.g. RSS, a web service) to get this information, but none are available at this stage.

That said, I've developed and maintained a few websites myself, so I realize that web scraping done naively or greedily can slow things down for other users and generally become a nuisance.

So, what etiquette is involved in terms of:

  1. Number of requests per second/minute/hour.
  2. HTTP User Agent content.
  3. HTTP Referer content.
  4. HTTP Cache settings.
  5. Buffer size for larger files/resources.
  6. Legalities and licensing issues.
  7. Good tools or design approaches to use.
  8. Robots.txt: is this relevant for web scraping, or just for crawlers/spiders?
  9. Compression such as GZip in requests.
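
To make a few of these items concrete (1, 2, 4, 8 and 9), here is a minimal Python sketch using only the standard library. It is an illustration under assumptions, not a recommended policy: the `USER_AGENT` string, the contact URL, and the names `Throttle`, `parse_robots` and `build_request` are all hypothetical, and a suitable delay between requests depends entirely on the site being scraped.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identifier: a name plus a way for the site owner to reach you.
USER_AGENT = "ExampleScraper/0.1 (+http://example.com/contact)"


class Throttle:
    """Enforce a minimum delay between successive requests (item 1)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds; the value is an assumption
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def parse_robots(robots_txt):
    """Build a robots.txt parser from the file's text (item 8)."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url, etag=None):
    """Prepare a request with an honest User-Agent (item 2), gzip support
    (item 9) and, if a previous ETag is known, a conditional header so the
    server can answer 304 Not Modified (item 4)."""
    headers = {"User-Agent": USER_AGENT, "Accept-Encoding": "gzip"}
    if etag:
        headers["If-None-Match"] = etag
    # Note: urllib does not decompress gzip bodies automatically; the caller
    # would need to check Content-Encoding and use the gzip module.
    return urllib.request.Request(url, headers=headers)
```

In use, the scraper would download the site's robots.txt once, call `parser.can_fetch(USER_AGENT, url)` before each fetch, and call `Throttle.wait()` between fetches. The sketch leaves the Referer header, buffer sizes and legal questions (items 3, 5 and 6) untouched.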

Update

Found this relevant question on Meta: Etiquette of Screen Scraping StackOverflow. Jeff Atwood's answer has some helpful recommendations.

Other related StackOverflow questions:

Options for html scraping

Legalities of screen scraping

© Stack Overflow or respective owner
