Web scraping etiquette
Posted by Ash on Stack Overflow
Published on 2010-01-07T16:56:37Z
Indexed on 2010/05/25 12:51 UTC
I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this.
I've checked for alternatives (e.g. RSS, a web service) to get this information, but none are available at this stage.
That said, I've developed and maintained a few websites myself, so I realize that naive or greedy web scraping can slow things down for other users and generally become a nuisance.
So, what etiquette is involved in terms of:
- Number of requests per second/minute/hour.
- HTTP User Agent content.
- HTTP Referer content.
- HTTP Cache settings.
- Buffer size for larger files/resources.
- Legalities and licensing issues.
- Good tools or design approaches to use.
- Robots.txt: is this relevant for web scraping, or only for crawlers/spiders?
- Compression such as gzip in requests.
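To make a few of these items concrete, here is a minimal sketch of where request rate, the User-Agent header, robots.txt, and gzip would show up in code. It assumes Python's standard library only. The `PoliteFetcher` class, the User-Agent string, and the 5-second delay are hypothetical placeholders for illustration, not recommended values.

```python
import gzip
import time
import urllib.request
import urllib.robotparser

# Hypothetical placeholders, not values taken from any answer or standard.
USER_AGENT = "example-scraper/0.1 (contact: admin@example.com)"
MIN_DELAY_SECONDS = 5.0


class PoliteFetcher:
    """Illustrative fetcher: honours robots.txt, throttles, identifies itself."""

    def __init__(self, robots_txt_lines, user_agent=USER_AGENT,
                 min_delay=MIN_DELAY_SECONDS,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.min_delay = min_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = None
        # Parse robots.txt rules that the caller has already downloaded.
        self._robots = urllib.robotparser.RobotFileParser()
        self._robots.parse(robots_txt_lines)

    def allowed(self, url):
        """True if the parsed robots.txt rules permit this URL for our agent."""
        return self._robots.can_fetch(self.user_agent, url)

    def wait_time(self):
        """Seconds still to wait before the next request may be sent."""
        if self._last_request is None:
            return 0.0
        elapsed = self._clock() - self._last_request
        return max(0.0, self.min_delay - elapsed)

    def build_request(self, url):
        """Build a request that identifies the scraper and accepts gzip."""
        return urllib.request.Request(url, headers={
            "User-Agent": self.user_agent,
            "Accept-Encoding": "gzip",
        })

    def fetch(self, url):
        """Fetch one URL, refusing disallowed paths and spacing out requests."""
        if not self.allowed(url):
            raise PermissionError("robots.txt disallows " + url)
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        self._last_request = self._clock()
        with urllib.request.urlopen(self.build_request(url), timeout=30) as resp:
            body = resp.read()
            # Decompress only if the server actually sent gzip.
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
        return body
```

Here the robots.txt lines are passed in rather than downloaded, so the policy checks can be tried without touching the network; `RobotFileParser` can also fetch the file itself via `set_url()` and `read()`.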
Update
Found this relevant question on Meta: Etiquette of Screen Scraping StackOverflow. Jeff Atwood's answer has some helpful recommendations.
© Stack Overflow or respective owner