Web scraping etiquette
- by Ash
I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this.
I've checked for alternatives (e.g. RSS, a web service) to get this information, but none are available at this stage.
I've also developed and maintained a few websites myself, so I realize that if web scraping is done naively or greedily it can slow things down for other users and generally become a nuisance.
So, what etiquette is involved in terms of:
Number of requests per second/minute/hour.
HTTP User Agent content.
HTTP Referer content.
HTTP Cache settings.
Buffer size for larger files/resources.
Legalities and licensing issues.
Good tools or design approaches to use.
Robots.txt: is this relevant for web scraping, or just for crawlers/spiders?
Compression such as gzip in requests.
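As a rough illustration of how several of the items above (request rate, User Agent, robots.txt, gzip) might fit together, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the user agent string, the contact address, and the helper names `allowed_by_robots`, `build_request` and `RateLimiter` are hypothetical, and the delay is whatever the caller passes in, not a recommended value.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identity string; a real one should name the bot and a contact.
USER_AGENT = "ExampleScraper/0.1 (+mailto:me@example.com)"


def allowed_by_robots(robots_txt, url, user_agent=USER_AGENT):
    """Return True if the given robots.txt text permits fetching url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_request(url, user_agent=USER_AGENT):
    """Build a request that identifies the scraper and accepts gzip."""
    return urllib.request.Request(url, headers={
        "User-Agent": user_agent,
        "Accept-Encoding": "gzip",
    })


class RateLimiter:
    """Enforce a minimum delay (in seconds) between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

The intended use would be to call `wait()` before each `urllib.request.urlopen(build_request(url))`, and only for URLs that `allowed_by_robots` accepts. Note that `urllib` does not decompress gzip responses on its own, so a body returned with `Content-Encoding: gzip` would still need `gzip.decompress`.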
Update
Found this relevant question on Meta: Etiquette of Screen Scraping StackOverflow. Jeff Atwood's answer has some helpful recommendations.
Other related StackOverflow questions:
Options for html scraping
Legalities of screen scraping