Legality, terms of service for performing a web crawl

Posted by Berlin Brown on Stack Overflow
Published on 2010-01-12T02:32:45Z Indexed on 2010/04/25 4:23 UTC

Filed under: crawling | web-crawler | spider | ethics | legal

I was going to crawl a site for some research I am doing, but the site's terms of service are quite clear on the topic. Is it illegal to not follow the terms of service? And what can the site normally do about it?

Here is an example clause from the TOS. Also, what about sites that don't include this particular clause?

Restrictions: "use any robot, spider, site search application, or other automated device, process or means to access, retrieve, scrape, or index the site"

Does it make a difference that it is just for research?

Edit: OK, from the standpoint of designing an efficient crawler: should I provide some form of natural language engine to read the terms of service and then abide by them?
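
To make the idea in the edit concrete, here is a minimal sketch of what a first pass might look like. It is far short of a natural language engine: it only flags sentences in a terms-of-service text that mention automated access, so that a person can read them before the crawl starts. The function names and the keyword list are my own assumptions (the keywords are taken from the example clause above, plus "crawl"), and a keyword match says nothing about whether a clause is enforceable.

```python
import re

# Keyword stems drawn from the example clause quoted above, plus "crawl".
# This list is an illustrative assumption, not a complete vocabulary.
AUTOMATION_PATTERN = re.compile(
    r"\b(?:robot|spider|scrap(?:e|ing)|crawl|automated)",
    re.IGNORECASE,
)


def find_automation_clauses(tos_text):
    """Return the sentences in tos_text that mention automated access."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", tos_text)
    return [s.strip() for s in sentences if AUTOMATION_PATTERN.search(s)]


def needs_human_review(tos_text):
    """True if any sentence mentions automated access and should be read by a person."""
    return bool(find_automation_clauses(tos_text))


if __name__ == "__main__":
    sample = (
        "You may not use any robot, spider, site search application, or other "
        "automated device, process or means to access, retrieve, scrape, or "
        "index the site. Payments are due monthly."
    )
    for clause in find_automation_clauses(sample):
        print("Flagged:", clause)
```

A crawler could call `needs_human_review` on the text of a site's terms page and skip or queue the site when it returns true. Treating a match as "stop and ask a human" rather than as a verdict keeps the crawler conservative.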

© Stack Overflow or respective owner
