wget not respecting my robots.txt. Is there an interceptor?

Posted by Jane Wilkie on Pro Webmasters
Published on 2011-06-29T17:55:40Z Indexed on 2011/06/30 0:30 UTC

Filed under: robots.txt

I have a website where I post CSV files as a free service. Recently I have noticed that wget and libwww have been scraping it pretty hard, and I was wondering how to curb that, even if only a little.

I have implemented a robots.txt policy, posted below:

User-agent: wget
Disallow: /

User-agent: libwww
Disallow: /

User-agent: *
Disallow: /
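
As a sanity check on the policy itself, the rules above can be evaluated with Python's standard `urllib.robotparser`. This is only a sketch: the helper name `allowed` is made up for illustration, and the URL is the placeholder from this post. It shows what a client that chooses to consult robots.txt would conclude; robots.txt is advisory, so nothing here is enforced by the server.

```python
from urllib.robotparser import RobotFileParser

# The policy from the post, verbatim.
POLICY = """\
User-agent: wget
Disallow: /

User-agent: libwww
Disallow: /

User-agent: *
Disallow: /
"""


def allowed(agent, url, policy=POLICY):
    """Return True if `agent` may fetch `url` under `policy`.

    `allowed` is a hypothetical helper for this example only.
    """
    parser = RobotFileParser()
    parser.parse(policy.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("wget", "libwww", "SomeOtherBot"):
        print(agent, allowed(agent, "http://myserver.com/file.csv"))
```

A client that never performs this check is unaffected by the file, which is why a policy like this cannot act as flood control on its own.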

Issuing a wget from my totally independent Ubuntu box shows that the robots.txt policy just doesn't seem to have any effect on wget against my server, like so:

http://myserver.com/file.csv

Anyway, I don't mind people just grabbing the info; I just want to implement some sort of flood control, like a wrapper or an interceptor.
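
One common shape for flood control of this kind is a per-client token bucket: each client gets a small burst allowance that refills at a fixed rate, and requests beyond it are refused. The following is a minimal sketch in Python under stated assumptions. The class name `FloodControl`, its parameters, and the idea of keying on a client identifier such as an IP address are all hypothetical, and it is not tied to any particular web server or framework.

```python
import time


class FloodControl:
    """Per-client token bucket (hypothetical helper, illustration only)."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate      # tokens refilled per second
        self.burst = burst    # maximum tokens a client can hold
        self.clock = clock    # injectable clock, handy for testing
        self.buckets = {}     # client id -> (tokens, time of last update)

    def allow(self, client):
        """Return True if `client` may make a request right now."""
        now = self.clock()
        tokens, last = self.buckets.get(client, (self.burst, now))
        # Refill according to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[client] = (tokens - 1, now)
            return True
        self.buckets[client] = (tokens, now)
        return False


if __name__ == "__main__":
    limiter = FloodControl(rate=1.0, burst=2)
    # Three back-to-back requests from the same (made-up) address.
    for _ in range(3):
        print(limiter.allow("203.0.113.7"))
```

A wrapper or interceptor in front of the CSV downloads would call `allow()` with the client's address and refuse or delay the request when it returns False. The in-memory dictionary is the simplest possible store and would need eviction of idle clients in a long-running process.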

Does anyone have any thoughts about this, or could anyone point me in the direction of a resource? I realize that it might not even be possible. I'm just after some ideas.

Janie
