Why do Google search results include pages disallowed in robots.txt?
Posted by Ilmari Karonen on Pro Webmasters
Published on 2012-01-15T00:29:24Z
Indexed on 2012/09/04 3:49 UTC
google-search | robots.txt
I have some pages on my site that I want to keep search engines away from, so I disallowed them in my robots.txt file like this:
User-Agent: *
Disallow: /email
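For reference, here is a minimal sketch of what that rule tells a compliant crawler, checked with Python's standard urllib.robotparser module. The sample paths and the "Googlebot" user agent string are assumptions chosen only for illustration; the rule itself is the one shown above.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rule from the question.
RULES = """\
User-Agent: *
Disallow: /email
"""


def is_crawl_allowed(path, user_agent="Googlebot"):
    """Return True if a compliant crawler may fetch `path` under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical paths, for illustration only.
print(is_crawl_allowed("/email/abc123"))  # False: matches the /email prefix
print(is_crawl_allowed("/index.html"))    # True: no rule applies
```

Note that this only checks whether a URL may be fetched; it says nothing about whether a search engine may list the URL in its results.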
Yet I recently noticed that Google still sometimes returns links to those pages in their search results. Why does this happen, and how can I stop it?
Background:
Several years ago, I made a simple web site for a club a relative of mine was involved in. They wanted to have e-mail links on their pages, so, to try and keep those e-mail addresses from ending up on too many spam lists, instead of using direct mailto: links I made those links point to a simple redirector / address harvester trap script running on my own site. This script would return either a 301 redirect to the actual mailto: URL, or, if it detected a suspicious access pattern, a page containing lots of random fake e-mail addresses and links to more such pages. To keep legitimate search bots away from the trap, I set up the robots.txt rule shown above, disallowing the entire space of both legit redirector links and trap pages.
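To make the setup concrete, the redirector's behaviour can be sketched roughly as follows. This is not the original script: the function names, the /email/ prefix used for trap links, the .example addresses, and the boolean suspicious flag (standing in for the access-pattern detection, which is not described here) are all assumptions.

```python
import random
import string


def fake_address(rng):
    """Build a random, non-existent address under the reserved .example TLD."""
    user = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    host = "".join(rng.choice(string.ascii_lowercase) for _ in range(6))
    return f"{user}@{host}.example"


def handle_request(real_address, suspicious, rng=None):
    """Return (status, headers, body) for one hit on the redirector.

    `suspicious` is a placeholder for whatever access-pattern check
    the real script performs; that logic is not reproduced here.
    """
    rng = rng or random.Random()
    if not suspicious:
        # Normal visitor: permanent redirect to the real mailto: URL.
        return 301, {"Location": f"mailto:{real_address}"}, ""

    # Suspected harvester: serve fake addresses plus links to more trap pages.
    lines = []
    for _ in range(20):
        addr = fake_address(rng)
        lines.append(f'<a href="mailto:{addr}">{addr}</a>')
    for _ in range(5):
        token = "".join(
            rng.choice(string.ascii_lowercase + string.digits) for _ in range(10)
        )
        # Hypothetical trap URL layout under the disallowed /email space.
        lines.append(f'<a href="/email/{token}">more</a>')
    body = "<html><body>\n" + "<br>\n".join(lines) + "\n</body></html>"
    return 200, {"Content-Type": "text/html"}, body
```

Calling handle_request("someone@example.org", suspicious=False) yields the 301 redirect, while suspicious=True yields an HTML page of decoy addresses and further trap links.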
Just recently, however, one of the people in the club searched Google for their own name and was quite surprised when one of the results on the first page was a link to the redirector script, with a title consisting of their e-mail address followed by my name. Of course, they immediately e-mailed me and wanted to know how to get their address out of Google's index. I was quite surprised too, since I had no idea that Google would index such URLs at all, seemingly in violation of my robots.txt rule.
I did manage to submit a removal request to Google, and it seems to have worked, but I'd like to know why and how Google is circumventing my robots.txt like that and how to make sure that none of the disallowed pages will show up in their search results.
P.S. While preparing this question, I actually found a possible explanation and solution, which I'll post below, but I thought I'd ask anyway in case someone else has the same problem. Please do feel free to post your own answers. I'd also be interested in knowing whether other search engines do this too, and whether the same solutions work for them.
© Pro Webmasters or respective owner