De-index URL parameters by value
- by Doug Firr
This question is lengthy, so allow me to provide a one-sentence summary: I need to get Google to de-index URLs that have parameters with certain values appended.
I have a website example.com with language translations.
There used to be many translations, but I deleted them all so that only the English (default) and French options remain.
When one selects a language option, a parameter is added to the URL. For example, the home page:
https://example.com (default)
https://example.com/main?l=fr_FR (French)
I added a robots.txt to stop Google from crawling any of the language translations:
# robots.txt generated at http://www.mcanerin.com
User-agent: *
Disallow:
Disallow: /cgi-bin/
Disallow: /*?l=
So any URL containing "?l=" should not be crawled. I checked this in GWT using the robots.txt testing tool, and it works.
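For anyone who wants to reproduce that check outside GWT: as far as I know, Python's built-in robots.txt parser does not understand Google-style wildcards, so below is a minimal sketch that simply mirrors the /*?l= rule with a regex. The sample URLs are the ones from this question; treat it as an approximation of the tester, not a substitute for it.

    import re
    from urllib.parse import urlsplit

    # Rough approximation of the Google-style wildcard rule "Disallow: /*?l="
    # ("*" matches any run of characters and the rule is a prefix match).
    BLOCK_PATTERN = re.compile(r"^/.*\?l=")

    def is_blocked(url: str) -> bool:
        parts = urlsplit(url)
        path_and_query = parts.path + ("?" + parts.query if parts.query else "")
        return bool(BLOCK_PATTERN.match(path_and_query))

    for url in (
        "https://example.com/",                                   # default English -> crawlable
        "https://example.com/main?l=fr_FR",                       # French -> blocked
        "https://example.com/reports/view/884?l=vi_VN&l=hy_AM",   # removed language -> blocked
    ):
        print(url, "blocked" if is_blocked(url) else "crawlable")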
But under HTML Improvements, the previously crawled language-translation URLs remain indexed. The internet says to return a 404 status for the removed URLs so that Google knows to de-index them.
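I do not know what my CMS would need for that, but as I understand the advice it amounts to something like the sketch below (purely hypothetical; Flask, the /main route, and the ALLOWED_LANGS set are stand-ins, not my actual setup):

    # Hypothetical sketch only -- it just illustrates what "return a 404
    # for the removed translations" would mean at the HTTP level.
    from flask import Flask, abort, request

    app = Flask(__name__)

    ALLOWED_LANGS = {"fr_FR"}  # French is the only translation left

    @app.route("/main")
    def main():
        lang = request.args.get("l")
        if lang is None:
            return "English (default) home page"
        if lang not in ALLOWED_LANGS:
            # Removed translation: answer with a real 404 status so that
            # Google eventually drops the URL from its index.
            abort(404)
        return "French home page"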
I checked to see what my CMS would throw up if I visited one of the URLs that should no longer exist.
This URL was listed in GWT under duplicate title tags (one of the reasons I want to scrub up my URLs):
https://example.com/reports/view/884?l=vi_VN&l=hy_AM
This URL should not exist; I removed those language translations. Yet the page loads when it should not! I played around and typed example.com?whatever123
It seems that parameters always load as long as everything before the question mark is a real URL.
So if Google has indexed all these parameterized URLs, how do I remove them? I cannot check whether a 404 is being returned, because the page always loads; it is the parameter itself that needs to be de-indexed.
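If it helps with an answer: this is roughly how I would check the status code the server actually sends, independently of whether a page renders in the browser. The URL is just the sample from above.

    import urllib.request
    from urllib.error import HTTPError

    # A page can render fine in the browser and still be served with a
    # 200 status; only the status code tells Google whether it is gone.
    url = "https://example.com/reports/view/884?l=vi_VN&l=hy_AM"

    try:
        with urllib.request.urlopen(url) as response:
            print(url, "->", response.status)   # 200 = still a "live" page
    except HTTPError as err:
        print(url, "->", err.code)              # 404 or 410 would signal removal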