Stop bots from crawling old links with extensions
Posted by Jared on Pro Webmasters
Published on 2012-03-31T18:05:44Z
I've recently switched to MVC3, which uses extension-less URLs, but Google and Bing have a wealth of links that they are still crawling which no longer exist.
So I'm trying to find out if there is a way to format robots.txt (or use some other method) to tell Google/Bing that any link ending in an extension isn't a valid link. Is this possible?
On pages that I'm concerned a user may have saved as a favorite, I'm displaying a 404 page that lists links to the new locations (I decided against just redirecting them, as I don't want to maintain those redirects forever). For Google's and Bing's sake I do have the canonical tag in the header.
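For context, the canonical tag on one of those old extension pages looks roughly like this (example.com and the paths are placeholders, not my real URLs):

<link rel="canonical" href="http://example.com/products/widget" />

i.e. the legacy /products/widget.aspx page points search engines at its new extension-less equivalent.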
User-agent: *
Allow: /
Disallow: /*.*
EDIT: I just added the third line (shown above) and it APPEARS to do what I'm wanting: allow a path, but disallow a file. Can anyone confirm this?
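If that pattern turns out to be too broad (it would also block anything else containing a dot, such as /images/logo.png), I'm considering a more targeted version that disallows only the old extensions. The .aspx and .html below are placeholders for whatever extensions the old URLs actually used, and as I understand it the trailing $ (end-of-URL anchor) is honoured by Googlebot and Bingbot but not necessarily by every crawler:

User-agent: *
Allow: /
Disallow: /*.aspx$
Disallow: /*.html$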