Robots.txt practices with .htaccess redirections (inherits)

Posted by Jayhal on Pro Webmasters
Published on 2011-08-02T14:34:54Z Indexed on 2012/10/04 9:51 UTC

Filed under:

web-crawlers

I have a question regarding how to write robots.txt files for many domains and subdomains with redirects in place.

We have a hosting account with a primary domain and several add-on domains. All of our domains and subdomains, including the primary domain, are redirected via .htaccess 301s to their own subdirectories in the primary domain's root directory.

I'm confused about how to write the robots.txt for certain directories. First, I want to confirm that I understand this correctly: for domains and subdomains, crawlers look for the crawling rules (robots.txt) in the directory that acts as that URL's root directory. Also, a directory is not affected by a robots.txt in its parent directory if the directory has its own domain/subdomain and that URL is the one crawlers access. (I'm pretty sure, but I want to confirm I don't have a fundamentally flawed understanding of robots.txt.)
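To make that first point concrete, here is a minimal Python sketch of how a crawler derives the robots.txt location from a page URL: it keeps only the scheme and hostname and asks for /robots.txt at the root of that host. The function name `robots_url` and the example.com hostnames are made up for illustration; how the server maps that request to a directory on disk is a separate matter that this sketch does not cover.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would request for page_url.

    Hypothetical helper for illustration: the path, query and fragment
    of the page are dropped, so the file is always looked up at the
    root of the host that appears in the URL.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Assumed example hostnames, not taken from the real account.
    for url in (
        "http://example.com/shop/item.html",
        "http://blog.example.com/2011/08/post.html",
    ):
        print(url, "->", robots_url(url))
```

Each hostname gets its own lookup, so a subdomain never reads the robots.txt of a different hostname, even when both are stored under the same account on disk.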

In the original root directory on the account (where the primary domain pointed before the .htaccess was put in place), what should the robots.txt contain? When crawlers crawl our primary domain, will they look in the original root directory for the robots.txt, or will they use the file in the new subdirectory where all the primary domain's site files are located? If they use the file in the subdirectory, what should the root's robots.txt include, if anything at all?

Would I be right to include a simple 'disallow: /' for all agents in the root's robots.txt, and then put more specific robots.txt files with more specific instructions in each subdirectory? Would that affect the crawling of the directory where the primary domain is now redirected?
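As a rough way to check what each candidate file would do, the sketch below feeds two rule sets to Python's standard `urllib.robotparser`. The contents of `BLOCK_ALL` and `SITE_RULES`, the `/private/` path, the `AnyBot` agent name and the example.com URLs are all assumptions for illustration, not our real files. It only shows how a crawler would interpret whichever file it receives at /robots.txt for a given host; it does not tell us which file our redirects end up serving there.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a blanket "block everything" file.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Assumed contents of a more specific per-site file.
SITE_RULES = "User-agent: *\nDisallow: /private/\n"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/index.html"
    private = "http://example.com/private/report.html"
    print("block-all, page:    ", is_allowed(BLOCK_ALL, "AnyBot", page))
    print("site rules, page:   ", is_allowed(SITE_RULES, "AnyBot", page))
    print("site rules, private:", is_allowed(SITE_RULES, "AnyBot", private))
```

If the blanket file were the one returned for a hostname, every URL on that hostname would be off limits to compliant crawlers, which is the risk I am asking about.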

Any help is greatly appreciated, Thanks!
