The Sitemap Paradox
- by Jeff Atwood
We use a sitemap on Stack Overflow, but I have mixed feelings about it.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
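For reference, here is a minimal sketch of what such a file contains, built with Python's standard library. The `build_sitemap` helper and the example.com URLs are made up for illustration. The `urlset` namespace and the `loc`, `lastmod`, `changefreq`, and `priority` elements come from the Sitemap protocol itself.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(entries):
    """Return sitemap XML for an iterable of dicts.

    Each dict needs a 'loc' key; 'lastmod', 'changefreq', and
    'priority' are the optional metadata hints the protocol allows.
    (Hypothetical helper, not part of any library.)
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs, not real pages.
xml_text = build_sitemap([
    {"loc": "https://example.com/", "changefreq": "daily", "priority": 1.0},
    {"loc": "https://example.com/questions/1", "lastmod": "2010-01-01"},
])
print(xml_text)
```

Everything other than `loc` is a hint, which is consistent with the "does not guarantee" wording above.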
Based on our two years' experience with sitemaps, there's something fundamentally paradoxical about the sitemap:
Sitemaps are intended for sites that are hard to crawl properly.
If Google can't successfully crawl your site to find a link, but is able to find it in the sitemap, it gives the sitemap link no weight and will not index it!
That's the sitemap paradox -- if your site isn't being properly crawled (for whatever reason), using a sitemap will not help you!
Google goes out of its way to make no sitemap guarantees:
"We cannot make any predictions or guarantees about when or if your URLs will be crawled or added to our index" citation
"We don't guarantee that we'll crawl or index all of your URLs. For example, we won't crawl or index image URLs contained in your Sitemap." citation
"submitting a Sitemap doesn't guarantee that all pages of your site will be crawled or included in our search results" citation
Given that links found in sitemaps are merely recommendations, whereas links found on your own website proper are considered canonical ... it seems the only logical thing to do is avoid having a sitemap and make damn sure that Google and any other search engine can properly spider your site using the plain old standard web pages everyone else sees.
By the time you have done that, and are getting spidered nicely and thoroughly so Google can see that your own site links to these pages and is willing to crawl the links -- uh, why do we need a sitemap, again? The sitemap can be actively harmful, because it distracts you from ensuring that search engine spiders are able to successfully crawl your whole site. "Oh, it doesn't matter if the crawler can see it, we'll just slap those links in the sitemap!" Reality is quite the opposite in our experience.
That seems more than a little ironic considering sitemaps were intended for sites that have a very deep collection of links or complex UI that may be hard to spider. In our experience, the sitemap does not help, because if Google can't find the link on your site proper, it won't index it from the sitemap anyway. We've seen this proven time and time again with Stack Overflow questions.
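One way to act on this is to look for URLs that appear in the sitemap but cannot be reached by following ordinary links from the home page. The sketch below assumes you already have the site's internal link graph as a dictionary, for example from your own crawl. The function names and sample paths are hypothetical, and this is not a description of how Google actually crawls.

```python
from collections import deque

def reachable_from(start, links):
    """Breadth-first walk of the link graph, as a simple spider would do."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def sitemap_only_urls(start, links, sitemap_urls):
    """URLs listed in the sitemap that no chain of on-site links leads to."""
    crawlable = reachable_from(start, links)
    return sorted(set(sitemap_urls) - crawlable)

# Made-up link graph: page -> pages it links to.
links = {
    "/": ["/questions", "/tags"],
    "/questions": ["/questions/1", "/questions/2"],
    "/tags": ["/questions/1"],
}
sitemap_urls = ["/", "/questions/1", "/questions/2", "/questions/3"]

orphans = sitemap_only_urls("/", links, sitemap_urls)
print(orphans)
```

Any URL in `orphans` is one the sitemap is being asked to carry on its own, which is exactly the case where, in our experience, it doesn't help.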
Am I wrong? Do sitemaps make sense, and we're somehow just using them incorrectly?