How to get tens of millions of pages indexed by Googlebot?
- by Chris Adragna
We are developing a site that currently has 8 million unique pages, will grow to about 20 million right away, and will eventually reach about 50 million or more.
Before you criticize... Yes, it provides unique, useful content. We continually process raw data from public records, and through data scrubbing, entity rollups, and relationship mapping we generate quality content — a site that's quite useful and also unique, in part because of the breadth of the data.
Its PR is 0 (new domain, no links), and we're getting spidered at a rate of about 500 pages per day, putting us at about 30,000 pages indexed thus far. At this rate, even the 50 million pages we're heading toward would take roughly 275 years to index.
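For what it's worth, here is the back-of-the-envelope arithmetic behind that estimate, using nothing but the figures already quoted above:

```python
# Back-of-the-envelope estimate of time to index at the observed crawl rate.
# All figures are the ones quoted above.
pages_eventual = 50_000_000   # eventual page count ("or more")
pages_per_day = 500           # observed Googlebot spidering rate
pages_indexed = 30_000        # indexed so far

days_remaining = (pages_eventual - pages_indexed) / pages_per_day
print(f"{days_remaining:,.0f} days, or about {days_remaining / 365:.0f} years")
# -> 99,940 days, or about 274 years
```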
I have two questions:
Is the rate of indexing directly correlated with PR? By that I mean: is it correlated strongly enough that purchasing an old domain with good PR would get us to a workable indexing rate (in the neighborhood of 100,000 pages per day)?
Are there any SEO consultants who specialize in aiding the indexing process itself? We're otherwise doing very well with SEO, on-page SEO especially. Besides, the competition for our "long-tail" keyword phrases is pretty low, so our success hinges mostly on the number of pages indexed.
Our main competitor has achieved approximately 20 million pages indexed in just over a year, along with an Alexa ranking of around 2,000.
Noteworthy qualities we have in place:
- Page download speed is pretty good (250-500 ms).
- No errors (no 404 or 500 errors when getting spidered).
- We use Google Webmaster Tools and log in daily.
- Friendly URLs are in place.
- I'm afraid to submit sitemaps. Some SEO community postings suggest that a new site with millions of pages and no PR looks suspicious. There is also a Google video of Matt Cutts describing a staged on-boarding of large sites to avoid increased scrutiny (at about 2:30 in the video). A sketch of a staged sitemap index is below this list.
- Clickable site links reach every page, no more than four pages deep, and typically no more than 250(-ish) internal links on a page (see the fan-out arithmetic after this list).
- Anchor text for internal links is logical and adds relevance hierarchically to the data on the detail pages.
- We had previously set the crawl rate to the highest setting in Webmaster Tools (only about one page every two seconds, max). I recently turned it back to "let Google decide," which is what is advised (the ceiling that setting implies is also worked out after the list).
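On the sitemap point above: if we do end up submitting them in stages, something along these lines is what I have in mind. It is only a sketch — the base URL, output directory, and URL source are placeholders — but the 50,000-URLs-per-file cap and the index file structure come from the sitemaps.org protocol.

```python
# Minimal sketch of a staged sitemap build: write gzipped sitemap files
# (at most 50,000 URLs each, per the sitemaps.org protocol) plus a sitemap
# index that references them. BASE_URL, OUT_DIR, and the URL source are
# hypothetical placeholders.
import gzip
import os
from itertools import islice
from xml.sax.saxutils import escape

BASE_URL = "https://www.example.com"   # placeholder site root
OUT_DIR = "sitemaps"                   # placeholder output directory
URLS_PER_FILE = 50_000                 # sitemaps.org limit per sitemap file

def write_staged_sitemaps(urls, limit):
    """Write sitemaps for only the first `limit` URLs (staged on-boarding),
    then write an index file pointing at the sitemap files produced."""
    os.makedirs(OUT_DIR, exist_ok=True)
    urls = islice(iter(urls), limit)
    part_names = []
    while True:
        chunk = list(islice(urls, URLS_PER_FILE))
        if not chunk:
            break
        name = f"sitemap-{len(part_names) + 1}.xml.gz"
        with gzip.open(os.path.join(OUT_DIR, name), "wt", encoding="utf-8") as f:
            f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for url in chunk:
                f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
            f.write('</urlset>\n')
        part_names.append(name)

    with open(os.path.join(OUT_DIR, "sitemap-index.xml"), "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for name in part_names:
            f.write(f"  <sitemap><loc>{BASE_URL}/{OUT_DIR}/{name}</loc></sitemap>\n")
        f.write('</sitemapindex>\n')

# Example: expose only the first 500,000 URLs in the first wave.
# write_staged_sitemaps(all_page_urls, limit=500_000)
```

The idea is simply to cap how many URLs are exposed per wave and raise the limit as more of the site gets indexed.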
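And here are the two quick numbers referenced in the list, assuming the home page counts as level one and the detail pages sit at level four:

```python
# Quick checks on the link structure and the old crawl-rate setting.
# Assumes the home page is level 1 and detail pages sit at level 4.
links_per_page = 250
reachable_detail_pages = links_per_page ** 3   # fan-out across levels 2-4
print(f"{reachable_detail_pages:,} detail pages reachable")   # 15,625,000

# "About a page every two seconds, max" works out to:
pages_per_day_ceiling = 24 * 60 * 60 // 2
print(f"{pages_per_day_ceiling:,} pages/day at most")          # 43,200
```

So under those assumptions the four-deep structure tops out around 15.6 million detail pages, and even the old maximum crawl setting caps out at 43,200 pages per day — well under the roughly 100,000 per day we'd want.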