Does the Google crawler really guess URL patterns and index pages that were never linked to?
- by Dominik
I'm experiencing problems with indexed pages which were (probably) never linked to. Here's the setup:
1. Data-Server: Application with a RESTful interface which provides the data
2. Website A: Provides the data of (1) at http://website-a.example.com/?id=RESOURCE_ID
3. Website B: Provides the data of (1) at http://website-b.example.com/?id=OTHER_RESOURCE_ID
So all of the non-private data is stored on (1), and websites (2) and (3) fetch and display it as their own representation of the data, with additional cross-linking between them.
In fact, the URL /?id=1 of website-a points to the same resource as /?id=1 of website-b. However, the resource with id 1 is useless on website-b. Unfortunately, the Google index for website-b now contains several links to resources belonging to website-a, and vice versa.
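To make the situation concrete, here is a minimal sketch of how both websites currently answer a request. The names `DATA_SERVER`, `owner` and `handle_request` are made up for illustration and are not the real application code; the only point is that neither site checks whether a resource is meant for it.

```python
# Hypothetical stand-in for the data server (1): resource id -> resource.
# The "owner" field is an assumption used only to show which site a
# resource is meant for; the real data model may look different.
DATA_SERVER = {
    1: {"owner": "website-a", "title": "Resource for A"},
    2: {"owner": "website-b", "title": "Resource for B"},
}


def handle_request(site, resource_id):
    """Mimic how either website answers /?id=RESOURCE_ID at the moment."""
    resource = DATA_SERVER.get(resource_id)
    if resource is None:
        return 404, "Not found"
    # No check that resource["owner"] == site, so both sites answer 200.
    return 200, f"{site}: {resource['title']}"


status_a, _ = handle_request("website-a", 1)
status_b, _ = handle_request("website-b", 1)
print(status_a, status_b)
```

So any crawler that somehow ends up with /?id=1 on website-b gets a valid page back, no matter how it came across that URL.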
I "heard" that the Google crawler tries to determine the URL pattern (which makes sense for deciding which pages should go into the index and which should not) and furthermore guesses other URLs by trying different values (like "I know that id 1 exists, let's try 2, 3, 4, ...").
Is there any evidence that the Google crawler really behaves that way? I doubt it. My guess is that the Google crawler submitted an HTML form and somehow got links to those unwanted resources.
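Just to spell out the behaviour I mean, the rumoured guessing would amount to something like the sketch below. This is purely my illustration of the rumour, not anything known about how Google's crawler works; `guess_neighbour_urls` and its parameters are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def guess_neighbour_urls(known_url, param="id", count=3):
    """Build candidate URLs by incrementing a numeric query parameter."""
    parts = urlsplit(known_url)
    query = parse_qs(parts.query)
    start = int(query[param][0])
    guesses = []
    for value in range(start + 1, start + 1 + count):
        new_query = dict(query)
        new_query[param] = [str(value)]
        # Rebuild the URL with only the query string changed.
        guesses.append(
            urlunsplit(parts._replace(query=urlencode(new_query, doseq=True)))
        )
    return guesses


print(guess_neighbour_urls("http://website-a.example.com/?id=1"))
```

If a crawler did this, it would hit ids that exist on the data server but were never linked from the site being crawled.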
I found some similar questions, including "Google webmaster central: indexing and posting false pages" [link removed]; however, none of those pages gives any evidence.