What techniques can be used to detect so-called "black holes" (spider traps) when creating a web crawler?
Posted by Tom on Stack Overflow
Published on 2010-12-22T19:31:13Z
Indexed on 2010/12/27 4:54 UTC
When creating a web crawler, you have to design some kind of system that gathers links and adds them to a queue. Some, if not most, of these links will be dynamic. Some of those dynamic links appear to be different but add no value, because they are created specifically to fool crawlers.
An example:
We tell our crawler to crawl the domain evil.com by entering an initial lookup URL.
Let's assume we let it crawl the front page initially, evil.com/index.
The returned HTML will contain several "unique" links:
- evil.com/somePageOne
- evil.com/somePageTwo
- evil.com/somePageThree
The crawler will add these to the buffer of uncrawled URLs.
When somePageOne is being crawled, the crawler receives more URLs:
- evil.com/someSubPageOne
- evil.com/someSubPageTwo
These appear to be unique, and they are: the returned content differs from previous pages, and the URL is new to the crawler. However, this is only because the developer has built a "loop trap" or "black hole".
The crawler will add this new sub page, and the sub page will have another sub page, which will also be added. This process can go on indefinitely. The content of each page is unique but totally useless (it is randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages that we are not actually interested in.
These loop traps are very difficult to detect, and if your crawler has nothing in place to guard against them, it will stay stuck on a single domain indefinitely.
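To make the problem concrete, here is a minimal Python sketch of the loop described above. It is an illustration under assumptions, not a real crawler: `fake_fetch` is a hypothetical stand-in for an HTTP request to the trap site, the `/a` and `/b` link names are invented, and `max_steps` exists only so the demo terminates. It shows that the usual "seen URL" check does not help, because every URL the trap returns really is new.

```python
from collections import deque


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP fetch against a trap site.

    Every page "contains" two links that the crawler has never seen.
    """
    return [f"{url}/a", f"{url}/b"]


def crawl(start_url, fetch, max_steps):
    """Breadth-first crawl with URL de-duplication.

    max_steps is only here so that the demo stops.
    Returns (pages crawled, URLs still waiting in the queue).
    """
    frontier = deque([start_url])  # buffer of uncrawled URLs
    seen = {start_url}             # every URL ever queued
    crawled = 0
    while frontier and crawled < max_steps:
        url = frontier.popleft()
        crawled += 1
        for link in fetch(url):
            if link not in seen:   # never filters anything on a trap site
                seen.add(link)
                frontier.append(link)
    return crawled, len(frontier)


if __name__ == "__main__":
    crawled, pending = crawl("evil.com/index", fake_fetch, max_steps=1000)
    print(f"crawled={crawled}, still queued={pending}")
```

In this simulation the queue grows with every page crawled, so the crawl never finishes on its own however long it runs.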
My question is, what techniques can be used to detect so-called black holes?
One of the most common answers I have heard is to introduce a limit on the number of pages to be crawled. However, I cannot see how this can be a reliable technique when you do not know what kind of site is to be crawled. A legitimate site, like Wikipedia, can have hundreds of thousands of pages. Such a limit could return a false positive for this kind of site.
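For reference, the page-limit approach could look roughly like the sketch below. The class name `DomainBudget`, the limit of 3, and the example URLs are all made up for illustration, and the domain is taken naively as everything before the first slash, which assumes scheme-less URLs like the ones in the example above.

```python
from collections import Counter


class DomainBudget:
    """Naive per-domain page limit (hypothetical helper, for illustration)."""

    def __init__(self, max_pages_per_domain):
        self.max_pages = max_pages_per_domain
        self.crawled = Counter()  # pages allowed so far, per domain

    def allow(self, url):
        # Assumes scheme-less URLs such as "evil.com/index".
        domain = url.split("/", 1)[0]
        if self.crawled[domain] >= self.max_pages:
            return False
        self.crawled[domain] += 1
        return True


if __name__ == "__main__":
    budget = DomainBudget(max_pages_per_domain=3)
    trap = [budget.allow(f"evil.com/page{i}") for i in range(5)]
    legit = [budget.allow(f"wikipedia.org/article{i}") for i in range(5)]
    print("trap: ", trap)
    print("legit:", legit)
```

Both domains are cut off after the same number of pages. The limit does stop the trap, but it cannot tell the trap apart from a large legitimate site, which is exactly the false positive I am worried about.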
Any feedback is appreciated.
Thanks.
© Stack Overflow or respective owner