What techniques can be used to detect so-called "black holes" (spider traps) when creating a web crawler?
When creating a web crawler, you have to design some kind of system that gathers links and adds them to a queue. Some, if not most, of these links will be dynamic: they appear to be different, but add no value, because they are generated specifically to fool crawlers.
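For context, the crawler I have in mind works roughly like the following sketch (Python, simplified; fetch_links is a placeholder for the actual HTTP fetching and HTML parsing, not real code I am running):

    from collections import deque

    def crawl(start_url, fetch_links):
        """Breadth-first crawl. fetch_links(url) is assumed to return the
        set of URLs found on the page at `url`."""
        frontier = deque([start_url])   # queue of uncrawled URLs
        seen = {start_url}              # URLs already queued or crawled

        while frontier:
            url = frontier.popleft()
            for link in fetch_links(url):
                if link not in seen:    # only exact duplicates are filtered out
                    seen.add(link)
                    frontier.append(link)

As the sketch shows, filtering exact duplicates with a "seen" set does nothing against a trap, because every generated URL is new to the crawler.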
An example:
We tell our crawler to crawl the domain evil.com by entering an initial lookup URL.
Let's assume we let it crawl the front page initially, evil.com/index
The returned HTML will contain several "unique" links:
- evil.com/somePageOne
- evil.com/somePageTwo
- evil.com/somePageThree
The crawler will add these to the buffer of uncrawled URLs.
When somePageOne is being crawled, the crawler receives more URLs:
- evil.com/someSubPageOne
- evil.com/someSubPageTwo
These appear to be unique, and in a sense they are: the returned content differs from previous pages, and the URLs are new to the crawler. However, this is only because the developer has built a "loop trap" or "black hole".
The crawler will add this new sub page, and that sub page will contain yet another sub page, which will also be added. This process can go on infinitely. The content of each page is unique, but totally useless (it is randomly generated text, or text pulled from a random source). Our crawler will keep finding new pages that we are not actually interested in.
These loop traps are very difficult to detect, and if your crawler has nothing in place to prevent them, it will get stuck on such a domain indefinitely.
My question is: what techniques can be used to detect these so-called black holes?
One of the most common answers I have heard is to introduce a limit on the number of pages to be crawled per domain. However, I cannot see how this can be a reliable technique when you do not know in advance what kind of site is to be crawled. A legitimate site, like Wikipedia, can have hundreds of thousands of pages, so such a limit could produce a false positive for that kind of site.
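To make it concrete, here is a rough sketch of that limit idea as I understand it (the threshold of 10,000 is an arbitrary number I picked for illustration, not a recommendation):

    from collections import defaultdict
    from urllib.parse import urlparse

    MAX_PAGES_PER_DOMAIN = 10_000    # arbitrary cap; too low for large legitimate sites

    pages_seen = defaultdict(int)    # per-domain count of URLs accepted so far

    def should_crawl(url):
        domain = urlparse(url).netloc
        if pages_seen[domain] >= MAX_PAGES_PER_DOMAIN:
            return False             # treat the domain as a suspected trap
        pages_seen[domain] += 1
        return True

A site like Wikipedia would hit this cap long before it is exhausted, which is exactly the false positive I am worried about.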
Any feedback is appreciated.
Thanks.