Web crawler update strategy

Posted by superb on Stack Overflow
Published on 2010-04-05T03:28:55Z

I want to crawl useful resources (like background pictures) from certain websites. It is not a hard job, especially with the help of wonderful projects like Scrapy.

The problem is that I don't want to crawl these sites just ONE time. I also want to keep my crawl running long-term and pick up updated resources. So I want to know: is there any good strategy for a web crawler to fetch updated pages?
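One common building block for re-crawling is HTTP conditional requests: if the repository stores the `ETag` and `Last-Modified` values from the previous fetch, the next fetch can send them back and let the server answer `304 Not Modified` without transferring the body. A minimal sketch (the metadata dict and its keys are illustrative, not from the post):

```python
# Sketch: build conditional-request headers from metadata saved on the
# previous fetch of a URL. The keys "etag" and "last_modified" are my
# own naming, assuming the URL repository stores them per URL.

def conditional_headers(prev_meta):
    """Return HTTP headers for a conditional re-fetch of a URL."""
    headers = {}
    if prev_meta.get("etag"):
        headers["If-None-Match"] = prev_meta["etag"]
    if prev_meta.get("last_modified"):
        headers["If-Modified-Since"] = prev_meta["last_modified"]
    return headers

# Example: a page previously served with an ETag and a Last-Modified date.
meta = {"etag": '"abc123"', "last_modified": "Mon, 05 Apr 2010 03:28:55 GMT"}
print(conditional_headers(meta))
```

Not every server supports these headers, so a content hash comparison is still needed as a fallback.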

Here's a coarse algorithm I've thought of. I divide the crawl process into rounds. In each round, the URL repository gives the crawler a certain number (say, 10,000) of URLs to crawl, and then the next round begins. The detailed steps are:

  1. The crawler adds the start URLs to the URL repository.
  2. The crawler asks the URL repository for at most N URLs to crawl.
  3. The crawler fetches those URLs and updates certain information in the URL repository, such as the page content, the fetch time, and whether the content has changed.
  4. Go back to step 2.
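The round-based loop above could be sketched roughly like this, using an in-memory repository and a SHA-256 hash of the body to decide "whether the content has changed". The fetcher is stubbed out, and all names here are my own, not from Scrapy or any particular framework:

```python
import hashlib
import time

def crawl_rounds(start_urls, fetch, rounds=2, batch_size=10000):
    """Round-based crawl: each round pulls at most `batch_size` URLs from
    the repository, fetches them, and records a content hash plus fetch
    time so a later round can tell whether a page has changed."""
    # Step 1: seed the URL repository with the start URLs.
    repo = {url: {"hash": None, "fetched_at": None, "changed": False}
            for url in start_urls}
    for _ in range(rounds):
        # Step 2: take at most batch_size URLs to crawl this round.
        batch = list(repo)[:batch_size]
        for url in batch:
            # Step 3: fetch, then update hash, fetch time, and changed flag.
            content = fetch(url)
            digest = hashlib.sha256(content).hexdigest()
            entry = repo[url]
            entry["changed"] = (entry["hash"] is not None
                                and entry["hash"] != digest)
            entry["hash"] = digest
            entry["fetched_at"] = time.time()
        # Step 4: loop back for the next round.
    return repo

# Usage with a stubbed fetcher whose content never changes:
repo = crawl_rounds(["http://example.com/a"], lambda url: b"same bytes")
print(repo["http://example.com/a"]["changed"])  # → False
```

A real version would also extract new links in step 3 and feed them back into the repository, and would prioritize which URLs each round hands out.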

To make this more concrete, I still need to solve the following question: how to decide the "freshness" of a web page, i.e., the probability that the page has been updated?
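One standard way to model this (a Poisson-process approach from the re-crawling literature, not something from the post) is to assume each page changes at some rate λ, estimate λ from how often past fetches observed a change, and take the probability that the page has changed in the t seconds since the last crawl as 1 − e^(−λt). The repository can then hand out the URLs with the highest probability first. A minimal sketch:

```python
import math

def change_rate(changes_observed, elapsed_total):
    """Crude estimate of a page's change rate lambda (changes per second)
    from its crawl history: observed changes over total elapsed time."""
    if elapsed_total <= 0 or changes_observed <= 0:
        return 0.0
    return changes_observed / elapsed_total

def p_changed(lam, seconds_since_crawl):
    """Probability the page has changed since the last crawl,
    under a Poisson change model: P = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-lam * seconds_since_crawl)

# A page observed to change 10 times over roughly 10 days:
lam = change_rate(10, 10 * 86400)
print(round(p_changed(lam, 86400), 3))  # chance it changed in the last day → 0.632
```

This crude estimator is biased when crawls are infrequent (several changes between two visits look like one), but it is enough to rank URLs for re-crawling.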

Since that is an open question, hopefully it will bring some fruitful discussion here.
