Web crawler update strategy

Posted by superb on Stack Overflow
Published on 2010-04-05T03:28:55Z

I want to crawl useful resources (like background pictures) from certain websites. It is not a hard job, especially with the help of wonderful projects like Scrapy.

The problem is that I don't want to crawl these sites just ONE time. I also want to keep my crawl running long-term and pick up updated resources. So I want to know: is there any good strategy for a web crawler to fetch updated pages?
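One common building block for re-crawling is HTTP conditional requests: if the repository stores the `ETag` and `Last-Modified` values from the previous fetch, the next fetch can send them back and let the server answer `304 Not Modified` without transferring the body. A minimal sketch (the metadata dict and its keys are illustrative, not from the post):

```python
# Sketch: build conditional-request headers from metadata saved on the
# previous fetch of a URL. The keys "etag" and "last_modified" are my
# own naming, assuming the URL repository stores them per URL.

def conditional_headers(prev_meta):
    """Return HTTP headers for a conditional re-fetch of a URL."""
    headers = {}
    if prev_meta.get("etag"):
        headers["If-None-Match"] = prev_meta["etag"]
    if prev_meta.get("last_modified"):
        headers["If-Modified-Since"] = prev_meta["last_modified"]
    return headers

# Example: a page previously served with an ETag and a Last-Modified date.
meta = {"etag": '"abc123"', "last_modified": "Mon, 05 Apr 2010 03:28:55 GMT"}
print(conditional_headers(meta))
```

Not every server supports these headers, so a content hash comparison is still needed as a fallback.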

Here's a coarse algorithm I've thought of. I divide the crawl process into rounds. In each round, the URL repository gives the crawler a certain number (say, 10,000) of URLs to crawl, and then the next round begins. The detailed steps are:

  1. The crawler adds the start URLs to the URL repository.
  2. The crawler asks the URL repository for at most N URLs to crawl.
  3. The crawler fetches those URLs and updates certain information in the URL repository, such as the page content, the fetch time, and whether the content has changed.
  4. Go back to step 2.
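The round-based loop above could be sketched roughly like this, using an in-memory repository and a SHA-256 hash of the body to decide "whether the content has changed". The fetcher is stubbed out, and all names here are my own, not from Scrapy or any particular framework:

```python
import hashlib
import time

def crawl_rounds(start_urls, fetch, rounds=2, batch_size=10000):
    """Round-based crawl: each round pulls at most `batch_size` URLs from
    the repository, fetches them, and records a content hash plus fetch
    time so a later round can tell whether a page has changed."""
    # Step 1: seed the URL repository with the start URLs.
    repo = {url: {"hash": None, "fetched_at": None, "changed": False}
            for url in start_urls}
    for _ in range(rounds):
        # Step 2: take at most batch_size URLs to crawl this round.
        batch = list(repo)[:batch_size]
        for url in batch:
            # Step 3: fetch, then update hash, fetch time, and changed flag.
            content = fetch(url)
            digest = hashlib.sha256(content).hexdigest()
            entry = repo[url]
            entry["changed"] = (entry["hash"] is not None
                                and entry["hash"] != digest)
            entry["hash"] = digest
            entry["fetched_at"] = time.time()
        # Step 4: loop back for the next round.
    return repo

# Usage with a stubbed fetcher whose content never changes:
repo = crawl_rounds(["http://example.com/a"], lambda url: b"same bytes")
print(repo["http://example.com/a"]["changed"])  # → False
```

A real version would also extract new links in step 3 and feed them back into the repository, and would prioritize which URLs each round hands out.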

To make this more concrete, I still need to solve the following question: how to decide the "freshness" of a web page, i.e., the probability that the page has been updated?
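One standard way to model this (a Poisson-process approach from the re-crawling literature, not something from the post) is to assume each page changes at some rate λ, estimate λ from how often past fetches observed a change, and take the probability that the page has changed in the t seconds since the last crawl as 1 − e^(−λt). The repository can then hand out the URLs with the highest probability first. A minimal sketch:

```python
import math

def change_rate(changes_observed, elapsed_total):
    """Crude estimate of a page's change rate lambda (changes per second)
    from its crawl history: observed changes over total elapsed time."""
    if elapsed_total <= 0 or changes_observed <= 0:
        return 0.0
    return changes_observed / elapsed_total

def p_changed(lam, seconds_since_crawl):
    """Probability the page has changed since the last crawl,
    under a Poisson change model: P = 1 - exp(-lambda * t)."""
    return 1.0 - math.exp(-lam * seconds_since_crawl)

# A page observed to change 10 times over roughly 10 days:
lam = change_rate(10, 10 * 86400)
print(round(p_changed(lam, 86400), 3))  # chance it changed in the last day → 0.632
```

This crude estimator is biased when crawls are infrequent (several changes between two visits look like one), but it is enough to rank URLs for re-crawling.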

Since that is an open question, hopefully it will bring some fruitful discussion here.
