Guide on crawling the entire web?

Posted by bohohasdhfasdf on Stack Overflow
Published on 2010-01-17T08:10:30Z Indexed on 2010/06/03 16:54 UTC

Filed under: webcrawling

I just had this thought and was wondering whether it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say, a Core 2 Duo with 8 GB of RAM, a 750 GB disk, and a 100 Mbps connection).
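To get a rough feel for the limits of that hardware, here is a back-of-envelope sketch. The link speed and disk size are the figures from above; the average page size of 100 KB is purely my assumption, and the function names are just illustrative. It ignores latency, DNS, politeness delays, and compression, so treat the numbers as upper bounds at best.

```python
# Back-of-envelope capacity estimate for the server described above.
# ASSUMPTION (not a measured figure): an average fetched page is 100 KB.

LINK_MBPS = 100      # from the question: 100 Mbps connection
DISK_GB = 750        # from the question: 750 GB disk
AVG_PAGE_KB = 100    # assumption: average size of one downloaded page


def pages_per_day(link_mbps: float, avg_page_kb: float) -> float:
    """Upper bound on pages fetched per day if the link were fully saturated."""
    bytes_per_second = link_mbps * 1_000_000 / 8
    return bytes_per_second * 86_400 / (avg_page_kb * 1_000)


def pages_on_disk(disk_gb: float, avg_page_kb: float) -> float:
    """Number of uncompressed pages of the assumed size that fit on the disk."""
    return disk_gb * 1_000_000_000 / (avg_page_kb * 1_000)


if __name__ == "__main__":
    print(f"{pages_per_day(LINK_MBPS, AVG_PAGE_KB):,.0f} pages/day at most")
    print(f"{pages_on_disk(DISK_GB, AVG_PAGE_KB):,.0f} pages fit on disk")
```

Under that assumed page size, the disk would hold fewer raw pages than the link could deliver in a single day, so storage (or what gets thrown away after parsing) looks like the first constraint to think about.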

I've come across a paper where this was done, but I cannot recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.

Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl.
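What I have in mind by "exhaustive crawl from seeds" is roughly the following breadth-first sketch. It is not the method from the paper mentioned above (I don't remember it); it is just a plain queue plus a seen-set. The `crawl` function, the `LinkExtractor` class, the `max_pages` cap, and the injected `fetch` callable are all names I made up for illustration. A real `fetch` would have to do the HTTP request, respect robots.txt, and rate-limit per host, none of which is shown here.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Iterable, List, Optional
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds: Iterable[str],
          fetch: Callable[[str], Optional[str]],
          max_pages: int = 1000) -> List[str]:
    """Breadth-first crawl from the seeds; returns URLs in the order fetched.

    `fetch` is a hypothetical callable returning the page HTML, or None on failure.
    """
    frontier = deque()   # URLs waiting to be fetched (FIFO gives breadth-first order)
    seen = set()         # every URL ever queued, so nothing is queued twice
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            frontier.append(seed)

    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue  # fetch failed; skip this URL
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment part.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Passing `fetch` in as a parameter keeps the traversal logic testable without touching the network; for example, `crawl(["http://a.test/"], pages.get)` works against a plain dict of URL to HTML. Note that at web scale the in-memory `seen` set would itself become a problem on 8 GB of RAM.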

Is it possible?

I need to crawl the web but am limited to a single dedicated server. How can I do this? Is there an open source solution out there already?

For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?
