Guide on crawling the entire web?

Posted by bohohasdhfasdf on Stack Overflow
Published on 2010-01-17T08:10:30Z Indexed on 2010/06/03 16:54 UTC

I just had this thought and was wondering whether it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core 2 Duo with 8 GB of RAM, a 750 GB disk, and a 100 Mbps connection).
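A rough back-of-envelope sketch of what that hardware allows. The 100 Mbps link and 750 GB disk come from the spec above; the 50 KB average page size is purely an assumption for illustration, and the function names are hypothetical.

```python
# Back-of-envelope capacity estimate for a single crawl server.
# ASSUMPTION: avg_page_kb is a guess, not a measured value; real pages vary widely.

def pages_per_day(link_mbps: float, avg_page_kb: float, utilization: float = 1.0) -> float:
    """Upper bound on pages fetched per day, limited only by link bandwidth."""
    bytes_per_sec = link_mbps * 1_000_000 / 8 * utilization
    return bytes_per_sec * 86_400 / (avg_page_kb * 1000)


def pages_that_fit(disk_gb: float, avg_page_kb: float, compression_ratio: float = 1.0) -> float:
    """Number of pages that fit on disk when each is stored compressed by compression_ratio."""
    stored_bytes_per_page = avg_page_kb * 1000 / compression_ratio
    return disk_gb * 1_000_000_000 / stored_bytes_per_page


if __name__ == "__main__":
    print(f"pages/day at line rate: {pages_per_day(100, 50):,.0f}")
    print(f"pages that fit on disk: {pages_that_fit(750, 50):,.0f}")
```

Under that assumed page size, the link tops out at about 21.6 million pages per day, while the disk holds only about 15 million uncompressed pages. So storage, not bandwidth, would be the first limit if raw pages are kept.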

I've come across a paper where this was done, but I cannot recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.

Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl.
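A minimal sketch of what such a seed-driven exhaustive crawl looks like: a breadth-first frontier plus a set of already-seen URLs. All names here (`crawl`, `extract_links`, the `fetch` callback) are hypothetical, and `fetch` is left to the caller so the sketch stays independent of any HTTP library. It also ignores things a real crawler needs, such as robots.txt handling, per-host politeness delays, and a seen-set that does not fit in RAM.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, with fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    found = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            found.append(absolute)
    return found


def crawl(seeds, fetch, max_pages):
    """Breadth-first crawl from seeds.

    fetch(url) is a caller-supplied function (an assumption of this sketch)
    that returns the page HTML as a string, or None on failure.
    """
    seeds = list(seeds)
    frontier = deque(seeds)   # URLs waiting to be fetched, in FIFO order
    seen = set(seeds)         # every URL ever queued, to avoid re-crawling
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited
```

Because `fetch` is injected, the loop can be tried against a small in-memory dictionary of pages before pointing it at real sites. With 10,000 seeds the frontier and `seen` set grow quickly, which is where the single-server memory limit starts to matter.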

Is it possible?

I need to crawl the web but am limited to a single dedicated server. How can I do this? Is there an open source solution out there already?

For example, see this real-time search engine: http://crawlrapidshare.com. The results are extremely good and freshly updated. How are they doing this?

© Stack Overflow or respective owner
