Guide on crawling the entire web?
Posted by bohohasdhfasdf on Stack Overflow
Published on 2010-01-17T08:10:30Z
Indexed on 2010/06/03 16:54 UTC
webcrawling
I just had this thought and was wondering whether it's possible to crawl the entire web (just like the big boys!) on a single dedicated server (say a Core 2 Duo with 8 GB RAM, a 750 GB disk, and a 100 Mbps connection).
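To put rough numbers on that hardware, here is a back-of-envelope sketch. The link speed and disk size come from the specs above; the average page size (100 KB) and compression ratio (4x) are purely my assumptions for illustration, and the function names are made up for this example. It also ignores DNS lookups, politeness delays, and CPU cost, so treat the results as an optimistic upper bound.

```python
# Back-of-envelope capacity estimate for a single crawl server.
# ASSUMPTIONS (not from any measurement): average page size and
# compression ratio below are illustrative guesses.

LINK_MBPS = 100          # network link, from the server spec
DISK_GB = 750            # disk size, from the server spec
AVG_PAGE_KB = 100        # assumed average size of a fetched page
COMPRESSION_RATIO = 4    # assumed on-disk compression of stored pages


def pages_per_day(link_mbps: float, avg_page_kb: float) -> float:
    """Upper bound on pages fetched per day if the link is fully saturated."""
    bytes_per_second = link_mbps * 1_000_000 / 8
    pages_per_second = bytes_per_second / (avg_page_kb * 1000)
    return pages_per_second * 86_400


def pages_on_disk(disk_gb: float, avg_page_kb: float, compression_ratio: float) -> float:
    """How many compressed pages fit on the disk."""
    stored_bytes_per_page = avg_page_kb * 1000 / compression_ratio
    return disk_gb * 1_000_000_000 / stored_bytes_per_page


def days_to_fill_disk(link_mbps: float, disk_gb: float,
                      avg_page_kb: float, compression_ratio: float) -> float:
    """Days of saturated crawling before the disk is full."""
    return (pages_on_disk(disk_gb, avg_page_kb, compression_ratio)
            / pages_per_day(link_mbps, avg_page_kb))


if __name__ == "__main__":
    print(f"pages/day (upper bound): {pages_per_day(LINK_MBPS, AVG_PAGE_KB):,.0f}")
    print(f"pages that fit on disk:  {pages_on_disk(DISK_GB, AVG_PAGE_KB, COMPRESSION_RATIO):,.0f}")
    print(f"days until disk is full: "
          f"{days_to_fill_disk(LINK_MBPS, DISK_GB, AVG_PAGE_KB, COMPRESSION_RATIO):.1f}")
```

Under those assumed numbers the link tops out at about 10.8 million pages per day and the disk holds about 30 million pages, so storage would run out in under three days of full-speed crawling. Different page-size or compression assumptions change the figures, but the calculation stays the same.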
I've come across a paper where this was done, but I can't recall its title. It was about crawling the entire web on a single dedicated server using some statistical model.
Anyway, imagine starting with around 10,000 seed URLs and doing an exhaustive crawl.
Is it possible?
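To make "start from seed URLs and crawl exhaustively" concrete, here is a minimal sketch of a breadth-first crawl frontier. It is only an illustration of the idea, not the method from the paper mentioned above. The `crawl` function, the `fetch_links` callback, and the `FAKE_WEB` link graph are hypothetical names invented for this example; a real crawler would plug an HTTP fetcher and HTML link extractor into `fetch_links` and would also need robots.txt handling, per-host rate limiting, and a frontier that does not live entirely in memory.

```python
from collections import deque
from typing import Callable, Iterable, List
from urllib.parse import urldefrag, urljoin


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          max_pages: int) -> List[str]:
    """Breadth-first crawl from `seeds`, visiting at most `max_pages` URLs.

    `fetch_links(url)` is a caller-supplied (hypothetical) function that
    returns the links found on the page at `url`.
    """
    frontier = deque()   # URLs discovered but not yet visited (FIFO = breadth-first)
    seen = set()         # every URL ever queued, so nothing is fetched twice

    for seed in seeds:
        url = urldefrag(seed).url          # drop any #fragment
        if url not in seen:
            seen.add(url)
            frontier.append(url)

    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Resolve relative links against the current page, drop fragments.
            absolute = urldefrag(urljoin(url, link)).url
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# A tiny in-memory "web" so the sketch runs without any network access.
FAKE_WEB = {
    "http://a.example/": ["/b", "http://c.example/"],
    "http://a.example/b": ["http://a.example/", "http://c.example/#top"],
    "http://c.example/": [],
}

if __name__ == "__main__":
    order = crawl(["http://a.example/"], lambda u: FAKE_WEB.get(u, []), max_pages=10)
    print(order)
```

The `seen` set is what keeps the crawl from looping forever on pages that link back to each other. At the scale of the whole web that set would no longer fit in 8 GB of RAM as plain strings, which is one reason large crawlers tend to use hashed or on-disk structures instead.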
I need to crawl the web but am limited to a single dedicated server. How can I do this? Is there an open source solution out there already?
For example, see this real-time search engine (http://crawlrapidshare.com): the results are extremely good and freshly updated. How are they doing this?
© Stack Overflow or respective owner