How much HDD space would I need to cache the web while respecting robots.txt?

Posted by Koning Baard XIV on Server Fault
Published on 2010-06-05T12:56:09Z Indexed on 2010/06/05 13:02 UTC
Hit count: 289

I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites like Stack Overflow or Smashing Magazine. If that works, I'd like to start crawling the entire web, respecting robots.txt. I'll save all HTML, PDF, Word, Excel, PowerPoint, Keynote, etc. documents (not EXEs, DMGs, etc., just documents) in a MySQL DB. Alongside that, I'll have a second table containing all results and descriptions, and a table with words and the pages on which they appear (i.e. an inverted index).
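For the robots.txt part, a minimal sketch using Python's standard-library `urllib.robotparser` (the crawler name `MyCrawler` and the example rules are just illustrations; normally you'd call `set_url(...)` and `read()` against the live file instead of parsing an inline string):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# For a real crawl you would fetch the live file:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse an inline example instead of hitting the network.
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# Check each URL against the rules before fetching it.
print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))    # False
```

Checking `can_fetch()` before every request is what "respecting robots.txt" amounts to in practice; cache the parsed file per host so you don't re-fetch it for every URL.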
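The words-to-pages table described above is an inverted index. A hypothetical in-memory stand-in for that MySQL table (document IDs and text are made up for illustration):

```python
from collections import defaultdict

# word -> set of document IDs containing that word
# (in the real system this would be a MySQL table, not a dict)
index = defaultdict(set)

def add_document(doc_id, text):
    """Tokenize naively on whitespace and record each word's document."""
    for word in text.lower().split():
        index[word].add(doc_id)

add_document(1, "web crawler storage estimate")
add_document(2, "crawler respects robots txt")

print(sorted(index["crawler"]))  # [1, 2]
```

A lookup is then just a read of one key (or one indexed SQL row), which is what makes word search fast compared with scanning every stored page.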

How much HDD space do you think I'd need to save all those pages? Is it as low as 1 TB, or more like 10 TB, 20, 30? 1000?
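A back-of-envelope way to frame the question, with purely assumed figures (75 KB average document size and one billion pages are guesses, not measurements):

```python
# All figures below are assumptions for illustration only.
avg_page_kb = 75           # assumed average size of one stored document
pages = 1_000_000_000      # assumed number of pages crawled

total_tb = avg_page_kb * pages / 1024**3   # KB -> TB (1 TB = 1024**3 KB)
print(f"~{total_tb:.1f} TB")               # ~69.8 TB
```

So the answer scales linearly: multiply whatever average document size and page count you actually measure during the test crawl of the medium-sized sites, and the small-scale numbers predict the full-web requirement.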

Thanks

© Server Fault or respective owner
