How much HDD space would I need to cache the web while respecting robots.txt?
Posted by Koning Baard XIV on Server Fault
Published on 2010-06-05T12:56:09Z
I want to experiment with creating a web crawler. I'll start by indexing a few medium-sized websites such as Stack Overflow or Smashing Magazine. If that works, I'd like to start crawling the entire web, respecting each site's robots.txt. I'll save all HTML, PDF, Word, Excel, PowerPoint, Keynote, and similar documents (no EXEs, DMGs, etc., just documents) in a MySQL database. Alongside that, I'll have a second table containing all results and descriptions, and a third table mapping words to the pages on which they appear (i.e. an index).
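To make the "respecting robots.txt" part concrete, here is a minimal sketch of the check a crawler could run before fetching a URL. It uses Python's standard `urllib.robotparser`; the robots.txt content, the crawler name `ExampleCrawler`, and the example.com URLs are all made up for illustration, and the rules are inlined so the snippet runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

USER_AGENT = "ExampleCrawler"  # hypothetical crawler name


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules permit USER_AGENT to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = make_parser(ROBOTS_TXT)
print(allowed(parser, "https://example.com/questions/1"))  # True
print(allowed(parser, "https://example.com/private/x"))    # False
```

In a real crawler the parser would presumably be built once per host (for instance via `set_url()` and `read()`) and cached, rather than re-parsed for every page.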
How much HDD space do you think I need to save all the pages? Is it as low as 1 TB, or is it more like 10 TB, 20 TB, maybe 30 TB? Or even 1000 TB?
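One way to frame the estimate is as simple arithmetic: number of documents times average document size, plus some overhead for the results table and the word index. The sketch below only shows the shape of that calculation. The function name `estimate_storage_tb` and every input value in the usage line are placeholders I made up, not measurements of the real web.

```python
def estimate_storage_tb(num_docs: int, avg_doc_kb: float, index_overhead: float) -> float:
    """Rough storage estimate in decimal terabytes.

    num_docs       -- number of documents to store (assumed, not measured)
    avg_doc_kb     -- average stored document size in kB (assumed)
    index_overhead -- extra space for the results and word-index tables,
                      as a fraction of the raw document size (assumed)
    """
    raw_bytes = num_docs * avg_doc_kb * 1000      # kB -> bytes
    total_bytes = raw_bytes * (1 + index_overhead)
    return total_bytes / 1e12                     # bytes -> TB


# Placeholder inputs only: one million documents of 100 kB each,
# with the index tables adding 50% on top.
print(estimate_storage_tb(1_000_000, 100, 0.5))
```

Plugging in a measured average size from the first few crawled sites would give a much better guess than any of the placeholder numbers here.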
Thanks
© Server Fault or respective owner