Large amount of data in many text files - how to process?
Posted by Stephen on Stack Overflow
Published on 2010-05-30T05:06:28Z
Indexed on 2010/05/30 5:12 UTC
Hi,

I have a large and growing amount of data (a few terabytes). It is stored in many tab-delimited flat text files (each about 30 MB). Most of the task involves reading the data, aggregating (summing/averaging plus additional transformations) over observations/rows based on a series of predicate statements, and then saving the output as text, HDF5, or SQLite files, etc. I normally use R for such tasks, but I fear this may be a bit too large for it.

Some candidate solutions are to:

1) write the whole thing in C (or Fortran)
2) import the files (tables) into a relational database directly and then pull off chunks in R or Python (some of the transformations are not amenable to pure SQL solutions)
3) write the whole thing in Python

Would (3) be a bad idea? I know you can wrap C routines in Python, but in this case there isn't anything computationally prohibitive (e.g., optimization routines that require many iterative calculations), so I think I/O may be as much of a bottleneck as the computation itself.

Do you have any recommendations on further considerations or suggestions?

Thanks
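To make option (3) concrete, here is a rough sketch of the kind of streaming pass I have in mind, using only the Python standard library. It reads each file one row at a time, so memory use stays flat no matter how many files there are. The function name `aggregate_files`, the column names `group` and `value`, the `data/*.tsv` path, and the assumption that every file has a header row are all made up for illustration; my real files and predicates are more involved.

```python
import csv
import glob
from collections import defaultdict


def aggregate_files(paths, key_col, value_col, predicate):
    """Stream tab-delimited files row by row and return sum/count/mean per key.

    Assumes (hypothetically) that every file has a header row naming its columns.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [running sum, row count]
    for path in paths:
        with open(path, newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            for row in reader:
                # Only rows that satisfy the predicate contribute to the totals.
                if predicate(row):
                    acc = totals[row[key_col]]
                    acc[0] += float(row[value_col])
                    acc[1] += 1
    return {
        key: {"sum": total, "count": count, "mean": total / count}
        for key, (total, count) in totals.items()
    }


if __name__ == "__main__":
    # Hypothetical location and column names; adjust to the real layout.
    files = sorted(glob.glob("data/*.tsv"))
    result = aggregate_files(
        files, "group", "value", lambda row: float(row["value"]) > 0
    )
    for key, stats in result.items():
        print(key, stats["sum"], stats["count"], stats["mean"])
```

Since only the running sums and counts are kept in memory, the pass should be limited mainly by how fast the files can be read, which is why I suspect I/O rather than computation is the bottleneck here.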
© Stack Overflow or respective owner