How can I quickly parse large (>10GB) files?
Posted by Andrew on Stack Overflow
Published on 2009-12-17T01:56:48Z
Hi - I have to process text files 10-20GB in size of the format: field1 field2 field3 field4 field5
I would like to parse the data from field2 of each line into one of several files; the file each line's data gets pushed into is determined by the value in field4. There are 25 different possible values in field4 and hence 25 different files the data can get parsed into.
I have tried using Perl (slow) and awk (faster but still slow) - does anyone have any suggestions or pointers toward alternative approaches?
FYI, here is the awk code I was trying to use; note I had to resort to going through the large file 25 times because I wasn't able to keep 25 files open at once in awk:
chromosomes=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25)
for chr in "${chromosomes[@]}"
do
    # one full pass over the large file per field4 value
    awk -v pat="$chr" '$4 == pat { for (i = $2; i <= $2 + 52; i++) print i }' my_in_file_here >> my_out_file_"$chr".query
done
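For comparison, here is a minimal single-pass sketch of what keeping all 25 output files open at once would look like in awk. Some older awk implementations cap the number of simultaneously open output streams (which may be what forced the per-chromosome loop above), but gawk handles 25 without trouble. The input file name and the my_out_file_<chr>.query naming are carried over from the loop above; treat this as an illustration, not a tested solution:

awk '{
    out = "my_out_file_" $4 ".query"    # pick the output file from field4
    for (i = $2; i <= $2 + 52; i++)     # same 53-value expansion of field2
        print i > out                   # ">" opens each file once, then appends
}' my_in_file_here

Because each output stream stays open for the life of the run, the 10-20GB file is read only once instead of 25 times.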
© Stack Overflow or respective owner