Unix sort 10x slower with keys specified
Posted by KenFar on Super User
Published on 2012-06-14T19:49:49Z
My data:
- It's a 71 MB file with 1.5 million rows.
- It has 6 fields: four are strings averaging 15 characters, and two are integers. Three of the fields are sometimes empty. All six fields together form a unique key, and that's what I need to sort on.
Sort statement:
sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o a_out.csv a_in.csv
The problem:
- If I sort without keys, it takes 30 seconds.
- If I sort with keys, it takes 660 seconds.
- I need to sort with keys to keep this generic and useful for other files that have non-key fields as well. The 30-second timing is fine, but 660 seconds is a killer.
I could theoretically move the temp directory to an SSD, and/or split the file into 4 parts, sort them separately (in parallel), then merge the results, etc. But I'm hoping for something simpler, since these results are so bad as-is.
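For reference, the split/sort/merge workaround would look roughly like the sketch below. It assumes GNU coreutils (split -n l/4 is a GNU extension) and a plain sh or bash shell. The function name parallel_key_sort, the chunk file names, and the optional temp directory argument are made up for illustration. It is only meant to show the shape of the workaround, not a tuned solution, and it does not check whether an individual chunk sort failed.

```shell
#!/bin/sh
# Hypothetical helper (name and arguments are illustrative only):
# sort a comma-separated file on fields 1-6 by splitting it into 4
# line-aligned chunks, sorting the chunks in parallel, then merging.
# Assumes GNU coreutils for "split -n l/4".
parallel_key_sort() {
    in=$1
    out=$2
    tmp=${3:-/tmp}    # point this at an SSD-backed directory if available
    keys='-k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6'

    # Private working directory for chunks and sort's own temp files
    work=$(mktemp -d "$tmp/psort.XXXXXX") || return 1

    # Split into 4 chunks without breaking lines: chunk.aa .. chunk.ad
    split -n l/4 "$in" "$work/chunk." || return 1

    # Sort each chunk in the background using the same key definition.
    # $keys is intentionally unquoted so it expands to separate options.
    for f in "$work"/chunk.*; do
        sort -t ',' $keys -T "$work" -o "$f.sorted" "$f" &
    done
    wait    # block until all background sorts have finished

    # Merge the already-sorted chunks; -m merges without re-sorting
    sort -m -t ',' $keys -o "$out" "$work"/chunk.*.sorted
    rc=$?

    rm -rf "$work"
    return $rc
}
```

With the files from my sort statement, the call would be: parallel_key_sort a_in.csv a_out.csv /path/to/ssd/tmp (the last argument is a placeholder path). The chunk sorts and the merge have to use the same -t and -k options and the same locale, otherwise the merged output is not guaranteed to be in order.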
Any suggestions?
© Super User or respective owner