Unix sort 20x slower with keys specified
- by KenFar
My data:
It's a 71 MB file with 1.5 million rows.
It has 6 fields: four are strings averaging about 15 characters, and
two are integers. Three of the fields are sometimes empty. All six
fields combine to form a unique key, and that's what I need to sort
on.
Sort statement:
sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o a_out.csv a_in.csv
The problem:
If I sort without keys, it takes 30 seconds.
If I sort with keys, it takes 660 seconds.
I need to sort with keys to keep this generic and usable for other files that also have non-key fields. The 30-second timing is fine, but 660 seconds is a killer.
I could theoretically move the temp directory to an SSD, and/or split the file into 4 parts, sort them separately (in parallel), then merge the results, etc. But I'm hoping for something simpler, since these results are so bad as-is.
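For reference, the split/sort/merge idea might look roughly like the sketch below. It assumes GNU coreutils (`split -n l/4` is a GNU extension), and the function names `keyed_sort` and `parallel_keyed_sort` are made up for illustration. I haven't timed it against the real file.

```shell
# Hypothetical helper: keeps the key options from the sort statement in one place.
keyed_sort() {
    sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 "$@"
}

# Hypothetical function: split the input into 4 chunks, sort the chunks
# in parallel, then merge the sorted chunks into the output file.
parallel_keyed_sort() {
    src=$1
    dst=$2
    tmp=$(mktemp -d) || return 1

    # GNU split: 4 chunks, never breaking a line across two chunks.
    split -n l/4 "$src" "$tmp/chunk." || return 1

    # Sort each chunk as a background job, then wait for all of them.
    for f in "$tmp"/chunk.*; do
        keyed_sort -o "$f.sorted" "$f" &
    done
    wait

    # -m merges inputs that are already sorted, without re-sorting them.
    keyed_sort -m -o "$dst" "$tmp"/chunk.*.sorted
    status=$?

    rm -rf "$tmp"
    return "$status"
}
```

Usage would be `parallel_keyed_sort a_in.csv a_out.csv`. The merge step has to use the same `-t` and `-k` options as the chunk sorts, otherwise the merged output is not guaranteed to be in key order.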
Any suggestions?