Unix sort 10x slower with keys specified

Posted by KenFar on Super User
Published on 2012-06-14T19:49:49Z Indexed on 2012/06/15 15:19 UTC

My data:

  • It's a 71 MB file with 1.5 million rows.
  • It has 6 fields: four are strings averaging 15 characters, and two are integers. Three of the fields are sometimes empty. All six fields together form a unique key, and that's what I need to sort on.

Sort statement:

sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o a_out.csv a_in.csv

The problem:

  • If I sort without keys, it takes 30 seconds.
  • If I sort with keys, it takes 660 seconds.
  • I need to sort with keys to keep this generic and useful for other files that have non-key fields as well. The 30-second timing is fine, but 660 seconds is a killer.

In theory I could move the temp directory to an SSD, and/or split the file into 4 parts, sort them separately (in parallel), and then merge the results. But I'm hoping for something simpler, since these results are so bad as-is.
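
For reference, the split / parallel sort / merge fallback could be sketched roughly like this. It is only a sketch under assumptions: the chunk count of 4, the temporary directory layout, and the tiny stand-in input (written only if a_in.csv does not already exist) are all illustrative, and `mktemp -d` is assumed to be available.

```shell
#!/bin/sh
# Sketch: split the CSV into ~4 chunks, sort the chunks in parallel
# with the same keys, then merge the sorted chunks with `sort -m`.
set -eu

in=a_in.csv
out=a_out.csv
keys='-k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6'

# Stand-in data so the sketch runs on its own (assumption, not the real file).
[ -f "$in" ] || printf '%s\n' 'b,2,x,,1,9' 'a,1,y,,2,8' 'c,3,z,,3,7' 'a,1,x,,2,8' > "$in"

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Split on line boundaries into about 4 pieces.
lines=$(wc -l < "$in")
per=$(( (lines + 3) / 4 ))
split -l "$per" "$in" "$tmp/chunk."

# Sort each chunk in the background; $keys is left unquoted on purpose
# so that it expands into separate -k options.
pids=''
for f in "$tmp"/chunk.*; do
    sort -t ',' $keys -o "$f.sorted" "$f" &
    pids="$pids $!"
done

# Wait for every chunk; with `set -e` a failed sort aborts the script.
for p in $pids; do
    wait "$p"
done

# Merge the already-sorted chunks; -m merges without re-sorting.
sort -m -t ',' $keys -o "$out" "$tmp"/chunk.*.sorted
```

Each chunk still pays the full cost of the keyed comparison, so this only spreads the work across cores rather than removing it. With GNU sort, the `--parallel` and `-T` options (thread count and temp directory) may achieve something similar without the manual split.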

Any suggestions?

© Super User or respective owner