Unix sort keys cause performance problems

Posted by KenFar on Super User See other posts from Super User or by KenFar
Published on 2012-06-14T19:49:49Z Indexed on 2012/06/16 21:19 UTC
Read the original article Hit count: 735

Filed under:
|

My data:

  • It's a 71 MB file with 1.5 million rows.
  • It has 6 fields
  • All six fields combine to form a unique key - so that's what I need to sort on.

Sort statement:

sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o output.csv input.csv

The problem:

  • If I sort without keys, it takes 30 seconds.
  • If I sort with keys, it takes 660 seconds.
  • I need to sort with keys to keep this generic and useful for other files that have non-key fields as well. The 30 second timing is fine, but the 660 is a killer.

More details using unix time:

  • sort input.csv -o output.csv = 28 seconds
  • sort -t ',' -k1 input.csv -o output.csv = 28 seconds
  • sort -t ',' -k1,1 input.csv -o output.csv = 64 seconds
  • sort -t ',' -k1,1 -k2,2 input.csv -o output.csv = 194 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 input.csv -o output.csv = 328 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 input.csv -o output.csv = 483 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 input.csv -o output.csv = 561 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 input.csv -o output.csv = 660 seconds

I could theoretically move the temp directory to SSD, and/or split the file into 4 parts, sort them separately (in parallel) then merge the results, etc. But I'm hoping for something simpler since looks like sort is just picking a bad algorithm.

Any suggestions?

Testing Improvements using buffer-size:

  • With 2 keys I got a 5% improvement with 8, 20, 24 MB and best performance of 8% improvement with 16MB, but 6% worse with 128MB
  • With 6 keys I got a 5% improvement with 8, 20, 24 MB and best performance of 9% improvement with 16MB.

Testing improvements using dictionary order (just 1 run each):

  • sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o output.csv = 235 seconds (21% worse)
  • sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o ouput.csv = 232 seconds (21% worse)
  • conclusion: it makes sense that this would slow the process down, not useful

Testing with different file system on SSD - I can't do this on this server now.

Testing with code to consolidate adjacent keys:

def consolidate_keys(key_fields, key_types):
""" Inputs:
         - key_fields - a list of numbers in quotes: ['1','2','3']
         - key_types - a list of types of the key_fields: ['integer','string','integer']
    Outputs:
         - key_fields - a consolidated list:  ['1,2','3']
         - key_types - a list of types of the consolidated list: ['string','integer']
"""
assert(len(key_fields) == len(key_types))

def get_min(val):
    vals = val.split(',')
    assert(len(vals) <= 2)
    return vals[0]

def get_max(val):
    vals = val.split(',')
    assert(len(vals) <= 2)
    return vals[len(vals)-1]

i = 0
while True:
    try:
        if ( (int(get_max(key_fields[i])) + 1) == int(key_fields[i+1])
        and  key_types[i] == key_types[i+1]):
                key_fields[i] = '%s,%s' % (get_min(key_fields[i]), key_fields[i+1])
                key_types[i]  = key_types[i]
                key_fields.pop(i+1)
                key_types.pop(i+1)
                continue
        i = i+1
    except IndexError:
        break  # last entry

return key_fields, key_types

While this code is just a work-around that'll only apply to cases in which I've got a contiguous set of keys - it speeds up the code by 95% in my worst case scenario.

© Super User or respective owner

Related posts about unix

Related posts about sorting