Unix sort keys cause performance problems

Posted by KenFar on Super User, 2012-06-14.

My data:

  • It's a 71 MB file with 1.5 million rows.
  • It has 6 fields.
  • All six fields combine to form a unique key, so that's what I need to sort on.

Sort statement:

sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o output.csv input.csv

The problem:

  • If I sort without keys, it takes 30 seconds.
  • If I sort with keys, it takes 660 seconds.
  • I need to sort with keys to keep this generic and useful for other files that have non-key fields as well. The 30-second timing is fine, but 660 seconds is a killer.

More details using unix time:

  • sort input.csv -o output.csv = 28 seconds
  • sort -t ',' -k1 input.csv -o output.csv = 28 seconds
  • sort -t ',' -k1,1 input.csv -o output.csv = 64 seconds
  • sort -t ',' -k1,1 -k2,2 input.csv -o output.csv = 194 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 input.csv -o output.csv = 328 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 input.csv -o output.csv = 483 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 input.csv -o output.csv = 561 seconds
  • sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 input.csv -o output.csv = 660 seconds
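For what it's worth, these runs could be scripted along these lines (just a rough sketch, assuming Python 3 and GNU sort on the PATH, with the same input.csv and output.csv as above):

import subprocess
import time

KEYS = ['-k1,1', '-k2,2', '-k3,3', '-k4,4', '-k5,5', '-k6,6']

def time_sort(extra_args):
    """Run sort with the given extra arguments and return elapsed seconds."""
    cmd = ['sort', '-t', ','] + extra_args + ['input.csv', '-o', 'output.csv']
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start

# Baseline with no keys, then add one key at a time, as in the timings above.
print('no keys: %.0f seconds' % time_sort([]))
for n in range(1, len(KEYS) + 1):
    print('%d key(s): %.0f seconds' % (n, time_sort(KEYS[:n])))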

I could theoretically move the temp directory to SSD, and/or split the file into 4 parts, sort them separately (in parallel), then merge the results. But I'm hoping for something simpler, since it looks like sort is just picking a bad algorithm.
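For reference, here's a minimal sketch of that split-and-merge idea (an assumption on my part, not something I've run: it needs Python 3 and reasonably recent GNU coreutils, so that split has --number and sort has -m; the chunk count of 4 and the file names are placeholders). Newer GNU sort also has a --parallel option, which might be a simpler route if it's available.

import glob
import subprocess

KEYS = ['-k%d,%d' % (i, i) for i in range(1, 7)]

# 1. Split input.csv into 4 chunks without breaking lines (GNU split).
#    Assumes no leftover chunk_* files from a previous run.
subprocess.run(['split', '--number=l/4', 'input.csv', 'chunk_'], check=True)
chunks = sorted(glob.glob('chunk_*'))

# 2. Sort the chunks in parallel, one sort process per chunk.
procs = [subprocess.Popen(['sort', '-t', ','] + KEYS + ['-o', c + '.sorted', c])
         for c in chunks]
for p in procs:
    if p.wait() != 0:
        raise RuntimeError('chunk sort failed')

# 3. Merge the already-sorted chunks; -m merges without re-sorting.
subprocess.run(['sort', '-m', '-t', ','] + KEYS + ['-o', 'output.csv']
               + [c + '.sorted' for c in chunks], check=True)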

Any suggestions?

Testing improvements using --buffer-size:

  • With 2 keys, I got a 5% improvement with 8, 20, and 24 MB, with the best result an 8% improvement at 16 MB, but 128 MB was 6% worse.
  • With 6 keys, I got a 5% improvement with 8, 20, and 24 MB, and the best result was a 9% improvement at 16 MB (a sketch of this kind of sweep follows the list).
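The buffer-size sweep can be scripted the same way (again just a sketch, with the same assumptions as the harness above):

import subprocess
import time

KEYS = ['-k%d,%d' % (i, i) for i in range(1, 7)]

# Sweep --buffer-size over the values tested above, using the full six-key sort.
for size in ['8M', '16M', '20M', '24M', '128M']:
    cmd = (['sort', '--buffer-size=' + size, '-t', ','] + KEYS
           + ['input.csv', '-o', 'output.csv'])
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    print('%s: %.0f seconds' % (size, time.monotonic() - start))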

Testing improvements using dictionary order (just 1 run each):

  • sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o output.csv = 235 seconds (21% worse)
  • sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o output.csv = 232 seconds (21% worse)
  • Conclusion: it makes sense that this would slow the process down; not useful.

Testing with a different file system on SSD: I can't do this on this server right now.

Testing with code to consolidate adjacent keys:

def consolidate_keys(key_fields, key_types):
    """ Inputs:
             - key_fields - a list of field numbers in quotes: ['1','2','3']
             - key_types - a list of types of the key_fields: ['integer','integer','string']
        Outputs:
             - key_fields - a consolidated list:  ['1,2','3']
             - key_types - a list of types of the consolidated list: ['integer','string']
    """
    assert len(key_fields) == len(key_types)

    def get_min(val):
        vals = val.split(',')
        assert len(vals) <= 2
        return vals[0]

    def get_max(val):
        vals = val.split(',')
        assert len(vals) <= 2
        return vals[-1]

    i = 0
    while True:
        try:
            # Merge entry i with entry i+1 when the fields are numerically
            # adjacent and share the same type.
            if ((int(get_max(key_fields[i])) + 1) == int(key_fields[i + 1])
                    and key_types[i] == key_types[i + 1]):
                key_fields[i] = '%s,%s' % (get_min(key_fields[i]), key_fields[i + 1])
                key_fields.pop(i + 1)
                key_types.pop(i + 1)
                continue
            i = i + 1
        except IndexError:
            break  # last entry

    return key_fields, key_types
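To make the intent concrete, a quick usage sketch (the field numbers are the six keys from the sort statement above, and the 'string' types are just placeholders):

fields, types = consolidate_keys(['1', '2', '3', '4', '5', '6'], ['string'] * 6)
print(fields)   # ['1,6']
print(types)    # ['string']

# which would then collapse the six -k options down to a single key, e.g.:
#   sort -t ',' -k1,6 input.csv -o output.csv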

While this code is just a workaround that only applies to cases in which I've got a contiguous set of keys, it speeds up the sort by 95% in my worst-case scenario.
