My data:
It's a 71 MB file with 1.5 million rows.
It has 6 fields.
All six fields combine to form a unique key, so that's what I need to sort on.
Sort statement:
sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o output.csv input.csv
The problem:
If I sort without keys, it takes 30 seconds.
If I sort with keys, it takes 660 seconds.
I need to sort with keys to keep this generic and useful for other files that also have non-key fields. The 30-second timing is fine, but 660 seconds is a killer.
More details, timed with the Unix time command:
sort input.csv -o output.csv = 28 seconds
sort -t ',' -k1 input.csv -o output.csv = 28 seconds
sort -t ',' -k1,1 input.csv -o output.csv = 64 seconds
sort -t ',' -k1,1 -k2,2 input.csv -o output.csv = 194 seconds
sort -t ',' -k1,1 -k2,2 -k3,3 input.csv -o output.csv = 328 seconds
sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 input.csv -o output.csv = 483 seconds
sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 input.csv -o output.csv = 561 seconds
sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 input.csv -o output.csv = 660 seconds
I could theoretically move the temp directory to SSD, and/or split the file into 4 parts, sort them separately (in parallel), then merge the results (sketched below), etc. But I'm hoping for something simpler, since it looks like sort is just picking a bad algorithm.
Any suggestions?
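For reference, here is a rough sketch of the split/sort/merge workaround mentioned above, assuming GNU split's -n l/4 chunking and using placeholder chunk_ filenames:

# split into 4 pieces along line boundaries, sort each piece in the background
# with the same keys, then merge the pre-sorted pieces with sort -m
split -n l/4 input.csv chunk_
for f in chunk_a?; do
    sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 "$f" -o "$f.sorted" &
done
wait
sort -m -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o output.csv chunk_a?.sorted
rm chunk_a? chunk_a?.sorted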
Testing improvements using buffer-size:
With 2 keys I got a 5% improvement with buffer sizes of 8, 20, and 24 MB, and the best result, an 8% improvement, with 16 MB; it was 6% worse with 128 MB.
With 6 keys I got a 5% improvement with 8, 20, and 24 MB, and the best result, a 9% improvement, with 16 MB.
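These runs used sort's --buffer-size option; assuming the same command shape as above, the best 6-key configuration looked something like:

sort --buffer-size=16M -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 input.csv -o output.csv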
Testing improvements using dictionary order (just 1 run each):
sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o output.csv = 235 seconds (21% worse)
sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o output.csv = 232 seconds (21% worse)
Conclusion: it makes sense that this would slow the process down; not useful.
Testing with a different file system on SSD: I can't do this on this server right now.
Testing with code to consolidate adjacent keys:
def consolidate_keys(key_fields, key_types):
    """ Inputs:
        - key_fields - a list of field numbers as strings: ['1','2','3']
        - key_types  - a list of types of the key_fields: ['integer','integer','string']
        Outputs:
        - key_fields - a consolidated list: ['1,2','3']
        - key_types  - a list of types of the consolidated list: ['integer','string']
    """
    assert len(key_fields) == len(key_types)

    def get_min(val):
        # first field number of a 'start,end' range (or the lone field)
        vals = val.split(',')
        assert len(vals) <= 2
        return vals[0]

    def get_max(val):
        # last field number of a 'start,end' range (or the lone field)
        vals = val.split(',')
        assert len(vals) <= 2
        return vals[-1]

    i = 0
    while True:
        try:
            # merge entry i+1 into entry i when its field is adjacent and the same type
            if (int(get_max(key_fields[i])) + 1 == int(key_fields[i+1])
                    and key_types[i] == key_types[i+1]):
                key_fields[i] = '%s,%s' % (get_min(key_fields[i]), key_fields[i+1])
                key_fields.pop(i+1)
                key_types.pop(i+1)
                continue
            i += 1
        except IndexError:
            break  # ran off the end of the list - done
    return key_fields, key_types
While this code is just a work-around that only applies when I have a contiguous set of keys, it speeds up the sort by 95% in my worst-case scenario.
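Concretely, for the file above, where all six fields are adjacent and treated as the same type, consolidate_keys collapses ['1','2','3','4','5','6'] to ['1,6'], so the generated command becomes a single key span, something like:

sort -t ',' -k1,6 input.csv -o output.csv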