kenfar - Developer IT

Unix sort keys cause performance problems

- by KenFar

My data: It's a 71 MB file with 1.5 million rows. It has 6 fields All six fields combine to form a unique key - so that's what I need to sort on. Sort statement: sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o output.csv input.csv The problem: If I sort without keys, it takes 30 seconds. If I sort with keys, it takes 660 seconds. I need to sort with keys to keep this generic and useful for other files that have non-key fields as well. The 30 second timing is fine, but the 660 is a killer. More details using unix time: sort input.csv -o output.csv = 28 seconds sort -t ',' -k1 input.csv -o output.csv = 28 seconds sort -t ',' -k1,1 input.csv -o output.csv = 64 seconds sort -t ',' -k1,1 -k2,2 input.csv -o output.csv = 194 seconds sort -t ',' -k1,1 -k2,2 -k3,3 input.csv -o output.csv = 328 seconds sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 input.csv -o output.csv = 483 seconds sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 input.csv -o output.csv = 561 seconds sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 input.csv -o output.csv = 660 seconds I could theoretically move the temp directory to SSD, and/or split the file into 4 parts, sort them separately (in parallel) then merge the results, etc. But I'm hoping for something simpler since looks like sort is just picking a bad algorithm. Any suggestions? Testing Improvements using buffer-size: With 2 keys I got a 5% improvement with 8, 20, 24 MB and best performance of 8% improvement with 16MB, but 6% worse with 128MB With 6 keys I got a 5% improvement with 8, 20, 24 MB and best performance of 9% improvement with 16MB. Testing improvements using dictionary order (just 1 run each): sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o output.csv = 235 seconds (21% worse) sort -d --buffer-size=8M -t ',' -k1,1 -k2,2 input.csv -o ouput.csv = 232 seconds (21% worse) conclusion: it makes sense that this would slow the process down, not useful Testing with different file system on SSD - I can't do this on this server now. Testing with code to consolidate adjacent keys: def consolidate_keys(key_fields, key_types): """ Inputs: - key_fields - a list of numbers in quotes: ['1','2','3'] - key_types - a list of types of the key_fields: ['integer','string','integer'] Outputs: - key_fields - a consolidated list: ['1,2','3'] - key_types - a list of types of the consolidated list: ['string','integer'] """ assert(len(key_fields) == len(key_types)) def get_min(val): vals = val.split(',') assert(len(vals) <= 2) return vals[0] def get_max(val): vals = val.split(',') assert(len(vals) <= 2) return vals[len(vals)-1] i = 0 while True: try: if ( (int(get_max(key_fields[i])) + 1) == int(key_fields[i+1]) and key_types[i] == key_types[i+1]): key_fields[i] = '%s,%s' % (get_min(key_fields[i]), key_fields[i+1]) key_types[i] = key_types[i] key_fields.pop(i+1) key_types.pop(i+1) continue i = i+1 except IndexError: break # last entry return key_fields, key_types While this code is just a work-around that'll only apply to cases in which I've got a contiguous set of keys - it speeds up the code by 95% in my worst case scenario.

Read the article

File Sync Solution for Batch Processing (ETL)

- by KenFar

I'm looking for a slightly different kind of sync utility - not one designed to keep two directories identical, but rather one intended to keep files flowing from one host to another. The context is a data warehouse that currently has a custom-developed solution that moves 10,000 files a day, some of which are 1+ gbytes gzipped files, between linux servers via ssh. Files are produced by the extract process, then moved to the transform server where a transform daemon is waiting to pick them up. The same process happens between transform & load. Once the files are moved they are typically archived on the source for a week, and the downstream process likewise moves them to temp then archive as it consumes them. So, my requirements & desires: It is never used to refresh updated files - only used to deliver new files. Because it's delivering files to downstream processes - it needs to rename the file once done so that a partial file doesn't get picked up. In order to simplify recovery, it should keep a copy of the source files - but rename them or move them to another directory. If the transfer fails (network down, file system full, permissions, file locked, etc), then it should retry periodically - and never fail in a non-recoverable way, or a way that sends the file twice or never sends the file. Should be able to copy files to 2+ destinations. Should have a consolidated log so that it's easy to find problems Should have an optional checksum feature Any recommendations? Can Unison do this well?

Read the article

Unix sort 10x slower with keys specified

- by KenFar

My data: It's a 71 MB file with 1.5 million rows. It has 6 fields, four of which are strings of avg. 15 characters, two are integers. Three of the fields are sometimes empty. All six fields combine to form a unique key - and that's what I need to sort on. Sort statement: sort -t ',' -k1,1 -k2,2 -k3,3 -k4,4 -k5,5 -k6,6 -o a_out.csv a_in.csv The problem: If I sort without keys, it takes 30 seconds. If I sort with keys, it takes 660 seconds. I need to sort with keys to keep this generic and useful for other files that have non-key fields as well. The 30 second timing is fine, but the 660 is a killer. I could theoretically move the temp directory to SSD, and/or split the file into 4 parts, sort them separately (in parallel) then merge the results, etc. But I'm hoping for something simpler since these results are so bad as-is. Any suggestions?

Read the article

Best way to test instance methods without running init

- by KenFar

I've got a simple class that gets most of its arguments via init, which also runs a variety of private methods that do most of the work. Output is available either through access to object variables or public methods. Here's the problem - I'd like my unittest framework to directly call the private methods called by init with different data - without going through init. What's the best way to do this? So far, I've been refactoring these classes so that init does less and data is passed in separately. This makes testing easy, but I think the usability of the class suffers a little. EDIT: Example solution based on Ignacio's answer: import types class C(object): def __init__(self, number): new_number = self._foo(number) self._bar(new_number) def _foo(self, number): return number * 2 def _bar(self, number): print number * 10 #--- normal execution - should print 160: ------- MyC = C(8) #--- testing execution - should print 80 -------- MyC = object.__new__(C) MyC._bar(8)

Read the article

Best way to test class methods without running init

- by KenFar

I've got a simple class that gets most of its arguments via init, which also runs a variety of private methods that do most of the work. Output is available either through access to object variables or public methods. Here's the problem - I'd like my unittest framework to directly call the private methods called by init with different data - without going through init. What's the best way to do this? So far, I've been refactoring these classes so that init does less and data is passed in separately. This makes testing easy, but I think the usability of the class suffers a little.

Developer IT