Embarrassingly parallel workflow creates too many output files

Posted by Hooked on Stack Overflow, 2012-11-28

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes, and its output is a handful of lines. When N was small I could store each result in a separate file to be parsed later. With large N, however, I find that I am wasting storage space on per-file overhead, and simple commands like ls need extra care because the expanded glob overruns the kernel's argument-length limit: -bash: /bin/ls: Argument list too long.
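That error comes from the argument list built when the shell expands a glob such as ls out_*, so a minimal sketch of a day-to-day workaround is to let find walk the directory instead of expanding a glob (out_* is a hypothetical naming pattern for the per-job output files, not something from the original setup):

    # Count and sample the result files without building a huge argument list.
    # 'out_*' is a hypothetical per-job output naming pattern.
    find . -maxdepth 1 -name 'out_*' | wc -l           # how many results so far
    find . -maxdepth 1 -name 'out_*' -print | head -5  # peek at a few names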

Each computation must be submitted through the qsub scheduler, so I cannot run a master program that simply aggregates the output data into a single file. The obvious alternative of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
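One sketch of a workaround for the interleaving, not claimed as the definitive answer: serialize the appends with flock(1) so that jobs finishing simultaneously take turns. Here results.dat and results.lock are hypothetical file names, and whether the cluster's shared filesystem honors advisory locks (NFS in particular) is an assumption that needs checking:

    #!/bin/bash
    # At the end of each job: append its few lines of output to one shared
    # file, holding an exclusive advisory lock so simultaneous finishers
    # cannot interleave their writes.
    {
        flock -x 200                                 # block until we own the lock
        printf '%s\n' "$JOB_OUTPUT" >> results.dat   # $JOB_OUTPUT: this job's lines (hypothetical)
    } 200>results.lock

For output this small, a single write in append mode would rarely interleave on a local filesystem anyway, but the explicit lock makes the guarantee independent of write sizes and buffering.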

How can I collate the output data from this embarrassingly parallel computation before it becomes unmanageable?

