Embarrassingly parallel workflow creates too many output files
Posted by Hooked on Stack Overflow, 2012-11-28
On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and its output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N, however, I find that I am wasting storage space on file-creation overhead, and even simple commands like ls need extra care because the expanded glob exceeds the argument-length limit: -bash: /bin/ls: Argument list too long.
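For what it's worth, the error itself comes from the shell expanding a glob such as ls * into one argument per file before exec'ing /bin/ls, which blows past the argument-length limit once N is in the millions; listing the directory without a glob, or streaming names through find, sidesteps that. A small illustration, assuming the outputs sit in a hypothetical ./results directory:

    # fails: the shell expands the glob into >10^6 arguments before exec
    ls ./results/out_*

    # works: no glob expansion; the directory entries are read and streamed
    ls ./results | wc -l
    find ./results -maxdepth 1 -type f -name 'out_*' | wc -l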
Each computation has to be submitted through the qsub scheduler, so I cannot run a master program that simply aggregates the output into a single file. The naive solution of having every job append to one shared file fails when two jobs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from an embarrassingly parallel computation before it gets unmanageable?