[UNIX] Sort lines of massive file by number of words on line (ideally in parallel)
- by conradlee
I am working on a community detection algorithm for analyzing social network data from Facebook. The first task, detecting all cliques in the graph, can be done efficiently in parallel, and leaves me with an output like this:
17118 17136 17392
17064 17093 17376
17118 17136 17356 17318 12345
17118 17136 17356 17283
17007 17059 17116
Each of these lines represents a unique clique (a collection of node ids), and I want to sort them in descending order by the number of ids per line. For the example above, the output should look like this:
17118 17136 17356 17318 12345
17118 17136 17356 17283
17118 17136 17392
17064 17093 17376
17007 17059 17116
(Ties, i.e., lines with the same number of ids, can be ordered arbitrarily.)
What is the most efficient way to sort these lines?
Keep the following points in mind:
- The file I want to sort could be larger than the physical memory of the machine.
- Most of the machines I'm running this on have several processors, so a parallel solution would be ideal.
- An ideal solution would be just a shell script (probably using sort), but I'm open to simple solutions in Python or Perl (or any language, as long as it makes the task simple).

In some sense this task is very easy; I'm not looking for just any old solution, but for one that is simple and, above all, efficient.
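
To make concrete the kind of thing I'm after, here is a naive decorate-sort-undecorate sketch I've considered (it assumes GNU sort and awk; cliques.txt and sorted_cliques.txt are just placeholder filenames). I suspect something more efficient is possible:

# Prefix each line with its word count (awk's NF), sort numerically in
# descending order on that count, then strip the count column again.
# --parallel and -S are GNU sort extensions; GNU sort spills to temporary
# files on disk, so the input may exceed physical memory.
awk '{ print NF, $0 }' cliques.txt \
  | sort --parallel="$(nproc)" -S 2G -k1,1nr \
  | cut -d' ' -f2- > sorted_cliques.txt

The decoration and stripping passes are sequential, though, and I'm not sure how well sort's merge phase actually parallelizes, which is why I'm asking.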