Parallelism in .NET – Part 11, Divide and Conquer via Parallel.Invoke
- by Reed
Many algorithms are easily written to work via recursion. For example, most data-oriented tasks where a tree of data must be processed are much more easily handled by starting at the root, and recursively “walking” the tree. Some algorithms work this way on flat data structures, such as arrays, as well. This is a form of divide and conquer: an algorithm design which is based around breaking up a set of work recursively, “dividing” the total work in each recursive step, and “conquering” the work when the remaining work is small enough to be solved easily.
Recursive algorithms, especially ones based on a form of divide and conquer, are often a very good candidate for parallelization.
This is apparent from a common sense standpoint. Since we’re dividing up the total work in the algorithm, we have an obvious, built-in partitioning scheme. Once partitioned, the data can be worked upon independently, so there is good, clean isolation of data.
Implementing this type of algorithm is fairly simple. The Parallel class in .NET 4 includes a method suited to this type of operation: Parallel.Invoke. This method takes any number of Action delegates and executes them, potentially in parallel; it returns only when every delegate has completed:
Parallel.Invoke(
    () =>
    {
        Console.WriteLine("Action 1 executing in thread {0}",
            Thread.CurrentThread.ManagedThreadId);
    },
    () =>
    {
        Console.WriteLine("Action 2 executing in thread {0}",
            Thread.CurrentThread.ManagedThreadId);
    },
    () =>
    {
        Console.WriteLine("Action 3 executing in thread {0}",
            Thread.CurrentThread.ManagedThreadId);
    }
);
Running this simple example demonstrates how easy this method is to use. On my system, for example, the code above prints three separate thread IDs. By letting any number of delegates execute concurrently, Parallel.Invoke gives us an easy way to parallelize any algorithm based on divide and conquer: we can divide our work in each step and execute each piece in parallel, recursively.
For example, suppose we wanted to implement our own quicksort routine. The quicksort algorithm can be designed based on divide and conquer. In each iteration, we pick a pivot point, and use that to partition the total array. We swap the elements around the pivot, then recursively sort the lists on each side of the pivot.
For example, let’s look at this simple, sequential implementation of quicksort:
public static void QuickSort<T>(T[] array) where T : IComparable<T>
{
    QuickSortInternal(array, 0, array.Length - 1);
}

private static void QuickSortInternal<T>(T[] array, int left, int right)
    where T : IComparable<T>
{
    if (left >= right)
    {
        return;
    }

    // Choose the middle element as the pivot, and move it to the front.
    SwapElements(array, left, (left + right) / 2);

    // Partition: move every element smaller than the pivot to the left side.
    int last = left;
    for (int current = left + 1; current <= right; ++current)
    {
        if (array[current].CompareTo(array[left]) < 0)
        {
            ++last;
            SwapElements(array, last, current);
        }
    }

    // Move the pivot into its final position, then sort each side.
    SwapElements(array, left, last);
    QuickSortInternal(array, left, last - 1);
    QuickSortInternal(array, last + 1, right);
}

static void SwapElements<T>(T[] array, int i, int j)
{
    T temp = array[i];
    array[i] = array[j];
    array[j] = temp;
}
Here, we implement the quicksort algorithm in a very common divide and conquer style. Running this against the built-in Array.Sort routine shows that we get exactly the same answers (although the framework’s sort routine is somewhat faster). On my system, for example, the framework’s sort handles ten million random doubles in about 7.3s, while this implementation takes about 9.3s on average.
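Timing comparisons like these are straightforward to reproduce with System.Diagnostics.Stopwatch. A minimal harness along the following lines (the SortBenchmark name, array size, and seed are illustrative choices, not from the original post) sorts a fresh copy of the same random data for each routine under test:

```csharp
using System;
using System.Diagnostics;

public static class SortBenchmark
{
    // Generates 'count' random doubles; a fixed seed keeps runs repeatable.
    public static double[] RandomDoubles(int count, int seed)
    {
        var rng = new Random(seed);
        var data = new double[count];
        for (int i = 0; i < data.Length; ++i)
        {
            data[i] = rng.NextDouble();
        }
        return data;
    }

    // Times a sort over a fresh copy of the input, so every routine
    // under test sees identical, unsorted data.
    public static TimeSpan Time(double[] source, Action<double[]> sort)
    {
        var copy = (double[])source.Clone();
        var stopwatch = Stopwatch.StartNew();
        sort(copy);
        stopwatch.Stop();
        return stopwatch.Elapsed;
    }

    public static void RunBenchmark()
    {
        var data = RandomDoubles(10000000, 42);
        Console.WriteLine("Array.Sort: {0}", Time(data, Array.Sort));
        // The QuickSort implementation above can be timed the same way:
        // Console.WriteLine("QuickSort:  {0}", Time(data, QuickSort));
    }
}
```

Cloning the input inside Time is what keeps the comparison fair: sorting an already-sorted array would measure a very different workload.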
Looking at this routine, though, there is a clear opportunity to parallelize. At the end of QuickSortInternal, we recursively call into QuickSortInternal with each partition of the array after the pivot is chosen. This can be rewritten to use Parallel.Invoke by simply changing it to:
// Code above is unchanged...

    SwapElements(array, left, last);

    Parallel.Invoke(
        () => QuickSortInternal(array, left, last - 1),
        () => QuickSortInternal(array, last + 1, right));
}
This routine will now run in parallel. When executing, we now see the CPU usage across all cores spike while it executes.
However, there is a significant problem here – by parallelizing this routine, we took it from an execution time of 9.3s to an execution time of approximately 14 seconds! We’re using more resources as seen in the CPU usage, but the overall result is a dramatic slowdown in overall processing time.
This occurs because parallelization adds overhead: each time we split the array, we spawn two new tasks. That is far, far too many tasks for our cores to execute at any one time. In effect, we’re “over-parallelizing” the routine. This is a common problem when working with divide and conquer algorithms, and it leads to an important observation:
When parallelizing a recursive routine, take special care not to add more tasks than necessary to fully utilize your system.
This can be done in a few different ways. Typically, the way to handle it is to stop parallelizing the routine at a certain point and revert to the serial approach. Since the first few levels of recursion will still be parallelized, our “deeper” recursive calls run inside tasks that are already executing in parallel, so they can still take full advantage of the machine. This also dramatically reduces the overhead added by parallelizing, since we only pay that overhead for the first few recursive calls.
There are two basic approaches we can take here. The first is to look at the size of the remaining work and, if it’s smaller than a specific threshold, revert to the serial implementation. In this case, we could simply check right - left, and if it’s under the threshold, call the methods directly instead of using Parallel.Invoke.
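The article does not show this first approach, so here is a sketch of a complete threshold-based version. The ThresholdQuickSort name and the 4096-element cutoff are illustrative choices of mine, not from the original post; the right cutoff should be found by measurement. The partitioning logic is the same as the serial version above:

```csharp
using System;
using System.Threading.Tasks;

public static class ThresholdQuickSort
{
    // Illustrative cutoff: partitions smaller than this are sorted serially.
    private const int SequentialThreshold = 4096;

    public static void QuickSort<T>(T[] array) where T : IComparable<T>
    {
        QuickSortInternal(array, 0, array.Length - 1);
    }

    private static void QuickSortInternal<T>(T[] array, int left, int right)
        where T : IComparable<T>
    {
        if (left >= right)
        {
            return;
        }

        // Same pivot selection and partitioning as the serial version.
        SwapElements(array, left, (left + right) / 2);
        int last = left;
        for (int current = left + 1; current <= right; ++current)
        {
            if (array[current].CompareTo(array[left]) < 0)
            {
                ++last;
                SwapElements(array, last, current);
            }
        }
        SwapElements(array, left, last);

        if (right - left < SequentialThreshold)
        {
            // Small partition: plain recursion avoids per-task overhead.
            QuickSortInternal(array, left, last - 1);
            QuickSortInternal(array, last + 1, right);
        }
        else
        {
            // Large partition: sorting each side in parallel is worthwhile.
            Parallel.Invoke(
                () => QuickSortInternal(array, left, last - 1),
                () => QuickSortInternal(array, last + 1, right));
        }
    }

    private static void SwapElements<T>(T[] array, int i, int j)
    {
        T temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    }
}
```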
The second approach is to track how “deep” in the “tree” we currently are, and stop parallelizing once we pass a certain number of levels. This is the more general-purpose approach, since it works for routines that walk trees as well as routines operating on a single array, though it may fare worse if a poor partitioning strategy is chosen or the tree is unevenly balanced.
This can be written very easily. If we pass a maxDepth parameter into our internal routine, we can restrict the amount of times we parallelize by changing the recursive call to:
// Code above is unchanged...

    SwapElements(array, left, last);

    if (maxDepth < 1)
    {
        QuickSortInternal(array, left, last - 1, maxDepth);
        QuickSortInternal(array, last + 1, right, maxDepth);
    }
    else
    {
        --maxDepth;
        Parallel.Invoke(
            () => QuickSortInternal(array, left, last - 1, maxDepth),
            () => QuickSortInternal(array, last + 1, right, maxDepth));
    }
}
We no longer allow the sort to parallelize indefinitely – only to a specific depth, after which we revert to a serial implementation. Each level of parallel recursion doubles the number of tasks, so even a small depth budget keeps every core supplied with work. By starting the routine with a maxDepth equal to Environment.ProcessorCount, we restrict the total number of parallel operations significantly, while still providing adequate work for each processing core.
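For reference, here is the complete depth-limited routine assembled from the pieces above. The ParallelQuickSort class name is mine; the method bodies follow the article’s code, with the public entry point seeding maxDepth as just described:

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelQuickSort
{
    public static void QuickSort<T>(T[] array) where T : IComparable<T>
    {
        // Seed the depth budget: parallelize only the first few levels
        // of recursion, then fall back to plain serial recursion.
        QuickSortInternal(array, 0, array.Length - 1,
            Environment.ProcessorCount);
    }

    private static void QuickSortInternal<T>(T[] array, int left, int right,
        int maxDepth)
        where T : IComparable<T>
    {
        if (left >= right)
        {
            return;
        }

        // Same pivot selection and partitioning as the serial version.
        SwapElements(array, left, (left + right) / 2);
        int last = left;
        for (int current = left + 1; current <= right; ++current)
        {
            if (array[current].CompareTo(array[left]) < 0)
            {
                ++last;
                SwapElements(array, last, current);
            }
        }
        SwapElements(array, left, last);

        if (maxDepth < 1)
        {
            // Depth budget exhausted: recurse serially from here on.
            QuickSortInternal(array, left, last - 1, maxDepth);
            QuickSortInternal(array, last + 1, right, maxDepth);
        }
        else
        {
            --maxDepth;
            Parallel.Invoke(
                () => QuickSortInternal(array, left, last - 1, maxDepth),
                () => QuickSortInternal(array, last + 1, right, maxDepth));
        }
    }

    private static void SwapElements<T>(T[] array, int i, int j)
    {
        T temp = array[i];
        array[i] = array[j];
        array[j] = temp;
    }
}
```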
With this final change, my timings are much better. On average, I get the following timings:
Framework via Array.Sort: 7.3 seconds
Serial Quicksort Implementation: 9.3 seconds
Naive Parallel Implementation: 14 seconds
Parallel Implementation Restricting Depth: 4.7 seconds
Finally, we are now faster than the framework’s Array.Sort implementation.