So I was reading Peter Kellner's blog entry on ReSharper 5.0 and its LINQ refactoring and thought that was very cool. But it raised a point I had always been curious about: which is the better choice, manual foreach loops or LINQ?
The answer is not really clear-cut. There are two sides to any code-cost argument: performance and maintainability. The first of these is obvious and quantifiable: given any two pieces of code that perform the same function, you can run them side by side and see which performs better.
Unfortunately, this is not always a good measure. Well-written assembly language outperforms well-written C++ code, but you lose a lot in maintainability, which creates a big technical-debt load that is hard to offset as the application ages. In contrast, higher-level constructs make the code briefer and easier to understand, reducing that technical cost.
Now, obviously in this case we're not talking about two separate languages; we're comparing doing something manually in the language versus using the higher-order IEnumerable extension methods that live in System.Linq.
Well, before we discuss this any further, let's look at some sample code and the numbers. First, let's take a look at the foreach loop and the LINQ expression. This is just a simple find comparison:
// find implemented via LINQ
public static bool FindViaLinq(IEnumerable<int> list, int target)
{
    return list.Any(item => item == target);
}

// find implemented via standard iteration
public static bool FindViaIteration(IEnumerable<int> list, int target)
{
    foreach (var i in list)
    {
        if (i == target)
        {
            return true;
        }
    }

    return false;
}
Okay, looking at this from a maintainability point of view, the LINQ expression is definitely more concise (8 lines down to 1) and its intent is very readable. You don't have to actually analyze the behavior of the loop to determine what it's doing.
So let's take a look at performance metrics from 100,000 iterations of these methods on a List<int> of varying sizes filled with random data. For this test, we fill a target array with 100,000 random integers and then run the exact same pseudo-random targets through both searches.
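The actual timing harness isn't shown here, but it boils down to a Stopwatch loop along these lines -- a minimal sketch, assuming System.Diagnostics.Stopwatch and a fixed random seed; the method name and console output are placeholders, not the original test code:

// rough sketch of the timing harness (assumed, not the original test code)
public static void RunFindBenchmark(int size)
{
    const int iterations = 100000;
    var random = new Random(42);           // fixed seed so both methods see identical data

    var list = new List<int>(size);
    for (int i = 0; i < size; i++)
    {
        list.Add(random.Next());
    }

    var targets = new int[iterations];     // the same pseudo-random targets for both searches
    for (int i = 0; i < iterations; i++)
    {
        targets[i] = random.Next();
    }

    var watch = Stopwatch.StartNew();
    foreach (var target in targets)
    {
        FindViaLinq(list, target);
    }
    watch.Stop();
    Console.WriteLine("Any:       " + watch.ElapsedMilliseconds + " ms");

    watch = Stopwatch.StartNew();
    foreach (var target in targets)
    {
        FindViaIteration(list, target);
    }
    watch.Stop();
    Console.WriteLine("Iteration: " + watch.ElapsedMilliseconds + " ms");
}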
List<T> On 100,000 Iterations
Method      Size      Total (ms)   Per Iteration (ms)   % Slower
Any         10        26           0.00046              30.00%
Iteration   10        20           0.00023              -
Any         100       116          0.00201              18.37%
Iteration   100       98           0.00118              -
Any         1,000     1,058        0.01853              16.78%
Iteration   1,000     906          0.01155              -
Any         10,000    10,383       0.18189              17.41%
Iteration   10,000    8,843        0.11362              -
Any         100,000   104,004      1.8297               18.27%
Iteration   100,000   87,941       1.13163              -
The LINQ expression is running about 17% slower for average-sized collections and worse for smaller ones. Presumably, this is the overhead of going through an enumerator and invoking the predicate delegate for every element (and, for the deferred operators, the state machine generated for their yield returns), which seems about right in a tight loop such as this.
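To see where that per-element cost comes from, here's a simplified sketch of what a predicate-based Any() conceptually boils down to -- an illustration, not the actual framework source. Each element costs a MoveNext() call on the enumerator plus a delegate invocation of the lambda, where the hand-written loop just compares two ints directly:

// simplified sketch of a predicate-based Any() -- illustrative only,
// not the actual framework implementation
public static bool AnySketch<T>(IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (T element in source)
    {
        // one delegate invocation per element; the hand-written loop
        // replaces this with a direct int comparison
        if (predicate(element))
        {
            return true;
        }
    }

    return false;
}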
So what about other LINQ expressions? After all, Any() is one of the more trivial ones. I decided to try TakeWhile() with a Count() to get the position where it stopped, like the sample Pete was using in his blog that ReSharper refactored into LINQ for him:
// LINQ form
public static int GetTargetPosition1(IEnumerable<int> list, int target)
{
    return list.TakeWhile(item => item != target).Count();
}

// traditional iterative form
public static int GetTargetPosition2(IEnumerable<int> list, int target)
{
    int count = 0;

    foreach (var i in list)
    {
        if (i == target)
        {
            break;
        }

        ++count;
    }

    return count;
}
Once again, the LINQ expression is much shorter, easier to read, and should be easier to maintain over time, reducing the cost of technical debt. So I ran these through the same test data:
List<T> On 100,000 Iterations
Method      Size      Total (ms)   Per Iteration (ms)   % Slower
TakeWhile   10        41           0.00041              128%
Iteration   10        18           0.00018              -
TakeWhile   100       171          0.00171              88%
Iteration   100       91           0.00091              -
TakeWhile   1,000     1,604        0.01604              94%
Iteration   1,000     825          0.00825              -
TakeWhile   10,000    15,765       0.15765              92%
Iteration   10,000    8,204        0.08204              -
TakeWhile   100,000   156,950      1.5695               92%
Iteration   100,000   81,635       0.81635              -
Wow! I expected some overhead due to the state machines that iterators produce, but 90% slower? That seems a little heavy to me. So then I thought: what if TakeWhile() is simply not the right tool for the job? The problem is that TakeWhile() hands each item back for processing using yield return, whereas our foreach loop really doesn't care about the item beyond evaluating it as a stop condition.
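Conceptually, a TakeWhile-style operator looks something like the sketch below -- again a simplification, not the actual framework source. Every element that passes the predicate makes a round trip back to the caller through the compiler-generated iterator state machine, even though our counting loop only cares about where to stop:

// simplified sketch of a TakeWhile-style operator -- illustrative only,
// not the actual framework implementation
public static IEnumerable<T> TakeWhileSketch<T>(IEnumerable<T> source, Func<T, bool> predicate)
{
    foreach (T element in source)
    {
        if (!predicate(element))
        {
            yield break;       // stop as soon as the predicate fails
        }

        yield return element;  // one MoveNext() round trip per element
    }
}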
So what if that back and forth with the iterator state machine is the problem? Well, we can quickly create an (albeit ugly) lambda that uses Any() along with a count captured in a closure (if a LINQ guru knows a better way, PLEASE let me know!). After all, this is more consistent with what we're actually trying to do: find the first occurrence of an item and halt once we find it, counting along the way. That maps much more closely to Any().
// a new method that uses LINQ but evaluates the count in a closure.
public static int TakeWhileViaLinq2(IEnumerable<int> list, int target)
{
    int count = 0;

    list.Any(item =>
        {
            if (item == target)
            {
                return true;
            }

            ++count;
            return false;
        });

    return count;
}
Now how does this one compare?
List<T> On 100,000 Iterations
Method          Size      Total (ms)   Per Iteration (ms)   % Slower
TakeWhile       10        41           0.00041              128%
Any w/Closure   10        23           0.00023              28%
Iteration       10        18           0.00018              -
TakeWhile       100       171          0.00171              88%
Any w/Closure   100       116          0.00116              27%
Iteration       100       91           0.00091              -
TakeWhile       1,000     1,604        0.01604              94%
Any w/Closure   1,000     1,101        0.01101              33%
Iteration       1,000     825          0.00825              -
TakeWhile       10,000    15,765       0.15765              92%
Any w/Closure   10,000    10,802       0.10802              32%
Iteration       10,000    8,204        0.08204              -
TakeWhile       100,000   156,950      1.5695               92%
Any w/Closure   100,000   108,378      1.08378              33%
Iteration       100,000   81,635       0.81635              -
Much better! It seems that the overhead of TakeWhile() yielding each item back and updating its state machine along the way is drastically reduced by using Any(), since Any() simply iterates forward until it finds the value we're looking for -- which is all the task really requires.
So the lesson there is, make sure when you use a LINQ expression you're choosing the best expression for the job, because if you're doing more work than you really need, you'll have a slower algorithm. But this is true of any choice of algorithm or collection in general.
Even the Any() with the count in the closure is still about 30% slower, but let's consider that angle carefully. For a list of 100,000 items, it was roughly the difference between 1.08 ms and 0.82 ms per search on a List<T>. That's really not bad at all in the grand scheme of things. Even running 90% slower with TakeWhile(), for the vast majority of my projects, an extra millisecond to avoid potential errors in the long term and improve maintainability is a small price to pay. And if your typical list is 1,000 items or less, we're talking only microseconds' worth of difference.
It's like they say: 90% of your performance bottlenecks are in 2% of your code, so over-optimizing almost never pays off. So personally, I'll take the LINQ expression wherever I can, because it will be easier to read and maintain (thus reducing technical debt), and I can rely on Microsoft to have coded and unit tested those algorithms fully, instead of relying on a developer to get the loop logic right every time.
If something's 90% slower, yes, it's worth keeping in mind, but it's really not until you start getting orders of magnitude slower (10x, 100x, 1000x) that the alarm bells should really go off. And if I ever do need that last millisecond of performance? Well, then I'll optimize JUST THAT problem spot. To me it's worth it for the readability, speed-to-market, and maintainability.