C#: LINQ vs foreach - Round 1.

Posted by James Michael Hare on Geeks with Blogs See other posts from Geeks with Blogs or by James Michael Hare
Published on Fri, 23 Apr 2010 04:54:08 GMT Indexed on 2010/04/23 12:03 UTC
Read the original article Hit count: 507

Filed under:

So I was reading Peter Kellner's blog entry on Resharper 5.0 and its LINQ refactoring and thought that was very cool.  But that raised a point I had always been curious about in my head -- which is a better choice: manual foreach loops or LINQ? 
 
The answer is not really clear-cut.  There are two sides to any code cost arguments: performance and maintainability.  The first of these is obvious and quantifiable.  Given any two pieces of code that perform the same function, you can run them side-by-side and see which piece of code performs better.
 
Unfortunately, this is not always a good measure.  Well written assembly language outperforms well written C++ code, but you lose a lot in maintainability which creates a big techncial debt load that is hard to offset as the application ages.  In contrast, higher level constructs make the code more brief and easier to understand, hence reducing technical cost.
 
Now, obviously in this case we're not talking two separate languages, we're comparing doing something manually in the language versus using a higher-order set of IEnumerable extensions that are in the System.Linq library.
 
Well, before we discuss any further, let's look at some sample code and the numbers.  First, let's take a look at the for loop and the LINQ expression.  This is just a simple find comparison:
 
    // find implemented via LINQ
    public static bool FindViaLinq(IEnumerable<int> list, int target)
    {
        return list.Any(item => item == target);
    }
 
 
    // find implemented via standard iteration
    public static bool FindViaIteration(IEnumerable<int> list, int target)
    {
        foreach (var i in list)
        {
            if (i == target)
            {
                return true;
            }
        }
 
        return false;
    }
 

Okay, looking at this from a maintainability point of view, the Linq expression is definitely more concise (8 lines down to 1) and is very readable in intention.  You don't have to actually analyze the behavior of the loop to determine what it's doing.
 
So let's take a look at performance metrics from 100,000 iterations of these methods on a List<int> of varying sizes filled with random data.  For this test, we fill a target array with 100,000 random integers and then run the exact same pseudo-random targets through both searches.
 
                    List<T> On 100,000 Iterations
    Method      Size     Total (ms)  Per Iteration (ms)  % Slower
    Any         10       26          0.00046             30.00%
    Iteration   10       20          0.00023             -
    Any         100      116         0.00201             18.37%
    Iteration   100      98          0.00118             -
    Any         1000     1058        0.01853             16.78%
    Iteration   1000     906         0.01155             -
    Any         10,000   10,383      0.18189             17.41%
    Iteration   10,000   8843        0.11362             -
    Any         100,000  104,004     1.8297              18.27%
    Iteration   100,000  87,941      1.13163             -

 
The LINQ expression is running about 17% slower for average size collections and worse for smaller collections.  Presumably, this is due to the overhead of the state machine used to track the iterators for the yield returns in the LINQ expressions, which seems about right in a tight loop such as this.
 
So what about other LINQ expressions?  After all, Any() is one of the more trivial ones.  I decided to try the TakeWhile() algorithm using a Count() to get the position stopped like the sample Pete was using in his blog that Resharper refactored for him into LINQ:
 
    // Linq form
    public static int GetTargetPosition1(IEnumerable<int> list, int target)
    {
        return list.TakeWhile(item => item != target).Count();
    }
 
    // traditionally iterative form
    public static int GetTargetPosition2(IEnumerable<int> list, int target)
    {
        int count = 0;
 
        foreach (var i in list)
        {
            if(i == target)
            {
                break;
            }
 
            ++count;
        }
 
        return count;
    }
 

Once again, the LINQ expression is much shorter, easier to read, and should be easier to maintain over time, reducing the cost of technical debt.  So I ran these through the same test data:
 
                    List<T> On 100,000 Iterations
    Method      Size     Total (ms)  Per Iteration (ms)  % Slower
    TakeWhile   10       41          0.00041             128%
    Iteration   10       18          0.00018             -
    TakeWhile   100      171         0.00171             88%
    Iteration   100      91          0.00091             -
    TakeWhile   1000     1604        0.01604             94%
    Iteration   1000     825         0.00825             -
    TakeWhile   10,000   15765       0.15765             92%
    Iteration   10,000   8204        0.08204             -
    TakeWhile   100,000  156950      1.5695              92%
    Iteration   100,000  81635       0.81635             -
 
 
Wow!  I expected some overhead due to the state machines iterators produce, but 90% slower?  That seems a little heavy to me.  So then I thought, well, what if TakeWhile() is not the right tool for the job?  The problem is TakeWhile returns each item for processing using yield return, whereas our for-loop really doesn't care about the item beyond using it as a stop condition to evaluate.

So what if that back and forth with the iterator state machine is the problem?  Well, we can quickly create an (albeit ugly) lambda that uses the Any() along with a count in a closure (if a LINQ guru knows a better way PLEASE let me know!), after all , this is more consistent with what we're trying to do, we're trying to find the first occurence of an item and halt once we find it, we just happen to be counting on the way.  This mostly matches Any().
 
    // a new method that uses linq but evaluates the count in a closure.
    public static int TakeWhileViaLinq2(IEnumerable<int> list, int target)
    {
        int count = 0;
        list.Any(item =>
            {
                if(item == target)
                {
                    return true;
                }
 
                ++count;
                return false;
            });
        return count;
    }
   

Now how does this one compare?  
 
                    List<T> On 100,000 Iterations
    Method         Size     Total (ms)  Per Iteration (ms)  % Slower
    TakeWhile      10       41          0.00041             128%
    Any w/Closure  10       23          0.00023             28%
    Iteration      10       18          0.00018             -
    TakeWhile      100      171         0.00171             88%
    Any w/Closure  100      116         0.00116             27%
    Iteration      100      91          0.00091             -
    TakeWhile      1000     1604        0.01604             94%
    Any w/Closure  1000     1101        0.01101             33%
    Iteration      1000     825         0.00825             -
    TakeWhile      10,000   15765       0.15765             92%
    Any w/Closure  10,000   10802       0.10802             32%
    Iteration      10,000   8204        0.08204             -
    TakeWhile      100,000  156950      1.5695              92%
    Any w/Closure  100,000  108378      1.08378             33%
    Iteration      100,000  81635       0.81635             -

   
Much better!  It seems that the overhead of TakeAny() returning each item and updating the state in the state machine is drastically reduced by using Any() since Any() iterates forward until it finds the value we're looking for -- for the task we're attempting to do.
 
So the lesson there is, make sure when you use a LINQ expression you're choosing the best expression for the job, because if you're doing more work than you really need, you'll have a slower algorithm.  But this is true of any choice of algorithm or collection in general.
   
Even with the Any() with the count in the closure it is still about 30% slower, but let's consider that angle carefully.  For a list of 100,000 items, it was the difference between 1.01 ms and 0.82 ms roughly in a List<T>.  That's really not that bad at all in the grand scheme of things.  Even running at 90% slower with TakeWhile(), for the vast majority of my projects, an extra millisecond to save potential errors in the long term and improve maintainability is a small price to pay.  And if your typical list is 1000 items or less we're talking only microseconds worth of difference.
 
It's like they say: 90% of your performance bottlenecks are in 2% of your code, so over-optimizing almost never pays off.  So personally, I'll take the LINQ expression wherever I can because they will be easier to read and maintain (thus reducing technical debt) and I can rely on Microsoft's development to have coded and unit tested those algorithm fully for me instead of relying on a developer to code the loop logic correctly.
 
If something's 90% slower, yes, it's worth keeping in mind, but it's really not until you start get magnitudes-of-order slower (10x, 100x, 1000x) that alarm bells should really go off.  And if I ever do need that last millisecond of performance?  Well then I'll optimize JUST THAT problem spot.  To me it's worth it for the readability, speed-to-market, and maintainability.

© Geeks with Blogs or respective owner