Some of you reading this will be wondering, "what is an iterator" and think I'm locked in the world of C++. Nope, I'm talking C# iterators. No, not enumerators, iterators.
So, for those of you who do not know what iterators are in C#, I will explain it in summary, and for those of you who know what iterators are but are curious of the performance impacts, I will explore that as well.
Iterators have been around for a bit now, and there are still a bunch of people who don't know what they are or what they do. I don't know how many times at work I've had a code review on my code and have someone ask me, "what's that yield word do?"
Basically, this post came to me as I was writing some extension methods to extend IEnumerable<T> -- I'll post some of the fun ones in a later post. Since I was filtering the resulting list down, I was using the standard C# iterator concept; but that got me wondering: what are the performance implications of using an iterator versus returning a new enumeration?
So, to begin, let's look at a couple of methods. This is a new (albeit contrived) method called Every(...). The goal of this method is to access and enumeration and return every nth item in the enumeration (including the first). So Every(2) would return items 0, 2, 4, 6, etc.
Now, if you wanted to write this in the traditional way, you may come up with something like this:
public static IEnumerable<T> Every<T>(this IEnumerable<T> list, int interval)
{
List<T> newList = new List<T>();
int count = 0;
foreach (var i in list)
{
if ((count++ % interval) == 0)
{
newList.Add(i);
}
}
return newList;
}
So basically this method takes any IEnumerable<T> and returns a new IEnumerable<T> that contains every nth item. Pretty straight forward.
The problem? Well, Every<T>(...) will construct a list containing every nth item whether or not you care. What happens if you were searching this result for a certain item and find that item after five tries? You would have generated the rest of the list for nothing.
Enter iterators. This C# construct uses the yield keyword to effectively defer evaluation of the next item until it is asked for. This can be very handy if the evaluation itself is expensive or if there's a fair chance you'll never want to fully evaluate a list.
We see this all the time in Linq, where many expressions are chained together to do complex processing on a list. This would be very expensive if each of these expressions evaluated their entire possible result set on call.
Let's look at the same example function, this time using an iterator:
public static IEnumerable<T> Every<T>(this IEnumerable<T> list, int interval)
{
int count = 0;
foreach (var i in list)
{
if ((count++ % interval) == 0)
{
yield return i;
}
}
}
Notice it does not create a new return value explicitly, the only evidence of a return is the "yield return" statement. What this means is that when an item is requested from the enumeration, it will enter this method and evaluate until it either hits a yield return (in which case that item is returned) or until it exits the method or hits a yield break (in which case the iteration ends.
Behind the scenes, this is all done with a class that the CLR creates behind the scenes that keeps track of the state of the iteration, so that every time the next item is asked for, it finds that item and then updates the current position so it knows where to start at next time.
It doesn't seem like a big deal, does it? But keep in mind the key point here: it only returns items as they are requested. Thus if there's a good chance you will only process a portion of the return list and/or if the evaluation of each item is expensive, an iterator may be of benefit.
This is especially true if you intend your methods to be chainable similar to the way Linq methods can be chained.
For example, perhaps you have a List<int> and you want to take every tenth one until you find one greater than 10. We could write that as:
List<int> someList = new List<int>();
// fill list here
someList.Every(10).TakeWhile(i => i <= 10);
Now is the difference more apparent? If we use the first form of Every that makes a copy of the list. It's going to copy the entire list whether we will need those items or not, that can be costly!
With the iterator version, however, it will only take items from the list until it finds one that is > 10, at which point no further items in the list are evaluated.
So, sounds neat eh? But what's the cost is what you're probably wondering. So I ran some tests using the two forms of Every above on lists varying from 5 to 500,000 integers and tried various things.
Now, iteration isn't free. If you are more likely than not to iterate the entire collection every time, iterator has some very slight overhead:
Copy vs Iterator on 100% of Collection (10,000 iterations)
Collection Size
Num Iterated
Type
Total ms
5
5
Copy
5
5
5
Iterator
5
50
50
Copy
28
50
50
Iterator
27
500
500
Copy
227
500
500
Iterator
247
5000
5000
Copy
2266
5000
5000
Iterator
2444
50,000
50,000
Copy
24,443
50,000
50,000
Iterator
24,719
500,000
500,000
Copy
250,024
500,000
500,000
Iterator
251,521
Notice that when iterating over the entire produced list, the times for the iterator are a little better for smaller lists, then getting just a slight bit worse for larger lists. In reality, given the number of items and iterations, the result is near negligible, but just to show that iterators come at a price. However, it should also be noted that the form of Every that returns a copy will have a left-over collection to garbage collect.
However, if we only partially evaluate less and less through the list, the savings start to show and make it well worth the overhead. Let's look at what happens if you stop looking after 80% of the list:
Copy vs Iterator on 80% of Collection (10,000 iterations)
Collection Size
Num Iterated
Type
Total ms
5
4
Copy
5
5
4
Iterator
5
50
40
Copy
27
50
40
Iterator
23
500
400
Copy
215
500
400
Iterator
200
5000
4000
Copy
2099
5000
4000
Iterator
1962
50,000
40,000
Copy
22,385
50,000
40,000
Iterator
19,599
500,000
400,000
Copy
236,427
500,000
400,000
Iterator
196,010
Notice that the iterator form is now operating quite a bit faster. But the savings really add up if you stop on average at 50% (which most searches would typically do):
Copy vs Iterator on 50% of Collection (10,000 iterations)
Collection Size
Num Iterated
Type
Total ms
5
2
Copy
5
5
2
Iterator
4
50
25
Copy
25
50
25
Iterator
16
500
250
Copy
188
500
250
Iterator
126
5000
2500
Copy
1854
5000
2500
Iterator
1226
50,000
25,000
Copy
19,839
50,000
25,000
Iterator
12,233
500,000
250,000
Copy
208,667
500,000
250,000
Iterator
122,336
Now we see that if we only expect to go on average 50% into the results, we tend to shave off around 40% of the time. And this is only for one level deep. If we are using this in a chain of query expressions it only adds to the savings.
So my recommendation? If you have a resonable expectation that someone may only want to partially consume your enumerable result, I would always tend to favor an iterator. The cost if they iterate the whole thing does not add much at all -- and if they consume only partially, you reap some really good performance gains.
Next time I'll discuss some of my favorite extensions I've created to make development life a little easier and maintainability a little better.