We've completed the first iteration

Posted by CliveT on Simple Talk See other posts from Simple Talk or by CliveT
Published on Thu, 16 Dec 2010 00:00:00 GMT Indexed on 2010/12/16 21:14 UTC
Read the original article Hit count: 326

Filed under:

There are a lot of features in C# that are implemented by the compiler and not by the underlying platform.

One such feature is a lambda expression. Since local variables cannot be accessed once the current method activation finishes, the compiler has to go out of its way to generate a new class which acts as a home for any variable whose lifetime needs to be extended past the activation of the procedure.

Take the following example:

    Random generator = new Random();
    Func func = () => generator.Next(10);

In this case, the compiler generates a new class called <>c_DisplayClass1 which is marked with the CompilerGenerated attribute.

[CompilerGenerated]
private sealed class <>c__DisplayClass1
{
    // Fields
    public Random generator;

    // Methods
    public int b__0()
    {
        return this.generator.Next(10);
    }
}

Two quick comments on this:

(i)    A display was the means that compilers for languages like Algol recorded the various lexical contours of the nested procedure activations on the stack. I imagine that this is what has led to the name.

(ii)    It is a shame that the same attribute is used to mark all compiler generated classes as it makes it hard to figure out what they are being used for. Indeed, you could imagine optimisations that the runtime could perform if it knew that classes corresponded to certain high level concepts.

We can see that the local variable generator has been turned into a field in the class, and the body of the lambda expression has been turned into a method of the new class. The code that builds the Func object simply constructs an instance of this class and initialises the fields to their initial values.

    <>c__DisplayClass1 class2 = new <>c__DisplayClass1();
    class2.generator = new Random();
    Func func = new Func(class2.b__0);

Reflector already contains code to spot this pattern of code and reproduce the form containing the lambda expression, so this is example is correctly decompiled.

The use of compiler generated code is even more spectacular in the case of iterators. C# introduced the idea of a method that could automatically store its state between calls, so that it can pick up where it left off. The code can express the logical flow with yield return and yield break denoting places where the method should return a particular value and be prepared to resume.

        {
            yield return 1;
            yield return 2;
            yield return 3;
        }

Of course, there was already a .NET pattern for expressing the idea of returning a sequence of values with the computation proceeding lazily (in the sense that the work for the next value is executed on demand). This is expressed by the IEnumerable interface with its Current property for fetching the current value and the MoveNext method for forcing the computation of the next value. The sequence is terminated when this method returns false.

The C# compiler links these two ideas together so that an IEnumerator returning method using the yield keyword causes the compiler to produce the implementation of an Iterator.

Take the following piece of code.

        IEnumerable GetItems()
        {
            yield return 1;
            yield return 2;
            yield return 3;
        }

The compiler implements this by defining a new class that implements a state machine. This has an integer state that records which yield point we should go to if we are resumed. It also has a field that records the Current value of the enumerator and a field for recording the thread. This latter value is used for optimising the creation of iterator instances.

[CompilerGenerated]
private sealed class d__0 : IEnumerable, IEnumerable, IEnumerator, IEnumerator, IDisposable
{
    // Fields
    private int <>1__state;
    private int <>2__current;
    public Program <>4__this;
    private int <>l__initialThreadId;

The body gets converted into the code to construct and initialize this new class.

private IEnumerable GetItems()
{
    d__0 d__ = new d__0(-2);
    d__.<>4__this = this;
    return d__;
}

When the class is constructed we set the state, which was passed through as -2 and the current thread.

public d__0(int <>1__state)
{
    this.<>1__state = <>1__state;
    this.<>l__initialThreadId = Thread.CurrentThread.ManagedThreadId;
}

The state needs to be set to 0 to represent a valid enumerator and this is done in the GetEnumerator method which optimises for the usual case where the returned enumerator is only used once.

IEnumerator IEnumerable.GetEnumerator()
{
    if ((Thread.CurrentThread.ManagedThreadId == this.<>l__initialThreadId)
              && (this.<>1__state == -2))
    {
        this.<>1__state = 0;
        return this;
    }

The state machine itself is implemented inside the MoveNext method.

private bool MoveNext()
{
    switch (this.<>1__state)
    {
        case 0:
            this.<>1__state = -1;
            this.<>2__current = 1;
            this.<>1__state = 1;
            return true;

        case 1:
            this.<>1__state = -1;
            this.<>2__current = 2;
            this.<>1__state = 2;
            return true;

        case 2:
            this.<>1__state = -1;
            this.<>2__current = 3;
            this.<>1__state = 3;
            return true;

        case 3:
            this.<>1__state = -1;
            break;
    }
    return false;
}

At each stage, the current value of the state is used to determine how far we got, and then we generate the next value which we return after recording the next state. Finally we return false from the MoveNext to signify the end of the sequence.

Of course, that example was really simple. The original method body didn't have any local variables. Any local variables need to live between the calls to MoveNext and so they need to be transformed into fields in much the same way that we did in the case of the lambda expression. More complicated MoveNext methods are required to deal with resources that need to be disposed when the iterator finishes, and sometimes the compiler uses a temporary variable to hold the return value.

Why all of this explanation?

We've implemented the de-compilation of iterators in the current EAP version of Reflector (7).

pic1
This contrasts with previous version where all you could do was look at the MoveNext method and try to figure out the control flow.

pic2

There's a fair amount of things we have to do. We have to spot the use of a CompilerGenerated class which implements the Enumerator pattern. We need to go to the class and figure out the fields corresponding to the local variables. We then need to go to the MoveNext method and try to break it into the various possible states and spot the state transitions. We can then take these pieces and put them back together into an object model that uses yield return to show the transition points. After that Reflector can carry on optimising using its usual optimisations.

The pattern matching is currently a little too sensitive to changes in the code generation, and we only do a limited analysis of the MoveNext method to determine use of the compiler generated fields. In some ways, it is a pity that iterators are compiled away and there is no metadata that reflects the original intent. Without it, we are always going to dependent on our knowledge of the compiler's implementation. For example, we have noticed that the Async CTP changes the way that iterators are code generated, so we'll have to do some more work to support that.

However, with that warning in place, we seem to do a reasonable job of decompiling the iterators that are built into the framework. Hopefully, the EAP will give us a chance to find examples where we don't spot the pattern correctly or regenerate the wrong code, and we can improve things.

Please give it a go, and report any problems.

© Simple Talk or respective owner