Internal Mutation of Persistent Data Structures
- by Greg Ros
To clarify, when I mean use the terms persistent and immutable on a data structure, I mean that:
The state of the data structure remains unchanged for its lifetime. It always holds the same data, and the same operations always produce the same results.
The data structure allows Add, Remove, and similar methods that return new objects of its kind, modified as instructed, that may or may not share some of the data of the original object.
However, while a data structure may seem to the user as persistent, it may do other things under the hood. To be sure, all data structures are, internally, at least somewhere, based on mutable storage.
If I were to base a persistent vector on an array, and copy it whenever Add is invoked, it would still be persistent, as long as I modify only locally created arrays.
However, sometimes, you can greatly increase performance by mutating a data structure under the hood. In more, say, insidious, dangerous, and destructive ways. Ways that might leave the abstraction untouched, not letting the user know anything has changed about the data structure, but being critical in the implementation level.
For example, let's say that we have a class called ArrayVector implemented using an array. Whenever you invoke Add, you get a ArrayVector build on top of a newly allocated array that has an additional item. A sequence of such updates will involve n array copies and allocations.
Here is an illustration:
However, let's say we implement a lazy mechanism that stores all sorts of updates -- such as Add, Set, and others in a queue. In this case, each update requires constant time (adding an item to a queue), and no array allocation is involved.
When a user tries to get an item in the array, all the queued modifications are applied under the hood, requiring a single array allocation and copy (since we know exactly what data the final array will hold, and how big it will be). Future get operations will be performed on an empty cache, so they will take a single operation. But in order to implement this, we need to 'switch' or mutate the internal array to the new one, and empty the cache -- a very dangerous action.
However, considering that in many circumstances (most updates are going to occur in sequence, after all), this can save a lot of time and memory, it might be worth it -- you will need to ensure exclusive access to the internal state, of course.
This isn't a question about the efficacy of such a data structure. It's a more general question. Is it ever acceptable to mutate the internal state of a supposedly persistent or immutable object in destructive and dangerous ways? Does performance justify it? Would you still be able to call it immutable?
Oh, and could you implement this sort of laziness without mutating the data structure in the specified fashion?