[cfe-commits] Implementation of ImmutableMap and ImmutableSet

Thu Feb 12 18:39:22 PST 2009

On Feb 9, 2009, at 10:50 PM, Zhongxing Xu wrote:

> Hi Ted,
>
> What approach did you use to implement the immutable data  
> structures? Node copying? Fat node? I am trying to read that code.
>

Hi Zhongxing,

I'm assuming you are referring to ImmutableSet and ImmutableMap.   
ImmutableList has a really simple implementation, with new lists  
created by 'consing' them with old lists.

I'll be honest that I'm not certain what you mean by the term "node  
copying" and "fat node"; I can extrapolate, but they don't immediately  
mean something definitive to me.

Both ImmutableSet and ImmutableMap are built on top a purely  
functional balanced binary tree.  Chris Okasaki's excellent book on  
"Purely Functional Data Structures" has a good description of the  
algorithms and methods for implementing a functional red-black tree.   
Instead I used a functional AVL tree, with the algorithm modeled  
directly on the one used for Objective Caml's "Map" module.  You can  
basically look at the Map.ml file in a recent OCaml distribution and  
see a much more concise (and readable) implementation.

Each node in the functional AVL tree is immutable.  It consists of  
pointers to left and right children as well as the 'data' associated  
with the node.  Once a node is created it is never changed; it's  
children stay fixed as well as its data.  Children do not have back- 
pointers to their parents.  Indeed this doesn't make sense, as a node  
can be shared between multiple trees as different parents share common  
children.

Because there are no back pointers, iterators over a functional AVL  
tree must keep their own "recursion stack" for going down and up the  
edges of the tree.

For the ImmutableMap the data is a key-value pair and for ImmutableSet  
it is a single object.  All data is basically considered to be POD,  
making ImmutableSet/ImmutableMap best used with things like pointers  
or small values.  Much of the template magic in the AVL tree  
implementation has to do with handling the comparisons between nodes  
(where in the ImmutableMap case we want to compare keys and the  
ImmutableSet case we want to compare the data itself), so the  
implementation of the AVL tree gets muddled based on the fact it is  
used by multiple clients.

The most important operations for the functional AVL tree is adding  
and removing entries.  Both cases are written using recursive  
algorithms, which recursively dive into the tree to find the spot  
where the data should be placed or removed.  In the case where the  
data doesn't exist in the tree, a "remove" operation just returns the  
original tree.  For the "add" operation, a set of new trees are  
created, adding pointers to existing subtrees as the recursion works  
its way back up to the root.  At the top a new root is created that  
shares some of the existing data of the original tree as well as a  
pointer to the new subtree containing the new data.  Starting with a  
balanced tree, without rebalancing an "add" or "removal" results in  
log(N) new nodes being created, where N is the number of nodes in the  
original tree and log(N) is its height.  Essentially it creates a new  
"path" from the new root to the subtree containing the data element.

With rebalancing, additional nodes may be created.  Rebalancing is  
done recursively, with the bottom being rebalanced first.  The values  
of nodes are copied (which I guess would be considered "node copying")  
to new nodes to do the rebalancing.  Thus new subtrees are created in  
the process to create a new aggregate tree that satisfies the  
properties of an AVL tree.

Note that the recursive nature of the rebalancing operation can cause  
some nodes to become completely unused (i.e., they can be freshly  
created and then have no parents point to them because the rebalancing  
at a higher level in the rebalancing recursion causes them to get  
discarded).  Because OCaml uses GC, their map implementation can  
create nodes that are lost; these are reclaimed by GC.  We instead  
allocate nodes using a BumpPtrAllocator (a pool allocator), which only  
reclaims memory it allocates all at once.  To avoid just leaking nodes  
like made, there is a "mutable" flag in each AVL node which indicates  
whether or not that node has "escaped" outside the confines of the  
current add/remove operation.  When the 'Add'/'Remove' operation  
returns, all nodes are marked immutable, but during the add/remove  
operation orphaned nodes that are still marked mutable are reused  
during the rebalancing.

As an added level of complexity, each node contains a hash digest  
which represents the hash of all its data and the data in its  
subtrees.  This is used to unique trees within a FoldingSet.  The  
result is that ImmutableMap::Factory and ImmutableSet::Factory will  
return the same exact map/set objects (with referential equality) for  
the same logical maps and sets.  This is used extensively in the  
static analyzer to avoid profiling entire sets and maps over and over.

Note that there are plenty of places that memory gets "leaked".  Nodes  
can get leaked when nothing points to them anymore, which can happen  
during rebalancing or when using the digests to fetch the "canonical"  
tree for a given collection of data.  For example, if a set of new  
subtrees get created while doing add and remove operations and (after  
several of these operations) we get a tree with the same digest as an  
existing tree those other trees will get orphaned.  This isn't so  
terrible since all of this memory will get reclaimed when the  
BumpPtrAllocator's destructor gets called, but it is something to  
consider if we find memory pressure to be an issue.

I'm sure that there are many details that I am leaving out; feel free  
to follow up with specific questions!

Ted