[cfe-commits] Implementation of ImmutableMap and ImmutableSet
Ted Kremenek
kremenek at apple.com
Thu Feb 12 18:39:22 PST 2009
On Feb 9, 2009, at 10:50 PM, Zhongxing Xu wrote:
> Hi Ted,
>
> What approach did you use to implement the immutable data
> structures? Node copying? Fat node? I am trying to read that code.
>
Hi Zhongxing,
I'm assuming you are referring to ImmutableSet and ImmutableMap.
ImmutableList has a really simple implementation, with new lists
created by 'consing' them with old lists.
I'll be honest that I'm not certain what you mean by the term "node
copying" and "fat node"; I can extrapolate, but they don't immediately
mean something definitive to me.
Both ImmutableSet and ImmutableMap are built on top a purely
functional balanced binary tree. Chris Okasaki's excellent book on
"Purely Functional Data Structures" has a good description of the
algorithms and methods for implementing a functional red-black tree.
Instead I used a functional AVL tree, with the algorithm modeled
directly on the one used for Objective Caml's "Map" module. You can
basically look at the Map.ml file in a recent OCaml distribution and
see a much more concise (and readable) implementation.
Each node in the functional AVL tree is immutable. It consists of
pointers to left and right children as well as the 'data' associated
with the node. Once a node is created it is never changed; it's
children stay fixed as well as its data. Children do not have back-
pointers to their parents. Indeed this doesn't make sense, as a node
can be shared between multiple trees as different parents share common
children.
Because there are no back pointers, iterators over a functional AVL
tree must keep their own "recursion stack" for going down and up the
edges of the tree.
For the ImmutableMap the data is a key-value pair and for ImmutableSet
it is a single object. All data is basically considered to be POD,
making ImmutableSet/ImmutableMap best used with things like pointers
or small values. Much of the template magic in the AVL tree
implementation has to do with handling the comparisons between nodes
(where in the ImmutableMap case we want to compare keys and the
ImmutableSet case we want to compare the data itself), so the
implementation of the AVL tree gets muddled based on the fact it is
used by multiple clients.
The most important operations for the functional AVL tree is adding
and removing entries. Both cases are written using recursive
algorithms, which recursively dive into the tree to find the spot
where the data should be placed or removed. In the case where the
data doesn't exist in the tree, a "remove" operation just returns the
original tree. For the "add" operation, a set of new trees are
created, adding pointers to existing subtrees as the recursion works
its way back up to the root. At the top a new root is created that
shares some of the existing data of the original tree as well as a
pointer to the new subtree containing the new data. Starting with a
balanced tree, without rebalancing an "add" or "removal" results in
log(N) new nodes being created, where N is the number of nodes in the
original tree and log(N) is its height. Essentially it creates a new
"path" from the new root to the subtree containing the data element.
With rebalancing, additional nodes may be created. Rebalancing is
done recursively, with the bottom being rebalanced first. The values
of nodes are copied (which I guess would be considered "node copying")
to new nodes to do the rebalancing. Thus new subtrees are created in
the process to create a new aggregate tree that satisfies the
properties of an AVL tree.
Note that the recursive nature of the rebalancing operation can cause
some nodes to become completely unused (i.e., they can be freshly
created and then have no parents point to them because the rebalancing
at a higher level in the rebalancing recursion causes them to get
discarded). Because OCaml uses GC, their map implementation can
create nodes that are lost; these are reclaimed by GC. We instead
allocate nodes using a BumpPtrAllocator (a pool allocator), which only
reclaims memory it allocates all at once. To avoid just leaking nodes
like made, there is a "mutable" flag in each AVL node which indicates
whether or not that node has "escaped" outside the confines of the
current add/remove operation. When the 'Add'/'Remove' operation
returns, all nodes are marked immutable, but during the add/remove
operation orphaned nodes that are still marked mutable are reused
during the rebalancing.
As an added level of complexity, each node contains a hash digest
which represents the hash of all its data and the data in its
subtrees. This is used to unique trees within a FoldingSet. The
result is that ImmutableMap::Factory and ImmutableSet::Factory will
return the same exact map/set objects (with referential equality) for
the same logical maps and sets. This is used extensively in the
static analyzer to avoid profiling entire sets and maps over and over.
Note that there are plenty of places that memory gets "leaked". Nodes
can get leaked when nothing points to them anymore, which can happen
during rebalancing or when using the digests to fetch the "canonical"
tree for a given collection of data. For example, if a set of new
subtrees get created while doing add and remove operations and (after
several of these operations) we get a tree with the same digest as an
existing tree those other trees will get orphaned. This isn't so
terrible since all of this memory will get reclaimed when the
BumpPtrAllocator's destructor gets called, but it is something to
consider if we find memory pressure to be an issue.
I'm sure that there are many details that I am leaving out; feel free
to follow up with specific questions!
Ted
More information about the cfe-commits
mailing list