[llvm-dev] RFC: Reducing the number of set classes in ADT
Justin Lebar via llvm-dev
llvm-dev at lists.llvm.org
Fri Oct 14 11:58:58 PDT 2016
tl;dr: I think we have too many different set classes. They have incompatible
APIs and surprising behavior. I think we can reduce their number substantially.
Dear all,
The following is a subset of the set classes we have in ADT:
* DenseSet
* SmallDenseSet (https://reviews.llvm.org/D25628)
* SetVector
* SmallSetVector
* SmallSet
* SmallPtrSet
* StringSet
* FoldingSet
* ContextualFoldingSet
* FoldingSetVector
Every one of these classes has quirks. Here are some particularly bad ones:
* SmallSet falls back to std::set<T> when it runs out of inline storage (which
itself is pretty bad because std::set will pwn your caches). But if T is a
pointer, SmallSet is an alias for SmallPtrSet, which has a completely
different API, and has completely different performance characteristics.
"It's vector<bool> all over again."
As a concrete example, I see a lot of code using SmallSet<std::pair<T*, U*>>,
and it would be reasonable to assume this is similar to SmallSet<T*>. But
it's not.
* SetVector's set is a DenseSet, but SmallSetVector's set is a SmallSet, which
uses a std::set, not a DenseSet. Using DenseSet for one and std::set for the
other means that, in addition to getting different performance
characteristics, the set of key types that work with SetVector and
SmallSetVector are rather different.
For example, today you can use std::string as the key of a SmallSetVector,
but not a SetVector. (Although using std::string inside of SmallSetVector is
kind of perverse, because unless your strings are all very short, you're
going to malloc twice for every string you insert.)
* FoldingSet uses chaining, which is likely less memory- and time-efficient
than open addressing (which pretty much all of our other set classes use).
* SmallPtrSet has an interface that lets you operate on an instance without
knowing its inline size (SmallPtrSetImpl), but SmallSetVector and
SmallDenseSet don't have this.
But more to my point, each of these sets has a different API. Some support
limited C++11-isms, others don't. Most don't support the full range of stl set
operations, but they all pick and choose a different subset of operations to
support. And many add their own API extensions, which are of course all
different.
I think we can do better than this.
It seems to me that ideally we'd need only four set classes, down from ten:
* DenseSet
* SmallDenseSet
* SetVector (uses std::vector and DenseMap)
* SmallSetVector (uses SmallVector and SmallDenseMap)
Let me try to convince you that this is feasible.
* SmallPtrSet is already basically the same as SmallDenseSet.
* StringSet is just a DenseSet with a special key type that's an owning
string, but where you can do lookups given a StringRef. This is like
DenseMap::find_as.
* SmallSet is usually used with pointer types, so in those cases is just
another name for SmallPtrSet. In other cases, we can use DenseSet (in the
worst case by wrapping the key type in a class that adds "is empty" and "is
toombstone" bits).
* I may be wrong about this one, but at first blush, it seems that FoldingSet
is morally equivalent to DenseMap::find_as.
I hope the advantages of having fewer set classes is clear, but to be explicit:
This would simplify our code and reduce surprising behavior. We wouldn't need
to remember all of the API quirks of the different set classes. And it would
let us focus our efforts on making one set of great set APIs, instead of adding
API improvements piecemeal across all of our classes. More wood, fewer arrows,
that sort of thing.
As another benefit, having one base hashtable implementation would mean that if
someone wanted to try implementing a newfangled hashtable algorithm (cuckoo or
Robin Hood hashing, anyone?), we'd reap the benefit across all of llvm's sets.
Obviously this is a large change, and I don't propose we make it quickly or all
at once. Here's how I currently imagine we'd do it:
1. Switch SmallSetVector to use SmallDenseSet.
This is a relatively small change; we just have to fix up some users who use
keys that aren't compatible with DenseSet.
2. Get rid of SmallSet, replacing it with SmallDenseSet and SmallPtrSet.
This will involve
a) making a SmallDenseSetImpl (like SmallPtrSetImpl), so that you can
operate on SmallDenseSets without knowing their size, and
b) fixing up users whose keys that aren't compatible with DenseSet (like
(1)).
Part (a) will also give us a SmallDenseMapImpl class, which should be
useful.
3. Get rid of SmallPtrSet, replacing it with SmallDenseSet.
4. Get rid of StringSet, replacing it with DenseSet (and improving DenseSet's
"heterogeneous lookup" APIs so that this retains approximately as nice an
API as StringSet currently has).
5. Get rid of FoldingSet? Again I haven't looked too deeply into this one.
These steps are roughly in increasing order of complexity, and I think any
prefix of this list is still worth doing.
I'm going to start sending out patches for (1), which is a small change that I
hope is uncontroversial. I expect even if we agree on the larger plan that
this will be a background task for me, so I don't anticipate getting through it
all anytime particularly soon.
Help would, of course, be appreciated!
Regards,
-Justin
More information about the llvm-dev
mailing list