[llvm-dev] RFC: Reducing the number of set classes in ADT

Fri Oct 14 11:58:58 PDT 2016

tl;dr: I think we have too many different set classes.  They have incompatible
APIs and surprising behavior. I think we can reduce their number substantially.

Dear all,

The following is a subset of the set classes we have in ADT:

 * DenseSet
 * SmallDenseSet (https://reviews.llvm.org/D25628)
 * SetVector
 * SmallSetVector
 * SmallSet
 * SmallPtrSet
 * StringSet
 * FoldingSet
 * ContextualFoldingSet
 * FoldingSetVector

Every one of these classes has quirks.  Here are some particularly bad ones:

* SmallSet falls back to std::set<T> when it runs out of inline storage (which
  itself is pretty bad because std::set will pwn your caches).  But if T is a
  pointer, SmallSet is an alias for SmallPtrSet, which has a completely
  different API, and has completely different performance characteristics.
  "It's vector<bool> all over again."

  As a concrete example, I see a lot of code using SmallSet<std::pair<T*, U*>>,
  and it would be reasonable to assume this is similar to SmallSet<T*>.  But
  it's not.

* SetVector's set is a DenseSet, but SmallSetVector's set is a SmallSet, which
  uses a std::set, not a DenseSet.  Using DenseSet for one and std::set for the
  other means that, in addition to getting different performance
  characteristics, the set of key types that work with SetVector and
  SmallSetVector are rather different.

  For example, today you can use std::string as the key of a SmallSetVector,
  but not a SetVector.  (Although using std::string inside of SmallSetVector is
  kind of perverse, because unless your strings are all very short, you're
  going to malloc twice for every string you insert.)

* FoldingSet uses chaining, which is likely less memory- and time-efficient
  than open addressing (which pretty much all of our other set classes use).

* SmallPtrSet has an interface that lets you operate on an instance without
  knowing its inline size (SmallPtrSetImpl), but SmallSetVector and
  SmallDenseSet don't have this.

But more to my point, each of these sets has a different API.  Some support
limited C++11-isms, others don't.  Most don't support the full range of stl set
operations, but they all pick and choose a different subset of operations to
support.  And many add their own API extensions, which are of course all
different.

I think we can do better than this.

It seems to me that ideally we'd need only four set classes, down from ten:

 * DenseSet
 * SmallDenseSet
 * SetVector (uses std::vector and DenseMap)
 * SmallSetVector (uses SmallVector and SmallDenseMap)

Let me try to convince you that this is feasible.

 * SmallPtrSet is already basically the same as SmallDenseSet.

 * StringSet is just a DenseSet with a special key type that's an owning
   string, but where you can do lookups given a StringRef.  This is like
   DenseMap::find_as.

 * SmallSet is usually used with pointer types, so in those cases is just
   another name for SmallPtrSet.  In other cases, we can use DenseSet (in the
   worst case by wrapping the key type in a class that adds "is empty" and "is
   toombstone" bits).

 * I may be wrong about this one, but at first blush, it seems that FoldingSet
   is morally equivalent to DenseMap::find_as.

I hope the advantages of having fewer set classes is clear, but to be explicit:
This would simplify our code and reduce surprising behavior.  We wouldn't need
to remember all of the API quirks of the different set classes.  And it would
let us focus our efforts on making one set of great set APIs, instead of adding
API improvements piecemeal across all of our classes.  More wood, fewer arrows,
that sort of thing.

As another benefit, having one base hashtable implementation would mean that if
someone wanted to try implementing a newfangled hashtable algorithm (cuckoo or
Robin Hood hashing, anyone?), we'd reap the benefit across all of llvm's sets.

Obviously this is a large change, and I don't propose we make it quickly or all
at once.  Here's how I currently imagine we'd do it:

1. Switch SmallSetVector to use SmallDenseSet.

   This is a relatively small change; we just have to fix up some users who use
   keys that aren't compatible with DenseSet.

2. Get rid of SmallSet, replacing it with SmallDenseSet and SmallPtrSet.

   This will involve

   a) making a SmallDenseSetImpl (like SmallPtrSetImpl), so that you can
      operate on SmallDenseSets without knowing their size, and

   b) fixing up users whose keys that aren't compatible with DenseSet (like
      (1)).

   Part (a) will also give us a SmallDenseMapImpl class, which should be
   useful.

3. Get rid of SmallPtrSet, replacing it with SmallDenseSet.

4. Get rid of StringSet, replacing it with DenseSet (and improving DenseSet's
   "heterogeneous lookup" APIs so that this retains approximately as nice an
   API as StringSet currently has).

5. Get rid of FoldingSet?  Again I haven't looked too deeply into this one.

These steps are roughly in increasing order of complexity, and I think any
prefix of this list is still worth doing.

I'm going to start sending out patches for (1), which is a small change that I
hope is uncontroversial.  I expect even if we agree on the larger plan that
this will be a background task for me, so I don't anticipate getting through it
all anytime particularly soon.

Help would, of course, be appreciated!

Regards,
-Justin