[PATCH] Switch lowering: extract jump tables and bit tests before building binary tree (PR22262)

Tue Mar 31 13:27:50 PDT 2015

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7337
@@ +7336,3 @@
+
+bool SelectionDAGBuilder::isSuitableForBitTests(unsigned NumDests,
+                                                unsigned NumClusters,
----------------
djasper wrote:
> Everything in this function needs comments. Even if it didn't have any before, write down what you think this does.
> 
> Also, I think I'd probably pull out two different function that have names corresponding to what they do.
Added comments and extracted rangeFitsInWord().

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7374
@@ +7373,3 @@
+  const int BW = DAG.getTargetLoweringInfo().getPointerTy().getSizeInBits();
+  assert((High - Low + 1).sle(BW));
+
----------------
djasper wrote:
> This assert basically test part of isSuitableForBitTests. Pull that part out into a separate well-named function instead.
I don't agree. I think this assert makes sense here: we're about to build the actual bit mask and assert that it fits. I've added a comment in the assert though.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7430
@@ +7429,3 @@
+
+void SelectionDAGBuilder::findBitTestClusters(CaseClusterVector &Clusters,
+                                              const SwitchInst *SI) {
----------------
djasper wrote:
> djasper wrote:
> > With the new structure, I'd probably approach this entirely differently. IIUC, bit tests are really good to handle many (potentially relatively spares) case labels leading to the same destination.
> > 
> > Now, to make use of this, I'd first group all case clusters (maybe even before rangeifying them) and group them by destination. Now, start with the biggest group (possibly biggest weight by taking branch probabilities into account) and look at whether they can be lowered as bit tests. Thereby, it is possible to peel away the most bit-test-worthy groups (grouped by common destination) one at a time. I haven't completely thought about how to do this and there are a few additional things to consider:
> > 
> > - The range of a group might be too large (greater than the bit width). However, it might be possible to split that one group into multiple sub groups.
> > - It might be possible to lower multiple groups (Group1 leading to Dest1, Group2 leading to Dest2, ..) as a single bit test to conserve extra shifts (if the entire range fits into the bit width). This is currently (and in the previous implementation) sort of done by saying that there can be up to three unique destinations, but I don't that this is actually a good heuristic.
> > - At some point there will be a cut-off, i.e. lowering a group as bit test is not worth it. Trivially, single-destination groups are never worth to be lowered as bit tests (a single cmp is faster than and+cmp). But as we might be able to lower the others as jump tables, the cut-off might be higher.
> > 
> > Lets chat some more about this.
> "relatively sparse" was what I meant to say.
Yes, bit tests come up a lot, especially in code like "if (x == 5 || x == 17)" that we transform to switches internally.

I like the idea of grouping the cases by destination and trying to greedily find clusters for bit tests that way.

The tricky bit is that in the end, the CaseClusters need to be non-overlapping for the binary tree to work, i.e. if we find that {1,8,10,34} go to to one destination and could be a bit test, it's a problem if there are a bunch of other cases in between.

I'll need to think about this.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7448
@@ +7447,3 @@
+
+  int BW = PTy.getSizeInBits();
+  // Partition Clusters into as few subsets as possible, where each subset has a
----------------
djasper wrote:
> I think this function is too long and BW should be replaced with something not abbreviated.
Renamed to BitWidth.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7449
@@ +7448,3 @@
+  int BW = PTy.getSizeInBits();
+  // Partition Clusters into as few subsets as possible, where each subset has a
+  // range <= BW, and <= 3 unique destinations.
----------------
djasper wrote:
> This should be extended and made into a function comment, not hidden somewhere in the code.
Moved it to the top of the function.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7454
@@ +7453,3 @@
+
+  // Share one buffer to save allocations.
+  std::unique_ptr<unsigned[]> Buf(new unsigned[N * 2]);
----------------
djasper wrote:
> Do you actually observe a difference here? Seems like premature optimization.
> 
> And also, there seem to be better ways to optimize this (e.g. use a SmallVector to keep the common case of small N on the stack or to re-use and allocated area for this pass instead of allocating a new one for each switch.
You're right, I'll just use SmallVectors. Most switches are very small.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7470
@@ +7469,3 @@
+  // Note: loop indexes are signed to avoid underflow.
+  for (int64_t i = N - 2; i >= 0; i--) {
+    // Find optimal partitioning of Clusters[i..N-1].
----------------
djasper wrote:
> --i?
Done.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7476
@@ +7475,3 @@
+
+    // Search for a better solution.
+    for (int64_t j = std::min(N - 1, i + BW - 1); j > i; j--) {
----------------
djasper wrote:
> Please write down what 'better' means.
Done.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7482
@@ +7481,3 @@
+      APInt R = Clusters[j].High->getValue() - Clusters[i].Low->getValue();
+      uint64_t Range = R.getLimitedValue(UINT64_MAX - 1) + 1;
+      if (Range > (uint64_t)BW)
----------------
djasper wrote:
> If you pull apart the isSuitableForBitTests() function as mentioned above, you can likely re-use it here.
Done.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7489
@@ +7488,3 @@
+      SmallPtrSet<MachineBasicBlock *, 8> Dests;
+      for (int64_t k = i; k <= j; k++) {
+        if (Clusters[k].Kind != CC_Range) {
----------------
djasper wrote:
> So, this is actually a O(N^3) algorithm?
See my "Slowness" FIXME :)

It's not O(n^3) though, as j-i is capped at BitWidth. I believe this algorithm is O(n) with a large constant.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp:7837
@@ +7836,3 @@
+    findJumpTables(Clusters, &SI, DefaultMBB);
+    findBitTestClusters(Clusters, &SI);
+  }
----------------
djasper wrote:
> IIUC, this seems to be a fundamental difference between the previous and current implementation. The previous implementation called handleBitTestsSwitchCase first and only later handleJTSwitchCase. Why did you reverse the order?
> 
> IMO, it could be really important to prefer bit tests if there are many labels leading to the same destination. These inherently cause redundancy in jump tables.
Very good point, the order here is reversed, but I think the priorities here are still the same.

While the previous code would only check if a whole range could be lowered as bit tests or jump table, this new code is good at finding bit tests in a sub-range. In my testing, it was common that a bit test would be found which then prevented a jump table from being built.

The new code checks for jump tables first, because I think it is more expensive if we miss an opportunity for a jump table. However, FindJumpTable explicitly checks if that range could be lowered with bit tests instead, and bails if so. That is, we will not use a jump table to lower a range that could be lowered with bit tests. This is tested by the "bt_is_better" and "jt_is_better" tests in switch.ll.

As you pointed out on IRC, it's now possible that a small jump table prevents a larger bit test from being built. However, I think that's a less costly problem than the reverse.

================
Comment at: lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h:206
@@ -167,26 +205,3 @@
 
-  typedef std::vector<Case>           CaseVector;
-  typedef std::vector<CaseBits>       CaseBitsVector;
-  typedef CaseVector::iterator        CaseItr;
-  typedef std::pair<CaseItr, CaseItr> CaseRange;
-
-  /// CaseRec - A struct with ctor used in lowering switches to a binary tree
-  /// of conditional branches.
-  struct CaseRec {
-    CaseRec(MachineBasicBlock *bb, const ConstantInt *lt, const ConstantInt *ge,
-            CaseRange r) :
-    CaseBB(bb), LT(lt), GE(ge), Range(r) {}
-
-    /// CaseBB - The MBB in which to emit the compare and branch
-    MachineBasicBlock *CaseBB;
-    /// LT, GE - If nonzero, we know the current case value must be less-than or
-    /// greater-than-or-equal-to these Constants.
-    const ConstantInt *LT;
-    const ConstantInt *GE;
-    /// Range - A pair of iterators representing the range of case values to be
-    /// processed at this point in the binary search tree.
-    CaseRange Range;
-  };
-
-  typedef std::vector<CaseRec> CaseRecVector;
+    CaseBits(): Mask(0), BB(nullptr), Bits(0), ExtraWeight(0) { }
 
----------------
djasper wrote:
> Use space before ":" or ideally clang-format :-).
Done.

http://reviews.llvm.org/D8649

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/