[PATCH] D34200: [PM/unswitch] Teach SimpleLoopUnswitch to do non-trivial unswitching, making it no longer even remotely simple.

Wed Jun 28 03:12:12 PDT 2017

chandlerc added a comment.

Thanks for initial comments. I think I have most of these (except for a couple of the "extract this chunk" comments and one issue w/ ranking) addressed so maybe useful for another look. I'll work on the remaining bits in parallel.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:771-773
+  // Recurse down through further child loops.
+  for (Loop *ChildL : OrigL)
+    cloneLoop(*ChildL, ClonedL, VMap, LI);
----------------
dberris wrote:
> What is the typical depth for these things? i.e. is there a risk that a sufficiently complex loop within a function will cause this to fail spectacularly? Will it be more complex to actually do a DFS using a stack and iteratively going through the blocks?
In practice, we can often get away with this. But really there is no reason.

There were lots of other subtle and annoying aspects of how I was cloning loops and blocks into loops. For example, this did a really poor job of ordering the blocks within the loop data structure. While this doesn't matter for correctness or any of the test cases, it would make debugging subsequent transformation passes really, really annoying so it seems worth doing in a better way.

What's more, that is tied up in a gross ineffeciency in the code that I even had a FIXME for, so I decided to re-work how this is done.

The result is a bit longer in terms of lines-of-code, but I think it is quite a bit cleaner. We now *directly* clone each loop in the loop nest, without recursion, and with easier to reason about logic for inserting blocks into their block lists.

Below, we don't map blocks into loops all over the place, and instead use a more rich intermediate data structure and then do a single pass to map the blocks into the exit loops.

I know this was just a small comment on yoru part, but thanks, I think the approach is quite a bit better now.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1004
+
+void recursivelyEraseNode(DominatorTree &DT, BasicBlock *BB) {
+  SmallVector<DomTreeNode *, 4> Worklist;
----------------
dberris wrote:
> The name seems to be inconsistent with the behaviour -- i.e. it's not really recursively calling itself. Maybe remove the 'recursively' in the name? Also, why isn't this function static unlike the others?
It's recursive in the dominator tree, even though it isn't implemented that way. The non-recursive implementation is (I think) just a detail.

Great catch on the missing static, fixed.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1110
+      ExitLoops.push_back(ExitL);
+      ExitsInLoops.push_back(ExitBB);
+      if (!ParentL || (ParentL != ExitL && ParentL->contains(ExitL)))
----------------
sanjoy wrote:
> Can this be common'ed up with the logic in `buildClonedLoops`?
I really, really wanted to do that.

I couldn't figure out how to do it.

This *only* has to delete things and hoist things. It doesn't ever create a new loop. So while the overall structure is clearly very similar, at each step we end up doing a subtly different thing.

The biggest thing that seems common between them to me is the reverse walk to re-establish the loop body. However, *within* that walk we have quite different code. Still, could possibly try to extract some of that logic to a helper. Other ideas?

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1398
+///
+/// It also updates both dominator tree and loopinfo based on the unswitching.
+///
----------------
sanjoy wrote:
> The commit message says you don't update the dom tree?
I don't *incrementally* update it, but it does get updated.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1461
+  SmallPtrSet<BasicBlock *, 4> SkippedLoopAndExitBlocks;
+  if (ContinueSuccBB->getUniquePredecessor() ||
+      llvm::all_of(predecessors(ContinueSuccBB), [&](BasicBlock *PredBB) {
----------------
sanjoy wrote:
> Again, I think you can just directly ask the domtree for this.
There is a BasicBlockEdge query to the domtree, but I'm not sure its a good idea...

For a branch instruction, it would be fine. But when this becomes a switch, I think it'll do the wrong thing. There, we'll be fine if there are multiple "continue" edges from the switch's parent BB to the continue successor BB, but because there are multiple edges, the edge-based dominates query will return false.

A different way of looking at this is that this is actually materially different from the dominates query -- this doesn't check if there is *a* dominating edge, it checks if there are any edges entering this successor from a block other than the parent of the unswitched terminator. Which is also what the code is somewhat obviously doing.

Anyways, happy to rewrite this if you still prefer a method query, but curious whaht exact query you would prefer.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1851
+  // 1) An unswitching cost below the threshold
+  // 2) The smallest number of duplicated unswitch candidates (to avoid
+  //    creating redundant subsequent unswitching)
----------------
sanjoy wrote:
> I may be missing something, but I only see (1).  Also, a one liner on what "unswitching cost" and "cost after unswitching" mean will be helpful.
No, I wrote this comment in part as notes to myself, and I never returned to it. I also think I need to think more about the ranking here. This will take me a bit of time though, and hopefully the rest we can make review progress on in the interim...

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1871
+      // Don't count successors more than once.
+      if (!Visited.insert(SuccBB).second)
+        continue;
----------------
sanjoy wrote:
> Right now this is always true, right (since you're only unswitching `br` instructions that have different successors)?
Correct, I've just tried to write the code with as few assumptions about the terminator as possible because unlike the trivial case, the logic is complex enough that I imagine almost all of it will end up shared.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1879
+      if (SuccBB->getUniquePredecessor() ||
+          llvm::all_of(predecessors(SuccBB), [&](BasicBlock *PredBB) {
+            return PredBB == &BB || DT.dominates(SuccBB, PredBB);
----------------
sanjoy wrote:
> Isn't this the same as edge domination?
See my response above. Again, I'm happy to rewrite this whichever way seems most clear.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1901
+                 << " for unswitch candidate: " << *CandidateTI << "\n");
+    if (!UnswitchTI || CandidateCost < UnswitchCost) {
+      UnswitchTI = CandidateTI;
----------------
sanjoy wrote:
> Aren't you reading uninitialized memory here (for `UnswitchCost`)?
> 
> I'd also name `UnswitchCost` in a way that indicates it is the "best so far".
No, because we wait until we have *some* unswitch terminator first, and only then compare the cost.

Updated names.

================
Comment at: lib/Transforms/Scalar/SimpleLoopUnswitch.cpp:1907
+
+  if (UnswitchCost < UnswitchThreshold) {
+    DEBUG(dbgs() << "  Trying to unswitch non-trivial (cost = " << UnswitchCost
----------------
sanjoy wrote:
> How about putting all of the logic up to this point into a `TerminatorInst *findNonTrivialUnswitchCandidate` function?
I like factoring something like this... Question though, is it useful to keep the call to actually do the unswitch separate? Happy to look at the code and try some splits / arrangements to make it a bit more manageable, I'm definitely not (yet) happy w/ the factoring.

https://reviews.llvm.org/D34200