[PATCH] D37467: Add a new pass to speculate around PHI nodes with constant (integer) operands when profitable.

Chandler Carruth via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Oct 17 00:03:25 PDT 2017


chandlerc added a comment.

Hopefully responding to all of the comments. Note the last one, which is probably the most interesting point of discussion around cost modeling.



================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:67
+    // instructions and allow handling stores in easily proven safe cases.
+    if (mayBeMemoryDependent(*UI)) {
+      DEBUG(dbgs() << "  Unsafe: can't speculate use: " << *UI << "\n");
----------------
davidxl wrote:
> chandlerc wrote:
> > davidxl wrote:
> > > This is too limiting.  The scenario is commonly created by the store sinking optimization -- which involves memory operation.
> > There is a FIXME discussing this already?
> > 
> > However, this pass is already able to handle the motivating case, so it seems reasonable to extend this cautiously in subsequent commits?
> This might affect the design decisions: use MemSSA or AS ?
> 
> It is probably good to handle the easiest cases (which covers 90% of the cases) where the memory instruction is just hoisted to its direct predecessor blocks.
FWIW, I think the current design handles all of the cases I've seen in the wild. Anyways, I don't disagree about handling this, it just seems easy to do in a follow-up patch.


================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:107
+        // we're done.
+        if (PotentialSpecSet.count(OpI))
+          continue;
----------------
chandlerc wrote:
> davidxl wrote:
> > why not checking this first?
> Sorry, this order came from when we didn't update the set as aggressively. Yeah, we should check this first.
Coming back to this -- the checks above this are actually probably cheaper. Neither requires walking the operands at all, so they let us avoid the potentially-speculated-set queries entirely. They're also both cases where we won't actually speculate the instruction at all. So I think the current order makes sense. I've added some comments to help clarify this.


================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:112
+        // If when we directly test whether this is safe it fails, bail.
+        if (UnsafeSet.count(OpI) || mayBeMemoryDependent(*OpI)) {
+          DEBUG(dbgs() << "  Unsafe: can't speculate transitive use: " << *OpI
----------------
davidxl wrote:
> chandlerc wrote:
> > davidxl wrote:
> > > What is the second check about? OpI is not a code motion candidate.
> > > 
> > > The first check seems wrong to me too.  At this point, we have determined that OpI is not in a block that dominates the target blocks of the code motion (hoisting), so it should return false immediately.
> > > 
> > > I don't see the need to do a DFS walk to check the chains of operands either. 
> > > 
> > > The caching here seems also wrong and confusing.
> > If `OpI` isn't in a block that dominates the target blocks, we will need to hoist `OpI` as well as the original `UI`. That's the point of this walk and the discussion w/ Danny: we may need to hoist more than one instruction in order to hoist the user of the PHI.
> > 
> > This actually was necessary to get basic tests to work at all because in many cases the operand is a trivial/free instruction such as a trunc or zext between i64 and i32. We want to hoist those in addition to the PHI use. There is a test case that demonstrates this as well I think.
> > 
> > Because we're doing that, we need to actually recurse in some way. DFS vs. BFS is just a tradeoff in terms of how you visit things and my previous emails discussed why I think DFS is probably right.
> > 
> > Hopefully this also makes the caching more clear, but please let me know what is unclear/confusing if not.
> 
> So dependent OpI are also hoisting candidates. However, I don't see the cost model actually considering them.
> 
> Besides, speculative and non-speculative hoisting candidates are all handled here. For the real speculative ones, there are not enough checks for safe speculation -- e.g. instructions that may trap or raise exceptions.
The cost model does consider the transitive operands that have to be speculated? There are tests around this.

Also, `mayBeMemoryDependent` handles all of the cases you mention. It requires the instruction to both be safe to speculatively execute (and therefore not control or data dependent in any way) and not read or write memory. It is essentially the strictest predicate we could use (hence the FIXMEs around using more nuanced checks in the future).


================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:117
+          // so we prune subsequent searches.
+          UnsafeSet.insert(OpI);
+          for (auto &StackPair : DFSStack) {
----------------
davidxl wrote:
> What is the point of tracking UnsafeSet?
> 
> The code will be much more readable if the following is done:
> 
> 1) collect the set of all operands (transitively) UI depends on, which are not defined in a dominating block of the hoisting target blocks;
> 2) check if it safe to hoist those dependent operands. If any one of them is not safe, bail.
See my conversation with Danny about why this is needed?

We are doing exactly what you suggest, but memoizing the result to reduce the total complexity of this algorithm.
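To make the memoization concrete, here is a hypothetical Python sketch of the walk (a toy model under my own names, not the pass's real data structures): `operands` maps an instruction to the transitive operands that would also need hoisting, and `unsafe` marks instructions failing the direct safety test (the analogue of `mayBeMemoryDependent`). Both outcomes are cached, so each chain is walked at most once.

```python
# Toy model of the memoized operand-chain DFS discussed in this review.
def is_safe_to_speculate(inst, operands, unsafe, potential_spec_set, unsafe_set):
    # Memoized success: this chain was already proven speculatable.
    if inst in potential_spec_set:
        return True
    # Memoized failure, or the direct safety test fails: prune immediately.
    if inst in unsafe_set or inst in unsafe:
        unsafe_set.add(inst)
        return False
    # Recurse into operands; the real pass uses an explicit DFS stack and
    # marks the whole stack unsafe at once, which has the same effect.
    for op in operands.get(inst, []):
        if not is_safe_to_speculate(op, operands, unsafe,
                                    potential_spec_set, unsafe_set):
            unsafe_set.add(inst)
            return False
    potential_spec_set.add(inst)
    return True

# A zext feeding an add, where the zext's operand is a load: the whole chain
# is marked unsafe once, so any later query over it is a single set lookup.
pot, uns = set(), set()
assert not is_safe_to_speculate("add", {"add": ["zext"], "zext": ["load"]},
                                {"load"}, pot, uns)
assert uns == {"add", "zext", "load"}
```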


================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:119
+          for (auto &StackPair : DFSStack) {
+            Instruction *I = StackPair.first;
+            UnsafeSet.insert(I);
----------------
davidxl wrote:
> Why not pop the DFSstack?
This is less code and avoids mutating the stack on each step of the loop. Return will free the memory anyways.


================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:256
+
+  DEBUG(dbgs() << "    Cost savings " << (MatCost - FoldedImmCost) << ": " << PN
+               << "\n");
----------------
davidxl wrote:
> chandlerc wrote:
> > davidxl wrote:
> > > The cost analysis here assumes non-speculative case where the uses of phi  are fully anticipated. Consider the real speculative case:
> > > 
> > > 
> > > ```
> > > 
> > > /// entry:
> > > ///     br i1 %flag, label %a, label %b
> > > ///
> > > ///   a:
> > > ///     br label %merge
> > > ///
> > > ///   b:
> > > ///     br label %merge
> > > ///
> > > ///   merge:
> > > ///     %p = phi i32 [ 7, %a ], [ 11, %b ]
> > > ///     br i1 %flag2, label %use, label %exit
> > > ///
> > > ///   use:
> > > ///     %sum = add i32 %arg, %p
> > > ///     br label %exit
> > > ///
> > > ///   exit:
> > > ///    %s = phi ([0, ...], [%sum, ..])
> > > ///     ret i32 %s
> > > 
> > > ```
> > > Hoisting the add into %a and %b is speculative. Additional speculation cost is involved, depending on the cost of the add and the relative frequency of %a+%b vs. %use.
> > > 
> > > 
> > > Consider another case with a triangular CFG:
> > > 
> > > ```
> > > 
> > > /// entry:
> > > ///     br i1 %flag, label %a, label %merge
> > > ///   a:
> > > ///     br label %merge
> > > ///
> > > ///   merge:
> > > ///     %p = phi i32 [ 7, %a ], [ 11, %entry ]
> > > ///     %sum = add i32 %arg, %p
> > > ///    ret i32 %sum
> > > 
> > > ```
> > > 
> > > hoisting 'add' into entry and '%a' introduces runtime overhead.
> > Ah, this explains perhaps the comment above regarding BFI.
> > 
> > The second case was definitely unintended. I can add something to prevent this. I'm not sure we *ever* want to do that case.
> > 
> > For the first case, I'm not so sure. If the 'add' is the same cost as materializing the constant, and we'll have to materialize the constant anyways, is this really a problem? We can suppress this too if so.
> > 
> > I'm somewhat interested in simpler cost modeling and just eliminating the hard cases here if possible (IE, until we see places where the fancier approach is needed).
> As an example for the first case, suppose there are two incoming constants C1 and C2.  M(C1) and M(C2) are the costs of materializing the constants, and F(C1) and F(C2) are the costs of folding them into UI.  Assume the frequencies of the two incoming blocks are P1 and P2.
> 
> The total cost before will be: P1 * M(C1) + P2 * M(C2), and the total cost after is P1 * F(C1) + P2 * F(C2).
> 
> The comparison can be reduced to comparing  M(C1) + M(C2) with F(C1) + F(C2) only when
> 1) P1 == P2
> or
> 2) M(C1) == M(C2) and F(C1) == F(C2)
> 
> Depending on the values of C1 and C2, 2) may not hold.
> 
I should clarify why the second case you mentioned originally cannot in fact happen:

The code splits critical edges so that it should not be changing the total number of instructions executed along any given path. The speculation goes into a non-critical edge, and so it shouldn't go onto a path that wouldn't have executed that exact instruction anyways.

And again, there are already tests of this.

For the first case, I see what you're getting at with this example. However, I'm worried about scaling by probability: it could drive us to do far too much speculation just to remove a constant materialization along an extremely hot path. A very minor improvement along a very hot path could have nearly unbounded code size regressions along the cold path.
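A concrete (entirely made-up) instance of the weighting issue: the unweighted totals and the frequency-weighted totals can disagree. The cost numbers below are illustrative, not taken from any real target's cost tables.

```python
# Per-constant costs: M = materialization cost, F = cost when folded into UI.
M = {"C1": 1, "C2": 4}      # C2 is expensive to materialize
F = {"C1": 2, "C2": 2}      # but folding C1 is worse than materializing it
P = {"C1": 0.9, "C2": 0.1}  # the C1 edge is the hot one

# Unweighted comparison says speculation is profitable...
assert sum(F.values()) < sum(M.values())       # 4 < 5

# ...but weighting by block frequency shows a regression on the hot path.
weighted_before = sum(P[c] * M[c] for c in M)  # ~1.3
weighted_after = sum(P[c] * F[c] for c in F)   # ~2.0
assert weighted_after > weighted_before
```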

An alternative, simpler approach would be to check that, in addition to `F(C1) + F(C2) < M(C1) + M(C2)`, we also have `F(C1) <= M(C1) && F(C2) <= M(C2)`. That should ensure that no path regresses its dynamic cost and that the total isn't a size regression overall. I haven't looked at it in detail, but something like this might make for an easier model than a profile-based one and still avoid regressions.
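That per-edge guard could be sketched like so (a toy model of the rule just described, not code from the patch):

```python
def profitable(M, F):
    """Conservative rule: the total cost must improve AND no single incoming
    edge may get worse, so no path regresses whatever the frequencies are."""
    return (sum(F.values()) < sum(M.values())
            and all(F[c] <= M[c] for c in M))

# Rejected: the totals improve (4 < 5), but folding C1 costs more than
# materializing it, so the path through C1 would regress.
assert not profitable({"C1": 1, "C2": 4}, {"C1": 2, "C2": 2})

# Accepted: folding is no worse on either edge and strictly better in total.
assert profitable({"C1": 2, "C2": 3}, {"C1": 2, "C2": 2})
```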

Naturally, if you're aware of cases where we *need* to specifically bias towards a hot path with this kind of optimization, that's a different story and we should absolutely use the profile there. I'm just not currently aware of any such cases. 


https://reviews.llvm.org/D37467




