[PATCH] D37467: Add a new pass to speculate around PHI nodes with constant (integer) operands when profitable.

Tue Oct 17 11:16:06 PDT 2017

chandlerc added inline comments.

================
Comment at: llvm/lib/Transforms/Scalar/SpeculateAroundPHIs.cpp:256
+
+  DEBUG(dbgs() << "    Cost savings " << (MatCost - FoldedImmCost) << ": " << PN
+               << "\n");
----------------
davidxl wrote:
> chandlerc wrote:
> > chandlerc wrote:
> > > davidxl wrote:
> > > > chandlerc wrote:
> > > > > davidxl wrote:
> > > > > > The cost analysis here assumes non-speculative case where the uses of phi  are fully anticipated. Consider the real speculative case:
> > > > > > 
> > > > > > 
> > > > > > ```
> > > > > > 
> > > > > > /// entry:
> > > > > > ///     br i1 %flag, label %a, label %b
> > > > > > ///
> > > > > > ///   a:
> > > > > > ///     br label %merge
> > > > > > ///
> > > > > > ///   b:
> > > > > > ///     br label %merge
> > > > > > ///
> > > > > > ///   merge:
> > > > > > ///     %p = phi i32 [ 7, %a ], [ 11, %b ]
> > > > > > ///     br i1 %flag2, label %use, label %exit
> > > > > > ///
> > > > > > ///   use:
> > > > > > ///     %sum = add i32 %arg, %p
> > > > > > ///     br label %exit
> > > > > > ///
> > > > > > ///   exit:
> > > > > > ///    %s = phi ([0, ...], [%sum, ..])
> > > > > > ///     ret i32 %s
> > > > > > 
> > > > > > ```
> > > > > > Hoisting the add into %a and %b is speculative. Additional speculation cost is involved depending on the cost of add and relative frequency of %a+%b vs %use.
> > > > > > 
> > > > > > 
> > > > > > consider another case with triagular cfg:
> > > > > > 
> > > > > > ```
> > > > > > 
> > > > > > /// entry:
> > > > > > ///     br i1 %flag, label %a, label %merge
> > > > > > ///   a:
> > > > > > ///     br label %merge
> > > > > > ///
> > > > > > ///   merge:
> > > > > > ///     %p = phi i32 [ 7, %a ], [ 11, %entry ]
> > > > > > ///     %sum = add i32 %arg, %p
> > > > > > ///    ret i32 %sum
> > > > > > 
> > > > > > ```
> > > > > > 
> > > > > > hoisting 'add' into entry and '%a' introduces runtime overhead.
> > > > > Ah, this explains perhaps the comment above regarding BFI.
> > > > > 
> > > > > The second case was definitely unintended. I can add something to prevent this. I'm not sure we *ever* want to do that case.
> > > > > 
> > > > > For the first case, I'm not so sure. If the 'add' is the same cost as materializing the constant, and we'll have to materialize the constant anyways, is this really a problem? We can suppress this too if so.
> > > > > 
> > > > > I'm somewhat interested in simpler cost modeling and just eliminating the hard cases here if possible (IE, until we see places where the fancier approach is needed).
> > > > As an example for the first case. Suppose there are two incoming constants C1, and C2.     M(C1) and M(C2) are the cost of materializing the constants, and F(C1) and F(C2) are the costs of folding them into UI.   Assuming the frequency of two incoming blocks are P1 and P2.
> > > > 
> > > > The total cost before will be:   P1 * M(C1) + P2 * M (C2), and the total cost after is P1 * F(C1) + P2 * F(C2).
> > > > 
> > > > The comparison can be reduced to comparing  M(C1) + M(C2) with F(C1) + F(C2) only when
> > > > 1) P1 == P2
> > > > or
> > > > 2) M(C1) == M(C2) and F(C1) == F(C2)
> > > > 
> > > > Depending on the value of C1 and C2, 2) may not be true.
> > > > 
> > > I should clarify why the second case you mentioned originally cannot in fact happen:
> > > 
> > > The code splits critical edges so that it should not be changing the total number of instructions executed along any given path. The speculation goes into a non-critical edge, and so it shouldn't go onto a path that wouldn't have executed that exact instruction anyways.
> > > 
> > > And again, there are already tests of this.
> > > 
> > > For the first case, I see what you're getting at with this example. However, I'm worried about scaling by probability. My concern is that I'm a little worried about doing way too much speculation just to remove a constant materialization in the case of fairly extremely hot paths. A very minor improvement along a very hot path could have nearly unbounded code size regressions along the cold path.
> > > 
> > > An alternative, simpler approach would be to check that, in addition to `F(C1) + F(C2) < M(C1) + M(C2)` being true, `F(C1) <= M(C1) && F(C2) <= M(C2)`. That should ensure that no path regresses its dynamic cost, and the total isn't a size regression overall. I haven't looked at it in detail, but something like this might make for an easier model than profile and still avoid regressions.
> > > 
> > > Naturally, if you're aware of cases where we *need* to specifically bias towards a hot path with this kind of optimization, that's a different story and we should absolutely use the profile there. I'm just not currently aware of any such cases. 
> > I've implemented this and it isn't bad at all.
> > 
> > When initially establishing if there is any savings to be had from speculating, we check that no incoming constant becomes worse with folding, and then compute the total savings.
> > 
> > I found a way to test for it, but it is somewhat lame because the *speculation* cost (which is separate from the cost mentioned here) ends up too high anyways. However, I've added the test case and verified that in fact we catch the lack of profitability earlier with the new code. Hopefully this addresses the last concern here. We can of course revisit this and add profile-based heuristics if there is a need for them, but all the test cases I have are happy with the strict "no regression" approach.
> Can you clarify 'the second case you mentioned originally cannot in fact happen"?
> 
> I don't see a test case with it. The similar one is @test_no_spec_dominating_phi, but that the phi use of that case still post dominates the phi so it is not really speculative hoisting.  If you move the use into '%a' or '%b' block, the hoisting will actually change the dynamic instruction counts.
Wow, sorry, I missed adding this test case, sorry about that. Just updated patch to include it. You can find the code that handles this by looking at critical edge splitting.

https://reviews.llvm.org/D37467