[PATCH] D28129: NewGVN: Sort Dominator Tree in RPO order, and use that for generating order.

Wed Dec 28 22:08:41 PST 2016

On 2016-12-28 20:09, Daniel Berlin wrote:
> On Wed, Dec 28, 2016 at 7:22 PM, Friedman, Eli
> <efriedma at codeaurora.org> wrote:
> 
>> On 12/28/2016 2:33 PM, Daniel Berlin wrote:
>> 
>> On Wed, Dec 28, 2016 at 1:18 PM, Friedman, Eli
>> <efriedma at codeaurora.org> wrote:
>> 
>> On 12/28/2016 1:03 PM, Daniel Berlin via llvm-commits wrote:
>> 
>> On Wed, Dec 28, 2016 at 7:04 AM, Davide Italiano via Phabricator
>> <reviews at reviews.llvm.org> wrote:
>> davide accepted this revision.
>> davide added a comment.
>> This revision is now accepted and ready to land.
>> 
>> Sorry for the slow response, I'm out('ish) of the office these days.
>> I took a close look at your patch.
>> 
>> No worries.
>> I happen to be lucky enough to hit a case in the wild where this
>> already matters.  The number of iteration goes down from hundreds to
>> ~10, which makes compile time/me happier.
>> 
>> yay.
>> 
>> The current code, excepting super-weird cases, should operate in
>> O(d+3) iterations, where d is the loop connectedness of the SSA
>> graph (not the CFG), which is the number of backedges in any path.
>> This will change when we move to equality propagation, but for now,
>> ...
>> We could  calculate this number and see if we are screwing up :)
>> 
>> For most programs, the loop connectedness of the SSA graph is the
>> same or less than the CFG.
>> 
>> However,  IIRC, we allow  dependent phis in the same block (this is
>> not strictly SSA, since all phi nodes are supposed to be evaluated
>> simultaneously).
> 
>  I'm not sure what you're trying to say here?  PHI nodes for a given
> basic block are evaluated simultaneously. From LangRef: "For the
> purposes of the SSA form, the use of each incoming value is deemed to
> occur on the edge from the corresponding predecessor block to the
> current block (but after any definition of an ‘invoke‘
> instruction’s return value on the same edge)."
> 
> I'm saying we've had mailing list arguments about this, about whether
> there is any ordering among phi nodes in a given block.  The part you
> quote from the langref does not actually definitively answer that
> (again, there is no argument in theory.  In the abstract, the answer
> is "there is no ordering, it's undefined to have phis depend in the
> same block depend on each other")
> 
> Given
> 
> b = phi(d, e)
> a = phi(b, c)
> 
> Saying "is deemed to occur on the edge of the corresponding
> predecessor block" does not help.
>  Consider the following function:
> 
> void f(int a, int b, int g(int, int)) {
>   while (g(a, b)) { int x = a; a = b; b = x; }
> }
> 
> mem2reg produces this:
> 
> define void @f(i32 %a, i32 %b, i32 (i32, i32)* %g) #0 {
> entry:
>   br label %while.cond
> 
> while.cond:                                       ; preds =
> %while.body, %entry
>   %a.addr.0 = phi i32 [ %a, %entry ], [ %b.addr.0, %while.body ]
>   %b.addr.0 = phi i32 [ %b, %entry ], [ %a.addr.0, %while.body ]
>   %call = call i32 %g(i32 %a.addr.0, i32 %b.addr.0)
>   %tobool = icmp ne i32 %call, 0
>   br i1 %tobool, label %while.body, label %while.end
> 
> while.body:                                       ; preds =
> %while.cond
>   br label %while.cond
> 
> while.end:                                        ; preds =
> %while.cond
>   ret void
> }
> 
> A "phi" works in the only way which allows this IR to match the
> semantics of the C code.
> 
> I'm not sure i believe this, but i actually don't care enough to argue
> about it further :)
> 
> That said, i'm curious:
> 
> What are the expected semantics of the second case i presented?

Both your functions are equivalent to something like this:

void widget() {
   int aa = 5;
   int t1 = 0;
   while (true) {
     // dosomething(aa, t1);
     int x = aa; aa = t1 + 1; t1 = x;
   }
}

They're also equivalent to this:

define void @widget() {
entry:
   br label %bb2

bb2:                                              ; preds = %bb4, %entry
   %aa = phi i64 [ 5, %entry ], [ %t5, %bb4]
   %t1 = phi i64 [ 0, %entry ], [ %aacopy, %bb4 ]
   br label %bb4

bb4:                                              ; preds = %bb3, %bb2
   %t5 = add i64 %t1, 1
   %aacopy = add i64 %aa, 0
   br label %bb2
}

>> If you think LangRef isn't clear, suggestions are welcome.
> 
> I would be explicit that phi nodes may depend on each other, and what
> the expected evaluation order actually is (if it's
> "as-they-appear-in-IR", say that.)
> It looks something like "dependencies are allowed, any order is
> allowed despites dependencies".

There is no evaluation order; alternatively, every possible evaluation 
order is equivalent.  If a PHI node refers to another PHI node in the 
same basic block, it's actually referring to the value that PHI node had 
in the predecessor.

>>> Popping back up, regardless of resolution, this causes the issue i
>>> mentioned above - it may require more iterations to resolve
>>> because of the second case passing verification.  If we really
>>> want phi nodes to be executable , and want it to take the minimum
>>> number of iterations to converge NewGVN, we need to process aa
>>> before t1.
>>> 
>>> Otherwise, we will process t1, get some value, *then* process aa,
>>> and immediately mark t1 as needing to be reprocessed since it is a
>>> use of aa.  We effectively waste an iteration because all of t1's
>>> uses have are going to have the wrong value.
>> 
>> It looks like NewGVN creates one less congruence class if you
>> process them in the "right" order.
> 
> Well, no. It's not about creating the congruence class, it's about
> making it a reverse post-order evaluation .  In your example above,
> there is no single RPO order due to the two cycles.
> In my example, any valid  RPO order of the SSA graph must visit aa
> before t1.
> This is *normally* taken care of by evaluating instructions in block
> order, and ordering the CFG in RPO.   But in the case i gave, it is
> not enough.
> You would have to explicitly sort the phi nodes separately, since we
> allow both orderings of the phi nodes in LLVM,despite only one being
> RPO (and only one having defs dominate uses).

This sort of goes back to the discussion above... fundamentally both 
your example and mine are cyclical.  You've just found a rule for 
quickly processing a certain narrow class of cycles.

-----

Another slightly more complicated case:

define void @cyclical_adds() {
entry:
   br label %bb2

bb2:                                              ; preds = %bb4, %entry
   %aa = phi i64 [ 5, %entry ], [ %t5, %bb4]
   %t1 = phi i64 [ 0, %entry ], [ %aa1, %bb4 ]
   br label %bb4

bb4:                                              ; preds = %bb3, %bb2
   ; a bunch of code using aa and t1
   %t5 = add i64 %t1, 1
   %aa1 = add i64 %aa, 1
   br label %bb2
}

If you can optimize the evaluation order here, I think the solution 
would also cover your original example.

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a 
Linux Foundation Collaborative Project