[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node

Wed May 21 11:04:36 PDT 2014

On 05/20/2014 06:05 PM, Jingyue Wu wrote:
> Hi,
>
> I want to improve the way CGP sinks the incoming values of a PHI node 
> towards memory accesses. Improving it means a lot to some of our key 
> benchmarks, and I believe can benefit many general cases as well.
I found your examples interesting so I'm responding, but keep in mind I 
know very little about the actual code in CGP.  Feel free to tell me I 
don't know what I'm talking about.  :)
>
> CGP's OptimizeMemoryInst function handles PHI nodes by running 
> AddressingModeMatcher on all incoming values to see whether they reach 
> consensus on the addressing mode. It does a reasonably good job in 
> sinking a PHI node if its incoming values are used only once (by the 
> PHI node) and share the same base register and offset. For example, 
> CGP transforms
>
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> to
>
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:                                          ; preds = %entry
>   br label %merge
>
> if.else:                                          ; preds = %entry
>   br label %merge
>
> merge:                                            ; preds = %if.else, 
> %if.then
>   %0 = bitcast float* %a to i8*
>   %sunkaddr = getelementptr i8* %0, i64 16
>   %1 = bitcast i8* %sunkaddr to float*
>   %w1 = load float* %1
>   ret void
> }
>
> saving registers on %p1 and %q1.
I may be missing something here, but other than the type casts, why is 
this optimization specific to CGP?  It seems like you've essentially 
just folded away a redundant phi.  Shouldn't this be much earlier in the 
optimizer pipeline?
>
> However, this approach misses optimization opportunities in two cases. 
> First, if an incoming value of a PHI node is used in multiple memory 
> instructions, CGP may not be able to sink it.
>
> For example, if we slightly complicate the above example by adding a 
> load after %p1,
>
> define void @foo(i1 %cond, float* %a) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> CGP is unable to sink %p1 any more, because, given %p1 is used 
> multiple times, CGP checks whether %p1 can be folded into every memory 
> uses (in this case, %u1 = load %p1 and %w1 = load %r1) with the same 
> addressing mode. It does that by dry-running MatchOperationAddr on 
> both loads, however MatchOperationAddr bails out on PHI nodes.
To phrase this differently, we currently decide it's not *profitable* if 
there are extra loads right?  I want to make sure there isn't a 
correctness issue I'm missing.

If so, I'm not quite sure why we're requiring the address modes of two 
unrelated loads to match here. We should be able to turn this into:

define void @foo(i1 %cond, float* %a) {
entry:
   br i1 %cond, label %if.then, label %if.else

if.then:
   %p1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u1
   %u1 = load float* %p1
   br label %merge

if.else:
   br label %merge

merge:
   %r1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u2
   %w1 = load float* %r1
   ret void
}

Unless I'm missing something - quite possible! - this would lead to 
better code.  The profitability constraint should be "if each load can 
fold", not "if each load can fold the same way".  Or am I wrong about this?

Fundamentally, this still seems like a missed optimization in removing 
the phi, not an issue with CGP.

>
> The second complication is the incoming values of a PHI node may have 
> different base registers. For example, we slightly modify the original 
> example by changing %q1's base to %b.
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> Ideally we still want to fold %p1 and %q1 into the load by introducing 
> a new base phi'ing %a and %b, such as
>
> %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> %r1 = getelementptr inbounds float* %r, i64 4
> %w1 = load float* %r1
>
> saving registers on %p1 and %q1.
>
> However, the consensus CGP requires all incoming values to reach 
> includes not only the offset but also the base register. (a + 4) and 
> (b + 4) unfortunately do not satisfy this condition. Therefore, CGP 
> won't sink %p1 and %q1 toward BB %merge.
Extending this code to handle separate base values seems quite 
reasonable.  You could generalize this a lot, but starting with 
different base values (i.e. insert a phi, use that) seems like a good 
starting place.

Again, this really sounds like something which should be earlier in the 
optimization pipeline though.  The heart of this is simply removing an 
unnecessary phi.
>
> Our key benchmarks have both cases I described above. A simplified 
> code looks like:
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
>   br i1 %cond, label %if.then, label %if.else
>
> if.then:
>   %p1 = getelementptr inbounds float* %a, i64 4
>   %u1 = load float* %p1
>   br label %merge
>
> if.else:
>   %q1 = getelementptr inbounds float* %b, i64 4
>   br label %merge
>
> merge:
>   %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
>   %w1 = load float* %r1
>   ret void
> }
>
> so we need a solution to attack both issues. While we can solve the 
> second issue by introducing a new base register leveraging SSAUpdater, 
> I am not sure how to fold this complicated logic in MatchOperationAddr 
> which updates one single target addressing mode. One possibility is to 
> extend ExtAddrMode to support multiple base registers, which may 
> complicate a lot of things.
>
> While I am thinking about an appropriate solution, I would like to 
> bring these issues up to get more feedback. I really appreciate your 
> thoughts on this.
>
> Thanks!
> Jingyue
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140521/e571f9de/attachment.html>