[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node
Jingyue Wu
jingyue at google.com
Wed May 21 14:18:43 PDT 2014
Hi Philip,
Thanks for your feedback! Quite thoughtful.
1. Is it CGP-specific or more generic?
I still think it is CGP-specific. The first two examples are probably too
simple, but the third example is more interesting:
(for simplicity, rewriting the example in pseudo-code)
if (cond) {
  p = a + 4
  *p
} else {
  q = b + 4
}
r = phi(p, q)
*r
With my imaginary optimization, the code becomes:
if (cond) {
  p = a + 4
  *p
} else {
  q = b + 4
}
r2 = phi(a, b)
r = r2 + 4
*r
When cond == true, the program executes two additions (p = a + 4 and r = r2
+ 4) instead of the one in the original code. Therefore, this transformation
may not be the right thing to do in earlier passes.
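To make the trade-off concrete, here is a small C sketch of the pseudo-code
above (the function and variable names are mine, purely illustrative): it
counts additions on each path, showing that the rewrite saves the registers
for p and q at the cost of one extra add when cond is true.

```c
#include <assert.h>

static int adds;                          /* counts executed additions */
static long add4(long x) { adds++; return x + 4; }

/* Original shape: p and q are computed in the branches and phi'd. */
long before(int cond, long a, long b) {
    long p = 0, q = 0;
    if (cond) { p = add4(a); }            /* ... *p ... */
    else      { q = add4(b); }
    long r = cond ? p : q;                /* r = phi(p, q) */
    return r;                             /* ... *r ... */
}

/* Rewritten shape: phi the bases, redo the +4 after the merge. */
long after(int cond, long a, long b) {
    if (cond) { (void)add4(a); }          /* p = a + 4, used by *p */
    long r2 = cond ? a : b;               /* r2 = phi(a, b) */
    long r = add4(r2);
    return r;                             /* ... *r ... */
}
```

Both versions compute the same address, but `after` performs two additions
on the cond == true path where `before` performs one.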
2. How does CGP decide profitability?
If an address is used multiple times (in memory or non-memory instructions),
folding it into a load/store may increase register pressure. So, to answer
your question, it's a performance consideration rather than a correctness
issue. The comment before IsProfitableToFoldIntoAddressingMode gives a very
nice example. Note that, as the comment says, this profitability computation
is just a rough heuristic and may be too conservative.
/// IsProfitableToFoldIntoAddressingMode - It is possible for the addressing
/// mode of the machine to fold the specified instruction into a load or store
/// that ultimately uses it. However, the specified instruction has multiple
/// uses. Given this, it may actually increase register pressure to fold it
/// into the load. For example, consider this code:
///
///     X = ...
///     Y = X+1
///     use(Y)   -> nonload/store
///     Z = Y+1
///     load Z
///
/// In this case, Y has multiple uses, and can be folded into the load of Z
/// (yielding load [X+2]). However, doing this will cause both "X" and "X+1"
/// to be live at the use(Y) line. If we don't fold Y into load Z, we use one
/// fewer register. Since Y can't be folded into "use(Y)" we don't increase
/// the number of computations either.
///
/// Note that this (like most of CodeGenPrepare) is just a rough heuristic. If
/// X was live across 'load Z' for other reasons, we actually *would* want to
/// fold the addressing mode in the Z case. This would make Y die earlier.
One exception where CGP does sink an address with multiple uses is when all
uses of the address are ultimately load/store/inlineasm's. This is
actually the case for our example: both *p and *r are memory uses.
Therefore, the issue here is relatively minor: MatchOperationAddr does not
track PHI nodes, and thus doesn't realize that "p = a + 4" can be folded into
*phi(p, q).
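To illustrate what that folding would buy us, here is a hypothetical C
rendering of the desired result for the third example (my own sketch, not
actual CGP output): phi the bases a and b rather than the derived pointers
p and q, so each load recomputes the shared +4 offset and can use simple
reg+imm addressing.

```c
#include <assert.h>

/* Sketch of the desired shape: neither p nor q stays live across the
 * merge; both memory uses fold the +4 offset into their addressing. */
float folded(int cond, const float *a, const float *b) {
    float u = 0.0f;
    const float *base;            /* base = phi(a, b) */
    if (cond) {
        base = a;
        u = *(a + 4);             /* *p: folds to [a + 16] */
    } else {
        base = b;
    }
    return u + *(base + 4);       /* *r: folds to [base + 16] */
}
```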
Also, you are right in saying the profitability constraint should be "if
each load can fold", not "if each load can fold the same way". I missed
some logic there.
Thanks,
Jingyue
On Wed, May 21, 2014 at 11:04 AM, Philip Reames
<listmail at philipreames.com> wrote:
>
> On 05/20/2014 06:05 PM, Jingyue Wu wrote:
>
> Hi,
>
> I want to improve the way CGP sinks the incoming values of a PHI node
> towards memory accesses. Improving it means a lot to some of our key
> benchmarks, and I believe can benefit many general cases as well.
>
> I found your examples interesting so I'm responding, but keep in mind I
> know very little about the actual code in CGP. Feel free to tell me I
> don't know what I'm talking about. :)
>
>
> CGP's OptimizeMemoryInst function handles PHI nodes by running
> AddressingModeMatcher on all incoming values to see whether they reach
> consensus on the addressing mode. It does a reasonably good job in sinking
> a PHI node if its incoming values are used only once (by the PHI node) and
> share the same base register and offset. For example, CGP transforms
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> to
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then: ; preds = %entry
> br label %merge
>
> if.else: ; preds = %entry
> br label %merge
>
> merge:                                           ; preds = %if.else, %if.then
> %0 = bitcast float* %a to i8*
> %sunkaddr = getelementptr i8* %0, i64 16
> %1 = bitcast i8* %sunkaddr to float*
> %w1 = load float* %1
> ret void
> }
>
> saving registers on %p1 and %q1.
>
> I may be missing something here, but other than the type casts, why is
> this optimization specific to CGP? It seems like you've essentially just
> folded away a redundant phi. Shouldn't this be much earlier in the
> optimizer pipeline?
>
>
> However, this approach misses optimization opportunities in two cases.
> First, if an incoming value of a PHI node is used in multiple memory
> instructions, CGP may not be able to sink it.
>
> For example, if we slightly complicate the above example by adding a
> load after %p1,
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> %u1 = load float* %p1
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> CGP is unable to sink %p1 any more because, given that %p1 is used multiple
> times, CGP checks whether %p1 can be folded into every memory use (in this
> case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It
> does that by dry-running MatchOperationAddr on both loads; however,
> MatchOperationAddr bails out on PHI nodes.
>
> To phrase this differently, we currently decide it's not *profitable* if
> there are extra loads, right? I want to make sure there isn't a correctness
> issue I'm missing.
>
> If so, I'm not quite sure why we're requiring the address modes of two
> unrelated loads to match here. We should be able to turn this into:
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u1
> %u1 = load float* %p1
> br label %merge
>
> if.else:
> br label %merge
>
> merge:
> %r1 = getelementptr inbounds float* %a, i64 4 <-- folded into load w1
> %w1 = load float* %r1
> ret void
> }
>
> Unless I'm missing something - quite possible! - this would lead to better
> code. The profitability constraint should be "if each load can fold", not
> "if each load can fold the same way". Or am I wrong about this?
>
> Fundamentally, this still seems like a missed optimization in removing the
> phi, not an issue with CGP.
>
>
> The second complication is that the incoming values of a PHI node may have
> different base registers. For example, we slightly modify the original
> example by changing %q1's base to %b.
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %b, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> Ideally we still want to fold %p1 and %q1 into the load by introducing a
> new base phi'ing %a and %b, such as
>
> %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> %r1 = getelementptr inbounds float* %r, i64 4
> %w1 = load float* %r1
>
> saving registers on %p1 and %q1.
>
> However, the consensus that CGP requires all incoming values to reach
> includes not only the offset but also the base register. (a + 4) and (b +
> 4) unfortunately do not satisfy this condition. Therefore, CGP won't sink
> %p1 and %q1 toward BB %merge.
>
> Extending this code to handle separate base values seems quite
> reasonable. You could generalize this a lot, but starting with different
> base values (i.e. insert a phi, use that) seems like a good starting
> place.
>
> Again, this really sounds like something which should be earlier in the
> optimization pipeline though. The heart of this is simply removing an
> unnecessary phi.
>
>
> Our key benchmarks have both cases I described above. A simplified version
> looks like:
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> %u1 = load float* %p1
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %b, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> so we need a solution that attacks both issues. While we can solve the
> second issue by introducing a new base register leveraging SSAUpdater, I am
> not sure how to fold this complicated logic into MatchOperationAddr, which
> updates a single target addressing mode. One possibility is to extend
> ExtAddrMode to support multiple base registers, but that may complicate a
> lot of things.
>
> While I am thinking about an appropriate solution, I would like to bring
> these issues up to get more feedback. I really appreciate your thoughts on
> this.
>
> Thanks!
> Jingyue
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu
> http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>