[LLVMdev] [CodeGenPrepare] Sinking incoming values of a PHI Node
Jingyue Wu
jingyue at google.com
Wed May 21 14:18:43 PDT 2014
Hi Philip,
Thanks for your feedback! Quite thoughtful.
1. Is it CGP-specific or more generic?
I still think it is CGP-specific. The first two examples are probably too
simple, but the third example is more interesting:
(for simplicity, rewriting the example in pseudo-code)
if (cond) {
  p = a + 4
  *p
} else {
  q = b + 4
}
r = phi(p, q)
*r
With my imaginary optimization, the code becomes:
if (cond) {
  p = a + 4
  *p
} else {
  q = b + 4
}
r2 = phi(a, b)
r = r2 + 4
*r
When cond == true, the program executes two additions (p = a + 4 and r = r2
+ 4) instead of the one in the original code. Therefore, this transformation
may not be the right thing to do in earlier passes.
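To make the trade-off concrete, here is a small C sketch of the pseudo-code
above (the function and variable names are mine, purely illustrative): it
counts additions on each path, showing that the rewrite saves the registers
for p and q at the cost of one extra add when cond is true.

```c
#include <assert.h>

static int adds;                          /* counts executed additions */
static long add4(long x) { adds++; return x + 4; }

/* Original shape: p and q are computed in the branches and phi'd. */
long before(int cond, long a, long b) {
    long p = 0, q = 0;
    if (cond) { p = add4(a); }            /* ... *p ... */
    else      { q = add4(b); }
    long r = cond ? p : q;                /* r = phi(p, q) */
    return r;                             /* ... *r ... */
}

/* Rewritten shape: phi the bases, redo the +4 after the merge. */
long after(int cond, long a, long b) {
    if (cond) { (void)add4(a); }          /* p = a + 4, used by *p */
    long r2 = cond ? a : b;               /* r2 = phi(a, b) */
    long r = add4(r2);
    return r;                             /* ... *r ... */
}
```

Both versions compute the same address, but `after` performs two additions
on the cond == true path where `before` performs one.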
2. How does CGP decide profitability?
If an address is used multiple times (in memory or non-memory instructions),
folding it into a load/store may increase register pressure. So, to answer
your question, it's a performance consideration rather than a correctness
issue. The comment before IsProfitableToFoldIntoAddressingMode gives a very
nice example. Note that, as the comment says, this profitability computation
is just a rough heuristic and may be too conservative.
/// IsProfitableToFoldIntoAddressingMode - It is possible for the addressing
/// mode of the machine to fold the specified instruction into a load or store
/// that ultimately uses it. However, the specified instruction has multiple
/// uses. Given this, it may actually increase register pressure to fold it
/// into the load. For example, consider this code:
///
///     X = ...
///     Y = X+1
///     use(Y)   -> nonload/store
///     Z = Y+1
///     load Z
///
/// In this case, Y has multiple uses, and can be folded into the load of Z
/// (yielding load [X+2]). However, doing this will cause both "X" and "X+1"
/// to be live at the use(Y) line. If we don't fold Y into load Z, we use one
/// fewer register. Since Y can't be folded into "use(Y)" we don't increase
/// the number of computations either.
///
/// Note that this (like most of CodeGenPrepare) is just a rough heuristic. If
/// X was live across 'load Z' for other reasons, we actually *would* want to
/// fold the addressing mode in the Z case. This would make Y die earlier.
One exception where CGP does sink an address with multiple uses is when all
uses of the address are ultimately load/store/inlineasm's. This is
actually the case for our example: both *p and *r are memory uses.
Therefore, the issue here is relatively minor: MatchOperationAddr does not
track PHI nodes, and thus doesn't realize that "p = a + 4" can be folded into
*phi(p, q).
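To illustrate what that folding would buy us, here is a hypothetical C
rendering of the desired result for the third example (my own sketch, not
actual CGP output): phi the bases a and b rather than the derived pointers
p and q, so each load recomputes the shared +4 offset and can use simple
reg+imm addressing.

```c
#include <assert.h>

/* Sketch of the desired shape: neither p nor q stays live across the
 * merge; both memory uses fold the +4 offset into their addressing. */
float folded(int cond, const float *a, const float *b) {
    float u = 0.0f;
    const float *base;            /* base = phi(a, b) */
    if (cond) {
        base = a;
        u = *(a + 4);             /* *p: folds to [a + 16] */
    } else {
        base = b;
    }
    return u + *(base + 4);       /* *r: folds to [base + 16] */
}
```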
Also, you are right in saying the profitability constraint should be "if
each load can fold", not "if each load can fold the same way". I missed
some logic there.
Thanks,
Jingyue
On Wed, May 21, 2014 at 11:04 AM, Philip Reames
<listmail at philipreames.com> wrote:
>
> On 05/20/2014 06:05 PM, Jingyue Wu wrote:
>
> Hi,
>
> I want to improve the way CGP sinks the incoming values of a PHI node
> towards memory accesses. Improving it means a lot to some of our key
> benchmarks, and I believe can benefit many general cases as well.
>
> I found your examples interesting so I'm responding, but keep in mind I
> know very little about the actual code in CGP. Feel free to tell me I
> don't know what I'm talking about. :)
>
>
> CGP's OptimizeMemoryInst function handles PHI nodes by running
> AddressingModeMatcher on all incoming values to see whether they reach
> consensus on the addressing mode. It does a reasonably good job in sinking
> a PHI node if its incoming values are used only once (by the PHI node) and
> share the same base register and offset. For example, CGP transforms
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> to
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then: ; preds = %entry
> br label %merge
>
> if.else: ; preds = %entry
> br label %merge
>
> merge:                                           ; preds = %if.else, %if.then
> %0 = bitcast float* %a to i8*
> %sunkaddr = getelementptr i8* %0, i64 16
> %1 = bitcast i8* %sunkaddr to float*
> %w1 = load float* %1
> ret void
> }
>
> saving registers on %p1 and %q1.
>
> I may be missing something here, but other than the type casts, why is
> this optimization specific to CGP? It seems like you've essentially just
> folded away a redundant phi. Shouldn't this be much earlier in the
> optimizer pipeline?
>
>
> However, this approach misses optimization opportunities in two cases.
> First, if an incoming value of a PHI node is used in multiple memory
> instructions, CGP may not be able to sink it.
>
> For example, if we slightly complicate the above example by adding a
> load after %p1,
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> %u1 = load float* %p1
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> CGP is unable to sink %p1 any more because, given that %p1 is used multiple
> times, CGP checks whether %p1 can be folded into every memory use (in this
> case, %u1 = load %p1 and %w1 = load %r1) with the same addressing mode. It
> does that by dry-running MatchOperationAddr on both loads; however,
> MatchOperationAddr bails out on PHI nodes.
>
> To phrase this differently, we currently decide it's not *profitable* if
> there are extra loads, right? I want to make sure there isn't a correctness
> issue I'm missing.
>
> If so, I'm not quite sure why we're requiring the address modes of two
> unrelated loads to match here. We should be able to turn this into:
>
> define void @foo(i1 %cond, float* %a) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4 <-- folded into load u1
> %u1 = load float* %p1
> br label %merge
>
> if.else:
> br label %merge
>
> merge:
> %r1 = getelementptr inbounds float* %a, i64 4 <-- folded into load w1
> %w1 = load float* %r1
> ret void
> }
>
> Unless I'm missing something - quite possible! - this would lead to better
> code. The profitability constraint should be "if each load can fold", not
> "if each load can fold the same way". Or am I wrong about this?
>
> Fundamentally, this still seems like a missed optimization in removing the
> phi, not an issue with CGP.
>
>
> The second complication is that the incoming values of a PHI node may have
> different base registers. For example, we slightly modify the original
> example by changing %q1's base to %b.
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %b, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> Ideally we still want to fold %p1 and %q1 into the load by introducing a
> new base phi'ing %a and %b, such as
>
> %r = phi float* [ %a, %if.then ], [ %b, %if.else ]
> %r1 = getelementptr inbounds float* %r, i64 4
> %w1 = load float* %r1
>
> saving registers on %p1 and %q1.
>
> However, the consensus that CGP requires all incoming values to reach
> includes not only the offset but also the base register. (a + 4) and (b +
> 4) unfortunately do not satisfy this condition. Therefore, CGP won't sink
> %p1 and %q1 toward BB %merge.
>
> Extending this code to handle separate base values seems quite
> reasonable. You could generalize this a lot, but starting with different
> base values (i.e. insert a phi, use that) seems like a good starting
> place.
>
> Again, this really sounds like something which should be earlier in the
> optimization pipeline though. The heart of this is simply removing an
> unnecessary phi.
>
>
> Our key benchmarks have both cases I described above. A simplified version
> looks like:
>
> define void @foo(i1 %cond, float* %a, float* %b) {
> entry:
> br i1 %cond, label %if.then, label %if.else
>
> if.then:
> %p1 = getelementptr inbounds float* %a, i64 4
> %u1 = load float* %p1
> br label %merge
>
> if.else:
> %q1 = getelementptr inbounds float* %b, i64 4
> br label %merge
>
> merge:
> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]
> %w1 = load float* %r1
> ret void
> }
>
> so we need a solution that attacks both issues. While we can solve the
> second issue by introducing a new base register leveraging SSAUpdater, I am
> not sure how to fold this complicated logic into MatchOperationAddr, which
> updates a single target addressing mode. One possibility is to extend
> ExtAddrMode to support multiple base registers, but that may complicate a
> lot of things.
>
> While I am thinking about an appropriate solution, I would like to bring
> these issues up to get more feedback. I really appreciate your thoughts on
> this.
>
> Thanks!
> Jingyue
>
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu
> http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>