<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
<div class="moz-cite-prefix">On 05/20/2014 06:05 PM, Jingyue Wu
wrote:<br>
</div>
<blockquote
cite="mid:CAMROOrFV6obBMRnxUubYfES+Kmgx4tMpUgUYAJjQKgEXX0EbgA@mail.gmail.com"
type="cite">
<div dir="ltr">Hi,
<div><br>
</div>
<div>I want to improve the way CGP sinks the incoming values of
a PHI node towards memory accesses. Improving it means a lot
to some of our key benchmarks, and I believe can benefit many
general cases as well. <br>
</div>
</div>
</blockquote>
I found your examples interesting so I'm responding, but keep in
mind I know very little about the actual code in CGP. Feel free to
tell me I don't know what I'm talking about. :)<br>
<blockquote
cite="mid:CAMROOrFV6obBMRnxUubYfES+Kmgx4tMpUgUYAJjQKgEXX0EbgA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>CGP's OptimizeMemoryInst function handles PHI nodes by
running AddressingModeMatcher on all incoming values to see
whether they reach consensus on the addressing mode. It does a
reasonably good job in sinking a PHI node if its incoming
values are used only once (by the PHI node) and share the same
base register and offset. For example, CGP transforms</div>
<div>
<div><br>
</div>
<div>
<div>define void @foo(i1 %cond, float* %a) {</div>
<div>entry:</div>
<div> br i1 %cond, label %if.then, label %if.else</div>
<div><br>
</div>
<div>if.then:</div>
<div> %p1 = getelementptr inbounds float* %a, i64 4</div>
<div> br label %merge<br>
</div>
<div><br>
</div>
<div>if.else:</div>
<div> %q1 = getelementptr inbounds float* %a, i64 4</div>
<div> br label %merge</div>
<div><br>
</div>
<div>merge:</div>
<div> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]</div>
<div> %w1 = load float* %r1</div>
<div> ret void</div>
<div>}</div>
</div>
</div>
<div><br>
</div>
<div>to</div>
<div><br>
</div>
<div>
<div>define void @foo(i1 %cond, float* %a) {</div>
<div>entry:</div>
<div> br i1 %cond, label %if.then, label %if.else</div>
<div><br>
</div>
<div>if.then: ; preds
= %entry</div>
<div> br label %merge</div>
<div><br>
</div>
<div>if.else: ; preds
= %entry</div>
<div> br label %merge</div>
<div><br>
</div>
<div>merge: ; preds
= %if.else, %if.then</div>
<div> %0 = bitcast float* %a to i8*</div>
<div> %sunkaddr = getelementptr i8* %0, i64 16</div>
<div> %1 = bitcast i8* %sunkaddr to float*</div>
<div> %w1 = load float* %1</div>
<div> ret void</div>
<div>}</div>
</div>
<div><br>
</div>
<div>saving registers on %p1 and %q1. <br>
</div>
</div>
</blockquote>
I may be missing something here, but other than the type casts, why
is this optimization specific to CGP? It seems like you've
essentially just folded away a redundant phi. Shouldn't this be
much earlier in the optimizer pipeline? <br>
<blockquote
cite="mid:CAMROOrFV6obBMRnxUubYfES+Kmgx4tMpUgUYAJjQKgEXX0EbgA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>However, this approach misses optimization opportunities in
two cases. First, if an incoming value of a PHI node is used
in multiple memory instructions, CGP may not be able to sink
it. </div>
<div><br>
</div>
<div>For example, if we slightly complicate the above example by
adding a load after %p1, </div>
<div><br>
</div>
<div>
<div>
<div>define void @foo(i1 %cond, float* %a) {</div>
<div>entry:</div>
<div> br i1 %cond, label %if.then, label %if.else</div>
<div><br>
</div>
<div>if.then:</div>
<div> %p1 = getelementptr inbounds float* %a, i64 4</div>
<div> %u1 = load float* %p1</div>
<div> br label %merge<br>
</div>
<div><br>
</div>
<div>if.else:</div>
<div> %q1 = getelementptr inbounds float* %a, i64 4</div>
<div> br label %merge</div>
<div><br>
</div>
<div>merge:</div>
<div> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]</div>
<div> %w1 = load float* %r1</div>
<div> ret void</div>
<div>}</div>
</div>
</div>
<div><br>
</div>
<div>
CGP is unable to sink %p1 any more, because, given %p1 is used
multiple times, CGP checks whether %p1 can be folded into
every memory uses (in this case, %u1 = load %p1 and %w1 = load
%r1) with the same addressing mode. It does that by
dry-running MatchOperationAddr on both loads, however
MatchOperationAddr bails out on PHI nodes. <br>
</div>
</div>
</blockquote>
To phrase this differently, we currently decide it's not
*profitable* if there are extra loads right? I want to make sure
there isn't a correctness issue I'm missing. <br>
<br>
If so, I'm not quite sure why we're requiring the address modes of
two unrelated loads to match here. We should be able to turn this
into:<br>
<br>
<div>
<div>
<div>define void @foo(i1 %cond, float* %a) {</div>
<div>entry:</div>
<div> br i1 %cond, label %if.then, label %if.else</div>
<div><br>
</div>
<div>if.then:</div>
<div> %p1 = getelementptr inbounds float* %a, i64 4 <--
folded into load u1<br>
</div>
<div> %u1 = load float* %p1</div>
<div> br label %merge<br>
</div>
<div><br>
</div>
<div>if.else:</div>
br label %merge
<div><br>
</div>
<div>merge:<br>
%r1 = getelementptr inbounds float* %a, i64 4 <-- folded
into load u2<br>
</div>
<div> %w1 = load float* %r1</div>
<div> ret void</div>
<div>}<br>
<br>
Unless I'm missing something - quite possible! - this would
lead to better code. The profitability constraint should be
"if each load can fold", not "if each load can fold the same
way". Or am I wrong about this?<br>
<br>
Fundamentally, this still seems like a missed optimization in
removing the phi, not an issue with CGP. <br>
</div>
</div>
</div>
<br>
<blockquote
cite="mid:CAMROOrFV6obBMRnxUubYfES+Kmgx4tMpUgUYAJjQKgEXX0EbgA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>The second complication is the incoming values of a PHI
node may have different base registers. For example, we
slightly modify the original example by changing %q1's base to
%b.</div>
<div><br>
</div>
<div>
<div>
<div>define void @foo(i1 %cond, float* %a, float* %b) {</div>
<div>entry:</div>
<div> br i1 %cond, label %if.then, label %if.else</div>
<div><br>
</div>
<div>if.then:</div>
<div> %p1 = getelementptr inbounds float* %a, i64 4</div>
<div> br label %merge<br>
</div>
<div><br>
</div>
<div>if.else:</div>
<div> %q1 = getelementptr inbounds float* %b, i64 4</div>
<div> br label %merge</div>
<div><br>
</div>
<div>merge:</div>
<div> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]</div>
<div> %w1 = load float* %r1</div>
<div> ret void</div>
<div>}</div>
</div>
</div>
<div><br>
</div>
<div>Ideally we still want to fold %p1 and %q1 into the load by
introducing a new base phi'ing %a and %b, such as</div>
<div>
<br>
</div>
<div>
<div>%r = phi float* [ %a, %if.then ], [ %b, %if.else ]</div>
<div>%r1 = getelementptr inbounds float* %r, i64 4</div>
<div>%w1 = load float* %r1</div>
</div>
<div><br>
</div>
<div>saving registers on %p1 and %q1. </div>
<div><br>
</div>
<div>However, the consensus CGP requires all incoming values to
reach includes not only the offset but also the base register.
(a + 4) and (b + 4) unfortunately do not satisfy this
condition. Therefore, CGP won't sink %p1 and %q1 toward BB
%merge. <br>
</div>
</div>
</blockquote>
Extending this code to handle separate base values seems quite
reasonable. You could generalize this a lot, but starting with
different base values (i.e. insert a phi, use that) seems like a
good starting place. <br>
<br>
Again, this really sounds like something which should be earlier in
the optimization pipeline though. The heart of this is simply
removing an unnecessary phi. <br>
<blockquote
cite="mid:CAMROOrFV6obBMRnxUubYfES+Kmgx4tMpUgUYAJjQKgEXX0EbgA@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>Our key benchmarks have both cases I described above. A
simplified code looks like:</div>
<div><br>
</div>
<div>
<div>
<div>define void @foo(i1 %cond, float* %a, float* %b) {</div>
<div>entry:</div>
<div> br i1 %cond, label %if.then, label %if.else</div>
<div><br>
</div>
<div>if.then:</div>
<div> %p1 = getelementptr inbounds float* %a, i64 4</div>
<div> %u1 = load float* %p1</div>
<div> br label %merge<br>
</div>
<div><br>
</div>
<div>if.else:</div>
<div> %q1 = getelementptr inbounds float* %b, i64 4</div>
<div> br label %merge<br>
</div>
<div><br>
</div>
<div>merge:</div>
<div> %r1 = phi float* [ %p1, %if.then ], [ %q1, %if.else ]</div>
<div> %w1 = load float* %r1</div>
<div> ret void</div>
<div>}</div>
</div>
</div>
<div><br>
</div>
<div>so we need a solution to attack both issues. While we can
solve the second issue by introducing a new base register
leveraging SSAUpdater, I am not sure how to fold this
complicated logic in MatchOperationAddr which updates one
single target addressing mode. One possibility is to extend
ExtAddrMode to support multiple base registers, which may
complicate a lot of things.</div>
<div><br>
</div>
<div>While I am thinking about an appropriate solution, I would
like to bring these issues up to get more feedback. I really
appreciate your thoughts on this. </div>
<div><br>
</div>
<div>Thanks! </div>
<div>Jingyue</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
LLVM Developers mailing list
<a class="moz-txt-link-abbreviated" href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a class="moz-txt-link-freetext" href="http://llvm.cs.uiuc.edu">http://llvm.cs.uiuc.edu</a>
<a class="moz-txt-link-freetext" href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a>
</pre>
</blockquote>
<br>
</body>
</html>