<div dir="ltr">No, I don't observe any speedup. </div><div class="gmail_extra"><br><br><div class="gmail_quote">On Sun, Apr 20, 2014 at 9:02 PM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">----- Original Message -----<br>

> From: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

> To: "Jingyue Wu" <<a href="mailto:jingyue@google.com">jingyue@google.com</a>><br>

> Cc: "Justin Holewinski" <<a href="mailto:jholewinski@nvidia.com">jholewinski@nvidia.com</a>>, <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a><br>

> Sent: Sunday, April 20, 2014 10:01:49 PM<br>

> Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs<br>

><br>

> ----- Original Message -----<br>

> > From: "Jingyue Wu" <<a href="mailto:jingyue@google.com">jingyue@google.com</a>><br>

> > To: "Hal Finkel" <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>><br>

> > Cc: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>, "Eli Bendersky" <<a href="mailto:eliben@google.com">eliben@google.com</a>>,<br>

> > "Justin Holewinski" <<a href="mailto:jholewinski@nvidia.com">jholewinski@nvidia.com</a>>, "Justin<br>

> > Holewinski" <<a href="mailto:justin.holewinski@gmail.com">justin.holewinski@gmail.com</a>><br>

> > Sent: Sunday, April 20, 2014 9:34:57 PM<br>

> > Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in<br>

> > a group of similar GEPs<br>

> ><br>

> ><br>

> > Hi Hal,<br>

> ><br>

> ><br>

> > Thanks for your comments! I'm inlining my responses below.<br>

> ><br>

> ><br>

> > Jingyue<br>

> ><br>

> ><br>

> ><br>

> > On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel < <a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a> ><br>

> > wrote:<br>

> ><br>

> ><br>

> > Jingyue,<br>

> ><br>

> > I can't speak for the NVPTX backend, but I think this looks useful<br>

> > as<br>

> > an (optional) target-independent pass. A few thoughts:<br>

> ><br>

> > - Running GVN tends to be pretty expensive; have you tried EarlyCSE<br>

> > instead? (When I was working on the BB vectorizer I was first using<br>

> > GVN for cleanup afterward, someone suggested trying EarlyCSE<br>

> > instead, the performance slowdown was a bit less than 1% on<br>

> > average,<br>

> > but the runtime impact was much less).<br>

> ><br>

> ><br>

> ><br>

> > EarlyCSE surprisingly generates code that runs 10% slower than GVN<br>

> > on<br>

> > one of our benchmarks. However, after looking into the ll and ptx,<br>

> > we probably shouldn't blame EarlyCSE for the slowdown. The ptx<br>

> > generated using EarlyCSE is only slightly different from that<br>

> > generated by GVN, but ptxas amplifies the seemingly unharmful<br>

> > difference to much more register usage. We need to further<br>

> > investigate this issue.<br>

> ><br>

> ><br>

> > I can change the code to use EarlyCSE by default, and leave a flag<br>

> > to<br>

> > run GVN.<br>

><br>

> I think it is also perfectly reasonable (and perhaps even makes<br>

> sense) to choose GVN for CodeGenOpt::Aggressive and EarlyCSE<br>

> otherwise.<br>

<br>

</div></div>Also, do you see any substantial speedups just from running a late GVN (or EarlyCSE) independent of the GEP optimization?<br>

<span class="HOEnZb"><font color="#888888"><br>

 -Hal<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

><br>

> ><br>

> ><br>

> ><br>

> > - Are you doing this on the IR level, instead of in DAGCombine,<br>

> > because you want the cross-block combining from GVN? Or some other<br>

> > reason (or both)?<br>

> ><br>

> ><br>

> ><br>

> > Cross-block combining from GVN is one of the reasons. The second<br>

> > reason is we observed a GEP and its index are sometimes defined in<br>

> > different BBs.<br>

><br>

> Makes sense.<br>

><br>

> > The third reason is we also observed a GEP and the<br>

> > load/store that uses it are sometimes in different BBs. Although<br>

> > CodeGenPrepare attempts to sink a GEP to the same BB as its<br>

> > load/store user, it doesn't sink GEPs with two variadic indices<br>

> > (e.g., gep %a, 0, %i, %j) because it "smartly" recognizes ptx<br>

> > doesn't support any addressing mode that can fold this GEP<br>

><br>

> That makes sense: because NVPTX does not support r+r.<br>

><br>

> > (This<br>

> > addressing mode issue is worth another thread, and one of my<br>

> > colleagues is working on that).<br>

><br>

> Looking forward to it.<br>

><br>

>  -Hal<br>

><br>

> ><br>

> ><br>

> ><br>

> > - To make this target independent, I think you just need to insert<br>

> > some calls to TLI.isLegalAddressingMode (or equivalently,<br>

> > TTI.isLegalAddressingMode) just to make sure that the offsets<br>

> > you're<br>

> > creating are legal on the target. This will essentially be a noop<br>

> > for NVPTX, but will matter for other targets.<br>

> ><br>

> ><br>

> ><br>

> > Ack'ed<br>

> ><br>

> ><br>

> ><br>

> > Thanks for posting this,<br>

> > Hal<br>

> ><br>

> ><br>

> ><br>

> > ----- Original Message -----<br>

> > > From: "Jingyue Wu" < <a href="mailto:jingyue@google.com">jingyue@google.com</a> ><br>

> > > To: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a> , "Eli Bendersky" < <a href="mailto:eliben@google.com">eliben@google.com</a> >,<br>

> > > "Justin Holewinski" < <a href="mailto:jholewinski@nvidia.com">jholewinski@nvidia.com</a> >, "Justin<br>

> > > Holewinski" < <a href="mailto:justin.holewinski@gmail.com">justin.holewinski@gmail.com</a> ><br>

> > > Sent: Saturday, April 19, 2014 12:02:28 AM<br>

> > > Subject: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a<br>

> > > group of similar GEPs<br>

> > ><br>

> > ><br>

> > ><br>

> > ><br>

> > > Hi,<br>

> > ><br>

> > ><br>

> > > We wrote an optimization that eliminates common sub-expressions<br>

> > > in<br>

> > > a<br>

> > > group of similar GEPs for the NVPTX backend. It speeds up some of<br>

> > > our benchmarks by up to 20%, which convinces us to try to<br>

> > > upstream<br>

> > > it. Here's a brief description of why we wrote this optimization,<br>

> > > what we did, and how we did it.<br>

> > ><br>

> > ><br>

> > > Loops in CUDA programs are often extensively unrolled by<br>

> > > programmers<br>

> > > and compilers, leading to many similar<br>

> > > GEPs for array accesses.<br>

> > ><br>

> > ><br>

> > > e.g., a 2-level loop like<br>

> > ><br>

> > ><br>

> > > __shared__ float a[32][32];<br>

> > > unroll for (int i = 0; i < 2; ++i) {<br>

> > > unroll for (int j = 0; j < 2; ++j) {<br>

> > > ...<br>

> > > ... = a[threadIdx.x + i][threadIdx.y + j];<br>

> > > ...<br>

> > > }<br>

> > > }<br>

> > ><br>

> > ><br>

> > > will be unrolled to:<br>

> > ><br>

> > ><br>

> > > gep a, 0, tid.x, tid.y; load<br>

> > > gep a, 0, tid.x, tid.y + 1; load<br>

> > > gep a, 0, tid.x + 1, tid.y; load<br>

> > > gep a, 0, tid.x + 1, tid.y + 1; load<br>

> > ><br>

> > ><br>

> > > The NVPTX backend currently doesn't handle many similar<br>

> > > multi-dimensional GEPs<br>

> > > well enough. It emits PTX code that literally computes the<br>

> > > pointer<br>

> > > address of<br>

> > > each GEP, wasting tons of registers. e.g., it emits the following<br>

> > > PTX<br>

> > > for the<br>

> > > first load and similar PTX for other loads.<br>

> > ><br>

> > ><br>

> > > mov.u32 %r1, %tid.x;<br>

> > > mov.u32 %r2, %tid.y;<br>

> > > mul.wide.u32 %rl2, %r1, 128;<br>

> > > mov.u64 %rl3, a;<br>

> > > add.s64 %rl4, %rl3, %rl2;<br>

> > > mul.wide.u32 %rl5, %r2, 4;<br>

> > > add.s64 %rl6, %rl4, %rl5;<br>

> > > ld.shared.f32 %f1, [%rl6];<br>

> > ><br>

> > ><br>

> > > The resultant register pressure causes up to 20% slowdown on some<br>

> > > of<br>

> > > our<br>

> > > benchmarks.<br>

> > ><br>

> > ><br>

> > > To reduce register pressure, the optimization implemented in this<br>

> > > patch merges<br>

> > > the common subexpression in a group of GEPs, saving many<br>

> > > registers<br>

> > > used for<br>

> > > pointer arithmetics. It works by splitting each GEP into a<br>

> > > variadic<br>

> > > base and a<br>

> > > constant offset. The variadic base can be computed once and<br>

> > > reused<br>

> > > by<br>

> > > multiple<br>

> > > GEPs, and the constant offsets can be nicely folded into NVPTX's<br>

> > > base+offset<br>

> > > addressing mode without using any extra register. e.g., we<br>

> > > transform<br>

> > > the four<br>

> > > GEPs and four loads in the above example conceptually into:<br>

> > ><br>

> > ><br>

> > > base = gep a, 0, x, y<br>

> > > load base<br>

> > > laod base + 1 * sizeof(float)<br>

> > > load base + 32 * sizeof(float)<br>

> > > load base + 33 * sizeof(float)<br>

> > ><br>

> > ><br>

> > > The resultant PTX code will look like:<br>

> > ><br>

> > ><br>

> > > mov.u32 %r1, %tid.x;<br>

> > > mov.u32 %r2, %tid.y;<br>

> > > mul.wide.u32 %rl2, %r1, 128;<br>

> > > mov.u64 %rl3, a;<br>

> > > add.s64 %rl4, %rl3, %rl2;<br>

> > > mul.wide.u32 %rl5, %r2, 4;<br>

> > > add.s64 %rl6, %rl4, %rl5;<br>

> > > ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX<br>

> > > ld.shared.f32 %f2, [%rl6+4]; // much better<br>

> > > ld.shared.f32 %f3, [%rl6+128]; // much better<br>

> > > ld.shared.f32 %f4, [%rl6+132]; // much better<br>

> > ><br>

> > ><br>

> > > which uses much fewer registers than the unoptimized PTX.<br>

> > ><br>

> > ><br>

> > > I am attaching a proof-of-concept patch. It fully implements our<br>

> > > idea<br>

> > > and contains a contrived test case to demonstrate how it works.<br>

> > > It<br>

> > > also discusses why our implementation is safe in terms that the<br>

> > > optimization won't cause new undefined behavior. There's more<br>

> > > work<br>

> > > that needs to be done, e.g., adding more tests. If this idea<br>

> > > sounds<br>

> > > good to you, we will improve the patch and send it out for code<br>

> > > review.<br>

> > ><br>

> > ><br>

> > ><br>

> > > Thanks,<br>

> > > Jingyue<br>

> > > _______________________________________________<br>

> > > LLVM Developers mailing list<br>

> > > <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a> <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

> > > <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

> > ><br>

> ><br>

> > --<br>

> > Hal Finkel<br>

> > Assistant Computational Scientist<br>

> > Leadership Computing Facility<br>

> > Argonne National Laboratory<br>

> ><br>

> ><br>

><br>

> --<br>

> Hal Finkel<br>

> Assistant Computational Scientist<br>

> Leadership Computing Facility<br>

> Argonne National Laboratory<br>

> _______________________________________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

><br>

<br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</div></div></blockquote></div><br></div>