<div dir="ltr">This looks great!  I'm a bit surprised the existing IR optimizers do not handle this.  I agree with Hal that this should be made target-independent.  I don't see anything here that would be specific to NVPTX.  Do you have any performance data for open-source benchmarks?</div>

<div class="gmail_extra"><br><br><div class="gmail_quote">On Sat, Apr 19, 2014 at 9:38 AM, Hal Finkel <span dir="ltr"><<a href="mailto:hfinkel@anl.gov" target="_blank">hfinkel@anl.gov</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Jingyue,<br>

<br>

I can't speak for the NVPTX backend, but I think this looks useful as an (optional) target-independent pass. A few thoughts:<br>

<br>

 - Running GVN tends to be pretty expensive; have you tried EarlyCSE instead? (When I was working on the BB vectorizer I was first using GVN for cleanup afterward, someone suggested trying EarlyCSE instead, the performance slowdown was a bit less than 1% on average, but the runtime impact was much less).<br>


<br>

 - Are you doing this on the IR level, instead of in DAGCombine, because you want the cross-block combining from GVN? Or some other reason (or both)?<br>

<br>

 - To make this target independent, I think you just need to insert some calls to TLI.isLegalAddressingMode (or equivalently, TTI.isLegalAddressingMode) just to make sure that the offsets you're creating are legal on the target. This will essentially be a noop for NVPTX, but will matter for other targets.<br>


<br>

Thanks for posting this,<br>

Hal<br>

<div><div class="h5"><br>

----- Original Message -----<br>

> From: "Jingyue Wu" <<a href="mailto:jingyue@google.com">jingyue@google.com</a>><br>

> To: <a href="mailto:llvmdev@cs.uiuc.edu">llvmdev@cs.uiuc.edu</a>, "Eli Bendersky" <<a href="mailto:eliben@google.com">eliben@google.com</a>>, "Justin Holewinski" <<a href="mailto:jholewinski@nvidia.com">jholewinski@nvidia.com</a>>, "Justin<br>


> Holewinski" <<a href="mailto:justin.holewinski@gmail.com">justin.holewinski@gmail.com</a>><br>

> Sent: Saturday, April 19, 2014 12:02:28 AM<br>

> Subject: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of     similar GEPs<br>

><br>

><br>

><br>

><br>

> Hi,<br>

><br>

><br>

> We wrote an optimization that eliminates common sub-expressions in a<br>

> group of similar GEPs for the NVPTX backend. It speeds up some of<br>

> our benchmarks by up to 20%, which convinces us to try to upstream<br>

> it. Here's a brief description of why we wrote this optimization,<br>

> what we did, and how we did it.<br>

><br>

><br>

> Loops in CUDA programs are often extensively unrolled by programmers<br>

> and compilers, leading to many similar<br>

> GEPs for array accesses.<br>

><br>

><br>

> e.g., a 2-level loop like<br>

><br>

><br>

> __shared__ float a[32][32];<br>

> unroll for (int i = 0; i < 2; ++i) {<br>

> unroll for (int j = 0; j < 2; ++j) {<br>

> ...<br>

> ... = a[threadIdx.x + i][threadIdx.y + j];<br>

> ...<br>

> }<br>

> }<br>

><br>

><br>

> will be unrolled to:<br>

><br>

><br>

> gep a, 0, tid.x, tid.y; load<br>

> gep a, 0, tid.x, tid.y + 1; load<br>

> gep a, 0, tid.x + 1, tid.y; load<br>

> gep a, 0, tid.x + 1, tid.y + 1; load<br>

><br>

><br>

> The NVPTX backend currently doesn't handle many similar<br>

> multi-dimensional GEPs<br>

> well enough. It emits PTX code that literally computes the pointer<br>

> address of<br>

> each GEP, wasting tons of registers. e.g., it emits the following PTX<br>

> for the<br>

> first load and similar PTX for other loads.<br>

><br>

><br>

> mov.u32 %r1, %tid.x;<br>

> mov.u32 %r2, %tid.y;<br>

> mul.wide.u32 %rl2, %r1, 128;<br>

> mov.u64 %rl3, a;<br>

> add.s64 %rl4, %rl3, %rl2;<br>

> mul.wide.u32 %rl5, %r2, 4;<br>

> add.s64 %rl6, %rl4, %rl5;<br>

> ld.shared.f32 %f1, [%rl6];<br>

><br>

><br>

> The resultant register pressure causes up to 20% slowdown on some of<br>

> our<br>

> benchmarks.<br>

><br>

><br>

> To reduce register pressure, the optimization implemented in this<br>

> patch merges<br>

> the common subexpression in a group of GEPs, saving many registers<br>

> used for<br>

> pointer arithmetics. It works by splitting each GEP into a variadic<br>

> base and a<br>

> constant offset. The variadic base can be computed once and reused by<br>

> multiple<br>

> GEPs, and the constant offsets can be nicely folded into NVPTX's<br>

> base+offset<br>

> addressing mode without using any extra register. e.g., we transform<br>

> the four<br>

> GEPs and four loads in the above example conceptually into:<br>

><br>

><br>

> base = gep a, 0, x, y<br>

> load base<br>

> laod base + 1 * sizeof(float)<br>

> load base + 32 * sizeof(float)<br>

> load base + 33 * sizeof(float)<br>

><br>

><br>

> The resultant PTX code will look like:<br>

><br>

><br>

> mov.u32 %r1, %tid.x;<br>

> mov.u32 %r2, %tid.y;<br>

> mul.wide.u32 %rl2, %r1, 128;<br>

> mov.u64 %rl3, a;<br>

> add.s64 %rl4, %rl3, %rl2;<br>

> mul.wide.u32 %rl5, %r2, 4;<br>

> add.s64 %rl6, %rl4, %rl5;<br>

> ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX<br>

> ld.shared.f32 %f2, [%rl6+4]; // much better<br>

> ld.shared.f32 %f3, [%rl6+128]; // much better<br>

> ld.shared.f32 %f4, [%rl6+132]; // much better<br>

><br>

><br>

> which uses much fewer registers than the unoptimized PTX.<br>

><br>

><br>

> I am attaching a proof-of-concept patch. It fully implements our idea<br>

> and contains a contrived test case to demonstrate how it works. It<br>

> also discusses why our implementation is safe in terms that the<br>

> optimization won't cause new undefined behavior. There's more work<br>

> that needs to be done, e.g., adding more tests. If this idea sounds<br>

> good to you, we will improve the patch and send it out for code<br>

> review.<br>

><br>

><br>

><br>

> Thanks,<br>

> Jingyue<br>

</div></div>> _______________________________________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:LLVMdev@cs.uiuc.edu">LLVMdev@cs.uiuc.edu</a>         <a href="http://llvm.cs.uiuc.edu" target="_blank">http://llvm.cs.uiuc.edu</a><br>

> <a href="http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev" target="_blank">http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev</a><br>

><br>

<span class="HOEnZb"><font color="#888888"><br>

--<br>

Hal Finkel<br>

Assistant Computational Scientist<br>

Leadership Computing Facility<br>

Argonne National Laboratory<br>

</font></span></blockquote></div><br><br clear="all"><div><br></div>-- <br><br><div>Thanks,</div><div><br></div><div>Justin Holewinski</div>

</div>