[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu jingyue at google.com
Tue Apr 22 17:12:11 PDT 2014


No, I don't observe any speedup.


On Sun, Apr 20, 2014 at 9:02 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
> > From: "Hal Finkel" <hfinkel at anl.gov>
> > To: "Jingyue Wu" <jingyue at google.com>
> > Cc: "Justin Holewinski" <jholewinski at nvidia.com>, llvmdev at cs.uiuc.edu
> > Sent: Sunday, April 20, 2014 10:01:49 PM
> > Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a
> group of similar GEPs
> >
> > ----- Original Message -----
> > > From: "Jingyue Wu" <jingyue at google.com>
> > > To: "Hal Finkel" <hfinkel at anl.gov>
> > > Cc: llvmdev at cs.uiuc.edu, "Eli Bendersky" <eliben at google.com>,
> > > "Justin Holewinski" <jholewinski at nvidia.com>, "Justin
> > > Holewinski" <justin.holewinski at gmail.com>
> > > Sent: Sunday, April 20, 2014 9:34:57 PM
> > > Subject: Re: [LLVMdev] [NVPTX] Eliminate common sub-expressions in
> > > a group of similar GEPs
> > >
> > >
> > > Hi Hal,
> > >
> > >
> > > Thanks for your comments! I'm inlining my responses below.
> > >
> > >
> > > Jingyue
> > >
> > >
> > >
> > > On Sat, Apr 19, 2014 at 6:38 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> > >
> > >
> > > Jingyue,
> > >
> > > I can't speak for the NVPTX backend, but I think this looks useful as
> > > an (optional) target-independent pass. A few thoughts:
> > >
> > > - Running GVN tends to be pretty expensive; have you tried EarlyCSE
> > > instead? (When I was working on the BB vectorizer, I first used GVN for
> > > cleanup afterward; someone suggested trying EarlyCSE instead, and the
> > > performance slowdown was a bit less than 1% on average, but the runtime
> > > impact was much less.)
> > >
> > >
> > >
> > > EarlyCSE surprisingly generates code that runs 10% slower than GVN on
> > > one of our benchmarks. However, after looking into the .ll and the PTX,
> > > we probably shouldn't blame EarlyCSE for the slowdown. The PTX generated
> > > using EarlyCSE is only slightly different from that generated by GVN,
> > > but ptxas amplifies the seemingly harmless difference into much higher
> > > register usage. We need to investigate this issue further.
> > >
> > >
> > > I can change the code to use EarlyCSE by default, and leave a flag to
> > > run GVN.
> >
> > I think it is also perfectly reasonable (and perhaps even makes
> > sense) to choose GVN for CodeGenOpt::Aggressive and EarlyCSE
> > otherwise.
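> >
> > Roughly, as a sketch (the surrounding pass-manager setup here is
> > hypothetical; createGVNPass and createEarlyCSEPass are the usual helpers
> > from llvm/Transforms/Scalar.h):
> >
> >   #include "llvm/Transforms/Scalar.h"  // createGVNPass, createEarlyCSEPass
> >
> >   // Hypothetical: pick the cleanup pass by codegen optimization level.
> >   if (OptLevel == CodeGenOpt::Aggressive)
> >     PM.add(createGVNPass());        // stronger, but more expensive to run
> >   else
> >     PM.add(createEarlyCSEPass());   // cheaper cleanup at other levels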
>
> Also, do you see any substantial speedups just from running a late GVN (or
> EarlyCSE) independent of the GEP optimization?
>
>  -Hal
>
> >
> > >
> > >
> > >
> > > - Are you doing this on the IR level, instead of in DAGCombine,
> > > because you want the cross-block combining from GVN? Or some other
> > > reason (or both)?
> > >
> > >
> > >
> > > Cross-block combining from GVN is one of the reasons. The second
> > > reason is that we observed a GEP and its index are sometimes defined
> > > in different BBs.
> >
> > Makes sense.
> >
> > > The third reason is that we also observed a GEP and the load/store
> > > that uses it are sometimes in different BBs. Although CodeGenPrepare
> > > attempts to sink a GEP to the same BB as its load/store user, it
> > > doesn't sink GEPs with two variadic indices (e.g., gep %a, 0, %i, %j)
> > > because it "smartly" recognizes that PTX doesn't support any
> > > addressing mode that can fold this GEP.
> >
> > That makes sense: NVPTX does not support r+r addressing.
> >
> > > (This
> > > addressing mode issue is worth another thread, and one of my
> > > colleagues is working on that).
> >
> > Looking forward to it.
> >
> >  -Hal
> >
> > >
> > >
> > >
> > > - To make this target independent, I think you just need to insert
> > > some calls to TLI.isLegalAddressingMode (or equivalently,
> > > TTI.isLegalAddressingMode) to make sure that the offsets you're
> > > creating are legal on the target. This will essentially be a noop for
> > > NVPTX, but will matter for other targets.
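> > >
> > > Something like the following, presumably (a sketch: AccessTy,
> > > ConstantOffset, and the TTI pointer are assumed to be available wherever
> > > the pass decides to split a GEP):
> > >
> > >   // Hypothetical guard before rewriting a GEP as base + ConstantOffset:
> > >   // only split if "base register + ConstantOffset" is a legal addressing
> > >   // mode for the accessed type (effectively always true on NVPTX).
> > >   if (!TTI->isLegalAddressingMode(AccessTy, /*BaseGV=*/nullptr,
> > >                                   /*BaseOffset=*/ConstantOffset,
> > >                                   /*HasBaseReg=*/true, /*Scale=*/0))
> > >     return false;  // leave this GEP alone; the offset would not fold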
> > >
> > >
> > >
> > > Ack'ed
> > >
> > >
> > >
> > > Thanks for posting this,
> > > Hal
> > >
> > >
> > >
> > > ----- Original Message -----
> > > > From: "Jingyue Wu" < jingyue at google.com >
> > > > To: llvmdev at cs.uiuc.edu , "Eli Bendersky" < eliben at google.com >,
> > > > "Justin Holewinski" < jholewinski at nvidia.com >, "Justin
> > > > Holewinski" < justin.holewinski at gmail.com >
> > > > Sent: Saturday, April 19, 2014 12:02:28 AM
> > > > Subject: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a
> > > > group of similar GEPs
> > > >
> > > >
> > > >
> > > >
> > > > Hi,
> > > >
> > > >
> > > > We wrote an optimization that eliminates common sub-expressions in a
> > > > group of similar GEPs for the NVPTX backend. It speeds up some of our
> > > > benchmarks by up to 20%, which convinced us to try to upstream it.
> > > > Here's a brief description of why we wrote this optimization, what we
> > > > did, and how we did it.
> > > >
> > > >
> > > > Loops in CUDA programs are often extensively unrolled by programmers
> > > > and compilers, leading to many similar GEPs for array accesses.
> > > >
> > > >
> > > > e.g., a 2-level loop like
> > > >
> > > >
> > > > __shared__ float a[32][32];
> > > > #pragma unroll
> > > > for (int i = 0; i < 2; ++i) {
> > > >   #pragma unroll
> > > >   for (int j = 0; j < 2; ++j) {
> > > >     ...
> > > >     ... = a[threadIdx.x + i][threadIdx.y + j];
> > > >     ...
> > > >   }
> > > > }
> > > >
> > > >
> > > > will be unrolled to:
> > > >
> > > >
> > > > gep a, 0, tid.x, tid.y; load
> > > > gep a, 0, tid.x, tid.y + 1; load
> > > > gep a, 0, tid.x + 1, tid.y; load
> > > > gep a, 0, tid.x + 1, tid.y + 1; load
> > > >
> > > >
> > > > The NVPTX backend currently doesn't handle many similar
> > > > multi-dimensional GEPs well enough. It emits PTX code that literally
> > > > computes the pointer address of each GEP, wasting tons of registers.
> > > > E.g., it emits the following PTX for the first load, and similar PTX
> > > > for the other loads.
> > > >
> > > >
> > > > mov.u32 %r1, %tid.x;
> > > > mov.u32 %r2, %tid.y;
> > > > mul.wide.u32 %rl2, %r1, 128;
> > > > mov.u64 %rl3, a;
> > > > add.s64 %rl4, %rl3, %rl2;
> > > > mul.wide.u32 %rl5, %r2, 4;
> > > > add.s64 %rl6, %rl4, %rl5;
> > > > ld.shared.f32 %f1, [%rl6];
> > > >
> > > >
> > > > The resultant register pressure causes up to 20% slowdown on some of
> > > > our benchmarks.
> > > >
> > > >
> > > > To reduce register pressure, the optimization implemented in this
> > > > patch merges the common subexpressions in a group of GEPs, saving many
> > > > registers used for pointer arithmetic. It works by splitting each GEP
> > > > into a variadic base and a constant offset. The variadic base can be
> > > > computed once and reused by multiple GEPs, and the constant offsets
> > > > can be nicely folded into NVPTX's base+offset addressing mode without
> > > > using any extra register. E.g., we transform the four GEPs and four
> > > > loads in the above example conceptually into:
> > > >
> > > >
> > > > base = gep a, 0, x, y
> > > > load base
> > > > load base + 1 * sizeof(float)
> > > > load base + 32 * sizeof(float)
> > > > load base + 33 * sizeof(float)
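> > > >
> > > > For this float a[32][32] array, the folded constant offsets are just
> > > > (delta_i * 32 + delta_j) * sizeof(float) bytes: (0,1) gives 4, (1,0)
> > > > gives 128, and (1,1) gives 132, exactly the immediates that appear in
> > > > the PTX below.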
> > > >
> > > >
> > > > The resultant PTX code will look like:
> > > >
> > > >
> > > > mov.u32 %r1, %tid.x;
> > > > mov.u32 %r2, %tid.y;
> > > > mul.wide.u32 %rl2, %r1, 128;
> > > > mov.u64 %rl3, a;
> > > > add.s64 %rl4, %rl3, %rl2;
> > > > mul.wide.u32 %rl5, %r2, 4;
> > > > add.s64 %rl6, %rl4, %rl5;
> > > > ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX
> > > > ld.shared.f32 %f2, [%rl6+4]; // much better
> > > > ld.shared.f32 %f3, [%rl6+128]; // much better
> > > > ld.shared.f32 %f4, [%rl6+132]; // much better
> > > >
> > > >
> > > > which uses far fewer registers than the unoptimized PTX.
> > > >
> > > >
> > > > I am attaching a proof-of-concept patch. It fully implements our idea
> > > > and contains a contrived test case to demonstrate how it works. It
> > > > also discusses why our implementation is safe, in the sense that the
> > > > optimization won't introduce new undefined behavior. There's more work
> > > > that needs to be done, e.g., adding more tests. If this idea sounds
> > > > good to you, we will improve the patch and send it out for code
> > > > review.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > > Jingyue
> > > >
> > >
> > > --
> > > Hal Finkel
> > > Assistant Computational Scientist
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> > >
> > >
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>