[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Justin Holewinski justin.holewinski at gmail.com
Sat Apr 19 07:34:57 PDT 2014


This looks great!  I'm a bit surprised the existing IR optimizers do not
handle this.  I agree with Hal that this should be made target-independent.
 I don't see anything here that would be specific to NVPTX.  Do you have
any performance data for open-source benchmarks?


On Sat, Apr 19, 2014 at 9:38 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> Jingyue,
>
> I can't speak for the NVPTX backend, but I think this looks useful as an
> (optional) target-independent pass. A few thoughts:
>
>  - Running GVN tends to be pretty expensive; have you tried EarlyCSE
> instead? (When I was working on the BB vectorizer, I initially used GVN for
> cleanup afterward; someone suggested trying EarlyCSE instead. The resulting
> performance slowdown was a bit less than 1% on average, but the compile-time
> impact was much lower.)
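>
> For illustration, a minimal sketch of what that pipeline choice might look
> like with the legacy pass manager of that era (header locations and the
> addCleanupAfterGEPSplit hook are assumptions; only createEarlyCSEPass and
> createGVNPass are actual LLVM APIs):
>
> #include "llvm/PassManager.h"        // legacy pass manager (3.x era)
> #include "llvm/Transforms/Scalar.h"  // createGVNPass(), createEarlyCSEPass()
>
> // Hypothetical hook: schedule the cleanup pass that runs after the new
> // GEP-splitting pass, choosing the cheaper EarlyCSE when it is enough.
> void addCleanupAfterGEPSplit(llvm::PassManagerBase &PM, bool UseEarlyCSE) {
>   if (UseEarlyCSE)
>     PM.add(llvm::createEarlyCSEPass()); // block-local CSE, much cheaper
>   else
>     PM.add(llvm::createGVNPass());      // stronger, but noticeably slower
> }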
>
>  - Are you doing this on the IR level, instead of in DAGCombine, because
> you want the cross-block combining from GVN? Or some other reason (or both)?
>
>  - To make this target-independent, I think you just need to insert some
> calls to TLI.isLegalAddressingMode (or, equivalently,
> TTI.isLegalAddressingMode) to make sure that the offsets you're creating
> are legal on the target. This will essentially be a no-op for NVPTX, but
> will matter for other targets.
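>
> As a minimal sketch of such a check (the exact isLegalAddressingMode
> signature varies a bit between LLVM versions, and the helper name and
> parameter choices here are just assumptions):
>
> #include "llvm/Analysis/TargetTransformInfo.h"
>
> // Reject a base+offset split if the target cannot fold the constant
> // offset into a reg+imm addressing mode for this access type.
> static bool offsetFitsAddressingMode(const llvm::TargetTransformInfo &TTI,
>                                      llvm::Type *AccessTy,
>                                      int64_t ConstOffset) {
>   // BaseGV = nullptr, HasBaseReg = true, Scale = 0: plain base+offset.
>   return TTI.isLegalAddressingMode(AccessTy, /*BaseGV=*/nullptr,
>                                    /*BaseOffset=*/ConstOffset,
>                                    /*HasBaseReg=*/true, /*Scale=*/0);
> }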
>
> Thanks for posting this,
> Hal
>
> ----- Original Message -----
> > From: "Jingyue Wu" <jingyue at google.com>
> > To: llvmdev at cs.uiuc.edu, "Eli Bendersky" <eliben at google.com>,
> > "Justin Holewinski" <jholewinski at nvidia.com>,
> > "Justin Holewinski" <justin.holewinski at gmail.com>
> > Sent: Saturday, April 19, 2014 12:02:28 AM
> > Subject: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of
> > similar GEPs
> >
> >
> >
> >
> > Hi,
> >
> >
> > We wrote an optimization that eliminates common sub-expressions in a
> > group of similar GEPs for the NVPTX backend. It speeds up some of our
> > benchmarks by up to 20%, which convinced us to try to upstream it.
> > Here's a brief description of why we wrote this optimization, what we
> > did, and how we did it.
> >
> >
> > Loops in CUDA programs are often extensively unrolled by programmers and
> > compilers, leading to many similar GEPs for array accesses.
> >
> >
> > e.g., a 2-level loop like
> >
> >
> > __shared__ float a[32][32];
> > #pragma unroll
> > for (int i = 0; i < 2; ++i) {
> >   #pragma unroll
> >   for (int j = 0; j < 2; ++j) {
> >     ...
> >     ... = a[threadIdx.x + i][threadIdx.y + j];
> >     ...
> >   }
> > }
> >
> >
> > will be unrolled to:
> >
> >
> > gep a, 0, tid.x, tid.y; load
> > gep a, 0, tid.x, tid.y + 1; load
> > gep a, 0, tid.x + 1, tid.y; load
> > gep a, 0, tid.x + 1, tid.y + 1; load
> >
> >
> > The NVPTX backend currently doesn't handle many similar multi-dimensional
> > GEPs well enough. It emits PTX code that literally computes the pointer
> > address of each GEP, wasting tons of registers. For example, it emits the
> > following PTX for the first load, and similar PTX for the other loads.
> >
> >
> > mov.u32 %r1, %tid.x;
> > mov.u32 %r2, %tid.y;
> > mul.wide.u32 %rl2, %r1, 128;
> > mov.u64 %rl3, a;
> > add.s64 %rl4, %rl3, %rl2;
> > mul.wide.u32 %rl5, %r2, 4;
> > add.s64 %rl6, %rl4, %rl5;
> > ld.shared.f32 %f1, [%rl6];
> >
> >
> > The resultant register pressure causes up to a 20% slowdown on some of
> > our benchmarks.
> >
> >
> > To reduce register pressure, the optimization implemented in this patch
> > merges the common sub-expressions in a group of GEPs, saving many
> > registers used for pointer arithmetic. It works by splitting each GEP
> > into a variadic base and a constant offset. The variadic base can be
> > computed once and reused by multiple GEPs, and the constant offsets can
> > be nicely folded into NVPTX's base+offset addressing mode without using
> > any extra registers. For example, we transform the four GEPs and four
> > loads in the above example conceptually into:
> >
> >
> > base = gep a, 0, tid.x, tid.y
> > load base
> > load base + 1 * sizeof(float)
> > load base + 32 * sizeof(float)
> > load base + 33 * sizeof(float)
> >
> >
> > The resultant PTX code will look like:
> >
> >
> > mov.u32 %r1, %tid.x;
> > mov.u32 %r2, %tid.y;
> > mul.wide.u32 %rl2, %r1, 128;
> > mov.u64 %rl3, a;
> > add.s64 %rl4, %rl3, %rl2;
> > mul.wide.u32 %rl5, %r2, 4;
> > add.s64 %rl6, %rl4, %rl5;
> > ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX
> > ld.shared.f32 %f2, [%rl6+4]; // much better
> > ld.shared.f32 %f3, [%rl6+128]; // much better
> > ld.shared.f32 %f4, [%rl6+132]; // much better
> >
> >
> > which uses far fewer registers than the unoptimized PTX.
> >
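> > In LLVM IR terms, the splitting step itself might look roughly like the
> > sketch below. This is only an illustration of the idea rather than the
> > actual implementation: it handles only a trailing index of the form
> > (var + C), the helper name splitLastIndex is made up, the API calls
> > follow more recent LLVM interfaces (the 3.4-era ones differ slightly),
> > and all legality checks are skipped.
> >
> > #include "llvm/ADT/SmallVector.h"
> > #include "llvm/IR/DataLayout.h"
> > #include "llvm/IR/IRBuilder.h"
> > #include "llvm/IR/Instructions.h"
> > #include "llvm/IR/PatternMatch.h"
> >
> > using namespace llvm;
> > using namespace llvm::PatternMatch;
> >
> > // If the last index of GEP is "add %var, C", rebuild the GEP with the
> > // constant stripped and re-apply C as a trailing byte offset, which the
> > // backend can fold into its base+offset addressing mode.
> > static Value *splitLastIndex(GetElementPtrInst *GEP, const DataLayout &DL) {
> >   Value *Idx = GEP->getOperand(GEP->getNumIndices()); // last index operand
> >   Value *Var;
> >   ConstantInt *C;
> >   if (!match(Idx, m_Add(m_Value(Var), m_ConstantInt(C))))
> >     return nullptr;
> >
> >   IRBuilder<> B(GEP);
> >
> >   // The variadic base: the same GEP with the constant removed.
> >   SmallVector<Value *, 4> Idxs(GEP->idx_begin(), GEP->idx_end());
> >   Idxs.back() = Var;
> >   Value *Base = B.CreateGEP(GEP->getSourceElementType(),
> >                             GEP->getPointerOperand(), Idxs, "split.base");
> >
> >   // The last index strides over the GEP's result element type, so the
> >   // constant contributes C * sizeof(result element) bytes.
> >   uint64_t EltSize = DL.getTypeAllocSize(GEP->getResultElementType());
> >   int64_t Off = C->getSExtValue() * (int64_t)EltSize;
> >   Value *I8Base =
> >       B.CreateBitCast(Base, B.getInt8PtrTy(GEP->getAddressSpace()));
> >   Value *NewPtr = B.CreateConstGEP1_64(B.getInt8Ty(), I8Base, Off, "split.off");
> >   return B.CreateBitCast(NewPtr, GEP->getType());
> > }
> >
> > A later CSE pass (GVN or EarlyCSE, as discussed above) can then
> > deduplicate the identical base computations across the group of loads.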
> >
> > I am attaching a proof-of-concept patch. It fully implements our idea
> > and contains a contrived test case to demonstrate how it works. It also
> > discusses why our implementation is safe, i.e., why the optimization
> > won't introduce new undefined behavior. There's more work that needs to
> > be done, e.g., adding more tests. If this idea sounds good to you, we
> > will improve the patch and send it out for code review.
> >
> >
> >
> > Thanks,
> > Jingyue
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>



-- 

Thanks,

Justin Holewinski