[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs

Jingyue Wu jingyue at google.com
Sat Apr 19 09:17:18 PDT 2014


Not yet, but I am more than happy to try. Do you have any good benchmarks in mind?
On Apr 19, 2014 7:34 AM, "Justin Holewinski" <justin.holewinski at gmail.com>
wrote:

> This looks great!  I'm a bit surprised the existing IR optimizers do not
> handle this.  I agree with Hal that this should be made target-independent.
>  I don't see anything here that would be specific to NVPTX.  Do you have
> any performance data for open-source benchmarks?
>
>
> On Sat, Apr 19, 2014 at 9:38 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>> Jingyue,
>>
>> I can't speak for the NVPTX backend, but I think this looks useful as an
>> (optional) target-independent pass. A few thoughts:
>>
>>  - Running GVN tends to be pretty expensive; have you tried EarlyCSE
>> instead? (When I was working on the BB vectorizer, I first used GVN for
>> cleanup afterward; someone suggested trying EarlyCSE instead. The
>> performance slowdown was a bit less than 1% on average, but the impact
>> on the optimizer's running time was much smaller.)
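
As a concrete illustration of this suggestion, here is a minimal sketch of
swapping the cleanup pass with the legacy pass manager. This is not the
attached patch: createSeparateConstOffsetFromGEPPass is a hypothetical
placeholder name for the pass discussed in this thread, and header
locations differ between LLVM versions.

    #include "llvm/IR/LegacyPassManager.h"  // "llvm/PassManager.h" in older trees
    #include "llvm/Transforms/Scalar.h"
    using namespace llvm;

    // Schedule the GEP-splitting pass followed by a CSE cleanup. EarlyCSE
    // is much cheaper to run than GVN, at the cost of being a simpler,
    // dominator-scoped CSE (no PRE, no elimination of redundancies across
    // control-flow merges).
    void addGEPCSECleanup(legacy::PassManagerBase &PM, bool UseGVN) {
      // PM.add(createSeparateConstOffsetFromGEPPass());  // hypothetical name
      PM.add(UseGVN ? createGVNPass() : createEarlyCSEPass());
    }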
>>
>>  - Are you doing this on the IR level, instead of in DAGCombine, because
>> you want the cross-block combining from GVN? Or some other reason (or both)?
>>
>>  - To make this target independent, I think you just need to insert some
>> calls to TLI.isLegalAddressingMode (or equivalently,
>> TTI.isLegalAddressingMode) to make sure that the offsets you're creating
>> are legal on the target. This will essentially be a no-op for NVPTX, but
>> will matter for other targets.
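
A rough sketch of that legality check, assuming the reg+constant-offset
form this transformation produces. The exact isLegalAddressingMode
signature varies across LLVM versions (newer ones also take an address
space and the memory instruction), so treat this as illustrative:

    #include "llvm/Analysis/TargetTransformInfo.h"
    #include "llvm/IR/Type.h"
    using namespace llvm;

    // Return true if the target can fold ByteOffset into the addressing
    // mode of a memory access of type AccessTy, i.e. [reg + ByteOffset]
    // is a legal address.
    static bool offsetFitsAddressingMode(const TargetTransformInfo &TTI,
                                         Type *AccessTy, int64_t ByteOffset) {
      return TTI.isLegalAddressingMode(AccessTy,
                                       /*BaseGV=*/nullptr,
                                       /*BaseOffset=*/ByteOffset,
                                       /*HasBaseReg=*/true,
                                       /*Scale=*/0);
    }

Since PTX ld/st accept a [reg+imm] operand, this check succeeds for the
offsets produced here on NVPTX, matching the note above that it would be
essentially a no-op there while still guarding other targets.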
>>
>> Thanks for posting this,
>> Hal
>>
>> ----- Original Message -----
>> > From: "Jingyue Wu" <jingyue at google.com>
>> > To: llvmdev at cs.uiuc.edu, "Eli Bendersky" <eliben at google.com>,
>> >     "Justin Holewinski" <jholewinski at nvidia.com>,
>> >     "Justin Holewinski" <justin.holewinski at gmail.com>
>> > Sent: Saturday, April 19, 2014 12:02:28 AM
>> > Subject: [LLVMdev] [NVPTX] Eliminate common sub-expressions in a group
>> >     of similar GEPs
>> >
>> >
>> > Hi,
>> >
>> >
>> > We wrote an optimization that eliminates common sub-expressions in a
>> > group of similar GEPs for the NVPTX backend. It speeds up some of
>> > our benchmarks by up to 20%, which convinced us to try to upstream
>> > it. Here's a brief description of why we wrote this optimization,
>> > what we did, and how we did it.
>> >
>> >
>> > Loops in CUDA programs are often extensively unrolled by programmers
>> > and compilers, leading to many similar
>> > GEPs for array accesses.
>> >
>> >
>> > e.g., a 2-level loop like
>> >
>> >
>> > __shared__ float a[32][32];
>> > #pragma unroll
>> > for (int i = 0; i < 2; ++i) {
>> >   #pragma unroll
>> >   for (int j = 0; j < 2; ++j) {
>> >     ...
>> >     ... = a[threadIdx.x + i][threadIdx.y + j];
>> >     ...
>> >   }
>> > }
>> >
>> >
>> > will be unrolled to:
>> >
>> >
>> > gep a, 0, tid.x, tid.y; load
>> > gep a, 0, tid.x, tid.y + 1; load
>> > gep a, 0, tid.x + 1, tid.y; load
>> > gep a, 0, tid.x + 1, tid.y + 1; load
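
For concreteness (illustrative arithmetic only, not part of the patch):
with float a[32][32], the four addresses above differ from the first one
only by compile-time-constant byte offsets, which is exactly what the
optimization described below exploits.

    #include <cstdio>

    // Byte offset of a[x + i][y + j] relative to a[x][y] for float a[32][32]:
    // each row is 32 floats = 128 bytes, each column step is sizeof(float) = 4.
    constexpr long offsetBytes(int i, int j) {
      return (i * 32 + j) * static_cast<long>(sizeof(float));
    }

    int main() {
      for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
          std::printf("(%d,%d) -> %ld bytes\n", i, j, offsetBytes(i, j));
      // Prints 0, 4, 128, 132 -- the same constants that appear in the
      // optimized PTX further down.
      return 0;
    }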
>> >
>> >
>> > The NVPTX backend currently doesn't handle such groups of similar
>> > multi-dimensional GEPs well. It emits PTX code that literally computes
>> > the pointer address of each GEP, wasting many registers. For example,
>> > it emits the following PTX for the first load, and similar PTX for the
>> > other loads.
>> >
>> >
>> > mov.u32 %r1, %tid.x;
>> > mov.u32 %r2, %tid.y;
>> > mul.wide.u32 %rl2, %r1, 128;
>> > mov.u64 %rl3, a;
>> > add.s64 %rl4, %rl3, %rl2;
>> > mul.wide.u32 %rl5, %r2, 4;
>> > add.s64 %rl6, %rl4, %rl5;
>> > ld.shared.f32 %f1, [%rl6];
>> >
>> >
>> > The resultant register pressure causes up to a 20% slowdown on some
>> > of our benchmarks.
>> >
>> >
>> > To reduce register pressure, the optimization implemented in this
>> > patch merges the common subexpressions in a group of GEPs, saving
>> > many registers used for pointer arithmetic. It works by splitting
>> > each GEP into a variadic base and a constant offset. The variadic
>> > base can be computed once and reused by multiple GEPs, and the
>> > constant offsets can be nicely folded into NVPTX's base+offset
>> > addressing mode without using any extra registers. For example, we
>> > transform the four GEPs and four loads in the above example
>> > conceptually into:
>> >
>> >
>> > base = gep a, 0, x, y
>> > load base
>> > load base + 1 * sizeof(float)
>> > load base + 32 * sizeof(float)
>> > load base + 33 * sizeof(float)
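
The core splitting step can be sketched as follows. This is illustrative
code written against roughly LLVM 3.4-era APIs (it mirrors the loop in
GEPOperator::accumulateConstantOffset, but skips non-constant indices
instead of giving up on them); it is not the attached patch, and it only
shows how the constant part of a GEP's indices folds into a byte offset.

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/DataLayout.h"
    #include "llvm/IR/GetElementPtrTypeIterator.h"  // llvm/Support/... in older trees
    #include "llvm/IR/Instructions.h"
    using namespace llvm;

    // Sum the byte offset contributed by the constant indices of GEP. The
    // remaining (variadic) indices form the base that similar GEPs share;
    // the returned constant can be folded into the load/store's
    // base+offset addressing mode.
    static int64_t accumulateConstantIndices(GetElementPtrInst *GEP,
                                             const DataLayout &DL) {
      int64_t Offset = 0;
      for (gep_type_iterator GTI = gep_type_begin(GEP), E = gep_type_end(GEP);
           GTI != E; ++GTI) {
        ConstantInt *CI = dyn_cast<ConstantInt>(GTI.getOperand());
        if (!CI || CI->isZero())
          continue;
        if (StructType *STy = dyn_cast<StructType>(*GTI)) {
          // Struct field: add the field's offset from the struct layout.
          Offset += DL.getStructLayout(STy)->getElementOffset(CI->getZExtValue());
        } else {
          // Array/pointer index: index times the element's allocation size.
          Offset += CI->getSExtValue() *
                    (int64_t)DL.getTypeAllocSize(GTI.getIndexedType());
        }
      }
      return Offset;
    }

Note that in the a[32][32] example the trailing indices are sums such as
tid.y + 1, so the actual patch evidently also extracts constants from the
index expressions themselves; the sketch above does not attempt that.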
>> >
>> >
>> > The resultant PTX code will look like:
>> >
>> >
>> > mov.u32 %r1, %tid.x;
>> > mov.u32 %r2, %tid.y;
>> > mul.wide.u32 %rl2, %r1, 128;
>> > mov.u64 %rl3, a;
>> > add.s64 %rl4, %rl3, %rl2;
>> > mul.wide.u32 %rl5, %r2, 4;
>> > add.s64 %rl6, %rl4, %rl5;
>> > ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX
>> > ld.shared.f32 %f2, [%rl6+4]; // much better
>> > ld.shared.f32 %f3, [%rl6+128]; // much better
>> > ld.shared.f32 %f4, [%rl6+132]; // much better
>> >
>> >
>> > which uses far fewer registers than the unoptimized PTX.
>> >
>> >
>> > I am attaching a proof-of-concept patch. It fully implements our idea
>> > and contains a contrived test case to demonstrate how it works. It
>> > also explains why our implementation is safe, in the sense that the
>> > optimization won't introduce new undefined behavior. There's more work
>> > to be done, e.g., adding more tests. If this idea sounds good to you,
>> > we will improve the patch and send it out for code review.
>> >
>> >
>> >
>> > Thanks,
>> > Jingyue
>> > _______________________________________________
>> > LLVM Developers mailing list
>> > LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> > http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> >
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>
>
>
> --
>
> Thanks,
>
> Justin Holewinski
>