[LLVMdev] [NVPTX] Eliminate common sub-expressions in a group of similar GEPs
Jingyue Wu
jingyue at google.com
Fri Apr 18 22:02:28 PDT 2014
Hi,
We wrote an optimization that eliminates common sub-expressions in a group of similar GEPs for the NVPTX backend. It speeds up some of our benchmarks by up to 20%, which convinced us to try to upstream it. Here's a brief description of why we wrote this optimization, what we did, and how we did it.
Loops in CUDA programs are often extensively unrolled by programmers and compilers, leading to many similar GEPs for array accesses. e.g., a doubly nested loop like
__shared__ float a[32][32];
#pragma unroll
for (int i = 0; i < 2; ++i) {
  #pragma unroll
  for (int j = 0; j < 2; ++j) {
    ...
    ... = a[threadIdx.x + i][threadIdx.y + j];
    ...
  }
}
will be unrolled to:
gep a, 0, tid.x, tid.y; load
gep a, 0, tid.x, tid.y + 1; load
gep a, 0, tid.x + 1, tid.y; load
gep a, 0, tid.x + 1, tid.y + 1; load
The NVPTX backend currently doesn't handle many similar multi-dimensional GEPs well enough. It emits PTX code that literally computes the pointer address of each GEP, wasting tons of registers. e.g., it emits the following PTX for the first load, and similar PTX for the other loads:
mov.u32 %r1, %tid.x;
mov.u32 %r2, %tid.y;
mul.wide.u32 %rl2, %r1, 128;
mov.u64 %rl3, a;
add.s64 %rl4, %rl3, %rl2;
mul.wide.u32 %rl5, %r2, 4;
add.s64 %rl6, %rl4, %rl5;
ld.shared.f32 %f1, [%rl6];
The resultant register pressure causes up to a 20% slowdown on some of our benchmarks.
To reduce register pressure, the optimization implemented in this patch merges the common sub-expressions in a group of GEPs, saving many registers used for pointer arithmetic. It works by splitting each GEP into a variadic base and a constant offset. The variadic base can be computed once and reused by multiple GEPs, and the constant offsets can be nicely folded into NVPTX's base+offset addressing mode without using any extra registers. e.g., we conceptually transform the four GEPs and four loads in the above example into:
base = gep a, 0, tid.x, tid.y
load base
load base + 1 * sizeof(float)
load base + 32 * sizeof(float)
load base + 33 * sizeof(float)
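To make the splitting step concrete, here is a small Python sketch (purely illustrative; it is not code from the patch, which operates on LLVM IR). It models each GEP index as a (variable, constant-addend) pair and separates the shared variadic base from the per-access constant byte offset:

```python
def split_gep(indices, strides):
    """Split a GEP into a variadic base and a constant byte offset.

    indices: list of (var, const) pairs, e.g. ("tid.x", 1) models tid.x + 1
    strides: byte stride of each dimension, e.g. [128, 4] for float a[32][32]
    Returns (base_indices, const_offset_bytes).
    """
    base = [var for var, _ in indices]  # variadic part: computed once, reused
    offset = sum(c * s for (_, c), s in zip(indices, strides))
    return base, offset

# The four unrolled accesses to the shared array a[32][32] of floats:
strides = [32 * 4, 4]  # row stride = 32 floats = 128 bytes; element = 4 bytes
accesses = [
    [("tid.x", 0), ("tid.y", 0)],
    [("tid.x", 0), ("tid.y", 1)],
    [("tid.x", 1), ("tid.y", 0)],
    [("tid.x", 1), ("tid.y", 1)],
]
for idx in accesses:
    base, off = split_gep(idx, strides)
    print(base, off)  # all four share base ['tid.x', 'tid.y']; offsets 0, 4, 128, 132
```

The constant offsets 0, 4, 128, and 132 are exactly the immediates that fold into the base+offset addressing mode below.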
The resultant PTX code will look like:
mov.u32 %r1, %tid.x;
mov.u32 %r2, %tid.y;
mul.wide.u32 %rl2, %r1, 128;
mov.u64 %rl3, a;
add.s64 %rl4, %rl3, %rl2;
mul.wide.u32 %rl5, %r2, 4;
add.s64 %rl6, %rl4, %rl5;
ld.shared.f32 %f1, [%rl6]; // so far the same as unoptimized PTX
ld.shared.f32 %f2, [%rl6+4];   // much better
ld.shared.f32 %f3, [%rl6+128]; // much better
ld.shared.f32 %f4, [%rl6+132]; // much better
which uses far fewer registers than the unoptimized PTX.
I am attaching a proof-of-concept patch. It fully implements our idea and contains a contrived test case that demonstrates how it works. It also discusses why our implementation is safe, in the sense that the optimization won't introduce new undefined behavior. More work needs to be done, e.g., adding more tests. If this idea sounds good to you, we will improve the patch and send it out for code review.
Thanks,
Jingyue
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gep-cse.patch
Type: application/octet-stream
Size: 25103 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140418/22d60889/attachment.obj>