[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs

Jingyue Wu jingyue at google.com
Fri Oct 24 13:02:15 PDT 2014


Hi Justin,

37 registers w/o widening, and 40 w/ widening.

There is some other weirdness in the register allocation: I didn't specify
any upper bound on register usage, but on the version w/ widening ptxas
aggressively rematerializes some arithmetic instructions to use fewer
registers. Nevertheless, it still uses more registers than the version w/o
widening.

Btw, Justin, do you have time to take a look at this (
http://reviews.llvm.org/D5612)? Eli and I think it's OK, but would like you
to confirm.

Jingyue

On Fri Oct 24 2014 at 11:29:33 AM Justin Holewinski <jholewinski at nvidia.com>
wrote:

> On Fri, 24 Oct 2014, Jingyue Wu wrote:
>
> > Hi,
> >
> > I noticed a significant performance regression (up to 40%) on some
> > internal CUDA benchmarks (a reduced example presented below). The root
> > cause of this regression seems to be that IndVarSimplify widens
> > induction variables assuming arithmetic on wider integer types is as
> > cheap as arithmetic on narrower ones. However, this assumption is
> > wrong, at least for the NVPTX64 target.
> >
> > Although the NVPTX64 target supports 64-bit arithmetic, the actual
> > NVIDIA GPU typically has only 32-bit integer registers, so one 64-bit
> > arithmetic operation typically ends up as two machine instructions, one
> > handling the low 32 bits and one handling the high 32 bits. I haven't
> > looked at other GPU targets such as R600, but I suspect this problem is
> > not restricted to the NVPTX64 target.
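To make "two machine instructions taking care of the low 32 bits and the high
32 bits" concrete, here is a minimal sketch in plain C++ of how a 64-bit add
is emulated with 32-bit operations; the IADD R.CC / IADD.X pair in the SASS
further down is the hardware form of the same split:

    #include <cstdint>

    // Illustrative sketch only: one 64-bit add performed as two 32-bit adds,
    // with the carry out of the low half propagated into the high half.
    void add64_via_32(uint32_t a_lo, uint32_t a_hi,
                      uint32_t b_lo, uint32_t b_hi,
                      uint32_t &r_lo, uint32_t &r_hi) {
      uint32_t lo = a_lo + b_lo;   // low 32 bits
      uint32_t carry = lo < a_lo;  // carry out of the low half
      r_lo = lo;
      r_hi = a_hi + b_hi + carry;  // high 32 bits consume the carry
    }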
> >
> > Below is a reduced example:
> > __attribute__((global)) void foo(int n, int *output) {
> >   for (int i = 0; i < n; i += 3) {
> >     output[i] = i * i;
> >   }
> > }
> >
> > Without widening, the loop body in the PTX (a low-level assembly-like
> > language generated by NVPTX64) is:
> >
> > BB0_2:                                  // =>This Inner Loop Header: Depth=1
> >         mul.lo.s32      %r5, %r6, %r6;
> >         st.u32  [%rd4], %r5;
> >         add.s32         %r6, %r6, 3;
> >         add.s64         %rd4, %rd4, 12;
> >         setp.lt.s32     %p2, %r6, %r3;
> >         @%p2 bra        BB0_2;
> >
> > in which %r6 is the induction variable i.
> >
> > With widening, the loop body becomes:
> >
> > BB0_2:                                  // =>This Inner Loop Header: Depth=1
> >         mul.lo.s64      %rd8, %rd10, %rd10;
> >         st.u32  [%rd9], %rd8;
> >         add.s64         %rd10, %rd10, 3;
> >         add.s64         %rd9, %rd9, 12;
> >         setp.lt.s64     %p2, %rd10, %rd1;
> >         @%p2 bra        BB0_2;
> >
> > Although the number of PTX instructions in both versions is the same,
> > the version with widening uses mul.lo.s64, add.s64, and setp.lt.s64
> > instructions, which are more expensive than their 32-bit counterparts.
> > Indeed, the SASS code (the disassembly of the actual machine code
> > running on GPUs) of the version with widening is significantly longer.
> >
> > Without widening (7 instructions):
> >
> > .L_1:
> >         /*0048*/                IMUL R2, R0, R0;
> >         /*0050*/                IADD R0, R0, 0x1;
> >         /*0058*/                ST.E [R4], R2;
> >         /*0060*/                ISETP.NE.AND P0, PT, R0, c[0x0][0x140], PT;
> >         /*0068*/                IADD R4.CC, R4, 0x4;
> >         /*0070*/                IADD.X R5, R5, RZ;
> >         /*0078*/            @P0 BRA `(.L_1);
> >
> > With widening (12 instructions):
> >
> > .L_1:
> >         /*0050*/                IMUL.U32.U32 R6.CC, R4, R4;
> >         /*0058*/                IADD R0, R0, -0x1;
> >         /*0060*/                IMAD.U32.U32.HI.X R8.CC, R4, R4, RZ;
> >         /*0068*/                IMAD.U32.U32.X R8, R5, R4, R8;
> >         /*0070*/                IMAD.U32.U32 R7, R4, R5, R8;
> >         /*0078*/                IADD R4.CC, R4, 0x1;
> >         /*0088*/                ST.E [R2], R6;
> >         /*0090*/                IADD.X R5, R5, RZ;
> >         /*0098*/                ISETP.NE.AND P0, PT, R0, RZ, PT;
> >         /*00a0*/                IADD R2.CC, R2, 0x4;
> >         /*00a8*/                IADD.X R3, R3, RZ;
> >         /*00b0*/            @P0 BRA `(.L_1);
> >
> > I hope the issue is clear up to this point. So what's a good solution
> > to fix this issue? I am thinking of having IndVarSimplify consult
> > TargetTransformInfo about the cost of integer arithmetic on different
> > types. If operations on wider integer types are more expensive,
> > IndVarSimplify should disable the widening.
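A rough sketch of what that consultation could look like, assuming a
TargetTransformInfo reference gets plumbed into the widening logic; the
helper name and the use of getArithmeticInstrCost on a representative add
are illustrative, not an existing IndVarSimplify interface:

    #include "llvm/Analysis/TargetTransformInfo.h"
    #include "llvm/IR/Instruction.h"
    #include "llvm/IR/Type.h"

    using namespace llvm;

    // Sketch: compare the target's cost for a representative arithmetic op
    // (an add) at the narrow and widened types. On NVPTX64 the 64-bit cost
    // should come back higher, so widening would be skipped.
    static bool isWideningProfitable(Type *NarrowTy, Type *WideTy,
                                     const TargetTransformInfo &TTI) {
      auto NarrowCost = TTI.getArithmeticInstrCost(Instruction::Add, NarrowTy);
      auto WideCost = TTI.getArithmeticInstrCost(Instruction::Add, WideTy);
      return WideCost <= NarrowCost;
    }

The widening code could then simply bail out when a check along these lines
returns false.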
>
> TargetTransformInfo seems like a good place to put a hook for this.
> You're right that 64-bit integer math will be slower for NVPTX targets, as
> the hardware needs to emulate 64-bit integer ops with 32-bit ops.
>
> How much is register usage affected by this in your benchmarks?
>
> >
> > Another thing I am concerned about: are there other optimizations that
> > make similar assumptions about integer widening? Those might cause
> > performance regressions too, just as IndVarSimplify does.
> >
> > Jingyue
> >
> >