[LLVMdev] IndVar widening in IndVarSimplify causing performance regression on GPU programs
Andrew Trick
atrick at apple.com
Fri Oct 24 15:15:37 PDT 2014
Please see: http://llvm.org/PR21148
I updated the bug with my suggestion. I hope it works.
-Andy
> On Oct 24, 2014, at 11:29 AM, Justin Holewinski <jholewinski at nvidia.com> wrote:
>
> On Fri, 24 Oct 2014, Jingyue Wu wrote:
>
>> Hi,
>> I noticed a significant performance regression (up to 40%) on some internal CUDA benchmarks (a reduced example is presented below). The root cause of this regression seems
>> to be that IndVarSimplify widens induction variables on the assumption that arithmetic on wider integer types is as cheap as on narrower ones. However, this assumption is wrong, at
>> least for the NVPTX64 target.
>> Although the NVPTX64 target supports 64-bit arithmetic, the actual NVIDIA GPU typically has only 32-bit integer registers, so one 64-bit arithmetic operation typically ends up
>> as two machine instructions, one handling the low 32 bits and the other the high 32 bits. I haven't looked at other GPU targets such as R600, but I suspect this
>> problem is not restricted to the NVPTX64 target.
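
[Editorial sketch, not part of the original message.] The low-half/high-half split described above can be illustrated in plain C++. This is a minimal model under stated assumptions: the `U64Pair` type and the helpers `add64`, `toU64`, and `fromU64` are hypothetical names invented for illustration, and the code only mimics the add-with-carry pattern (compare the `IADD ... .CC` / `IADD.X` pairs in the SASS listings further down). It is not what the NVPTX backend actually emits.

```cpp
#include <cstdint>

// Hypothetical model: a 64-bit value held in two 32-bit registers.
struct U64Pair {
    uint32_t lo;  // low 32 bits
    uint32_t hi;  // high 32 bits
};

// Add two 64-bit values using only 32-bit operations: one add for the
// low half that produces a carry, then one add-with-carry for the high half.
inline U64Pair add64(U64Pair a, U64Pair b) {
    U64Pair r;
    r.lo = a.lo + b.lo;                        // wraps modulo 2^32
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  // carry out of the low half
    r.hi = a.hi + b.hi + carry;                // high half plus carry
    return r;
}

// Helpers to convert between the pair model and a native 64-bit integer.
inline uint64_t toU64(U64Pair p) {
    return (static_cast<uint64_t>(p.hi) << 32) | p.lo;
}

inline U64Pair fromU64(uint64_t v) {
    return U64Pair{static_cast<uint32_t>(v), static_cast<uint32_t>(v >> 32)};
}
```

Every widened add pays for both halves even when the high half never changes, which is roughly why the widened loop grows at the machine level.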
>> Below is a reduced example:
>> __attribute__((global)) void foo(int n, int *output) {
>>   for (int i = 0; i < n; i += 3) {
>>     output[i] = i * i;
>>   }
>> }
>> Without widening, the loop body in the PTX (a low-level assembly-like language generated by NVPTX64) is:
>> BB0_2: // =>This Inner Loop Header: Depth=1
>> mul.lo.s32 %r5, %r6, %r6;
>> st.u32 [%rd4], %r5;
>> add.s32 %r6, %r6, 3;
>> add.s64 %rd4, %rd4, 12;
>> setp.lt.s32 %p2, %r6, %r3;
>> @%p2 bra BB0_2;
>> in which %r6 is the induction variable i.
>> With widening, the loop body becomes:
>> BB0_2: // =>This Inner Loop Header: Depth=1
>> mul.lo.s64 %rd8, %rd10, %rd10;
>> st.u32 [%rd9], %rd8;
>> add.s64 %rd10, %rd10, 3;
>> add.s64 %rd9, %rd9, 12;
>> setp.lt.s64 %p2, %rd10, %rd1;
>> @%p2 bra BB0_2;
>> Although the number of PTX instructions is the same in both versions, the version with widening uses more 64-bit instructions (mul.lo.s64, add.s64, and setp.lt.s64), which are
>> more expensive than their 32-bit counterparts. Indeed, the SASS code (disassembly of the actual machine code running on GPUs) of the version with widening is
>> significantly longer.
>> Without widening (7 instructions):
>> .L_1:
>> /*0048*/ IMUL R2, R0, R0;
>> /*0050*/ IADD R0, R0, 0x1;
>> /*0058*/ ST.E [R4], R2;
>> /*0060*/ ISETP.NE.AND P0, PT, R0, c[0x0][0x140], PT;
>> /*0068*/ IADD R4.CC, R4, 0x4;
>> /*0070*/ IADD.X R5, R5, RZ;
>> /*0078*/ @P0 BRA `(.L_1);
>> With widening (12 instructions):
>> .L_1:
>> /*0050*/ IMUL.U32.U32 R6.CC, R4, R4;
>> /*0058*/ IADD R0, R0, -0x1;
>> /*0060*/ IMAD.U32.U32.HI.X R8.CC, R4, R4, RZ;
>> /*0068*/ IMAD.U32.U32.X R8, R5, R4, R8;
>> /*0070*/ IMAD.U32.U32 R7, R4, R5, R8;
>> /*0078*/ IADD R4.CC, R4, 0x1;
>> /*0088*/ ST.E [R2], R6;
>> /*0090*/ IADD.X R5, R5, RZ;
>> /*0098*/ ISETP.NE.AND P0, PT, R0, RZ, PT;
>> /*00a0*/ IADD R2.CC, R2, 0x4;
>> /*00a8*/ IADD.X R3, R3, RZ;
>> /*00b0*/ @P0 BRA `(.L_1);
>> I hope the issue is clear up to this point. So what's a good way to fix it? I am thinking of having IndVarSimplify consult TargetTransformInfo about the cost
>> of integer arithmetic on different types. If operations on wider integer types are more expensive, IndVarSimplify should disable the widening.
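
[Editorial sketch, not part of the original message.] The shape of the proposed check might look roughly like the following. Everything here is an assumption made for illustration: `ToyTargetCosts`, `arithmeticCost`, and `shouldWidenIndVar` are hypothetical names, the cost model is invented, and none of this is the real TargetTransformInfo interface or the change that was eventually made for PR21148.

```cpp
// Hypothetical stand-in for a target cost query. This is NOT the LLVM
// TargetTransformInfo API; the names and the cost model are invented.
struct ToyTargetCosts {
    unsigned nativeIntBits;  // widest integer width handled in one register

    // Relative cost of one integer operation at the given width, modeled
    // as the number of native-width pieces it must be split into.
    unsigned arithmeticCost(unsigned bits) const {
        return (bits + nativeIntBits - 1) / nativeIntBits;
    }
};

// Widen an induction variable only if arithmetic at the wide width is
// no more expensive than at the narrow width on this target.
inline bool shouldWidenIndVar(const ToyTargetCosts &target,
                              unsigned narrowBits, unsigned wideBits) {
    return target.arithmeticCost(wideBits) <= target.arithmeticCost(narrowBits);
}
```

Under this toy model, a target with 32-bit native registers rejects widening from i32 to i64, while a target with 64-bit registers accepts it.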
>
> TargetTransformInfo seems like a good place to put a hook for this. You're right that 64-bit integer math will be slower for NVPTX targets, as the hardware needs to emulate 64-bit integer ops with 32-bit ops.
>
> How much is register usage affected by this in your benchmarks?
>
>> Another thing I am concerned about: are there other optimizations that make similar assumptions about integer widening? Those might cause performance regressions too, just
>> as IndVarSimplify does.
>> Jingyue