Thanks for the pointer, Andrew! I didn't realize there's a report already, and I'll look into that. <br><br><div class="gmail_quote">On Fri Oct 24 2014 at 3:15:38 PM Andrew Trick <<a href="mailto:atrick@apple.com">atrick@apple.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><div>Please see: <a href="http://llvm.org/PR21148" target="_blank">http://llvm.org/PR21148</a></div><div><br></div><div>I updated the bug with my suggestion. I hope it works.</div><div><br></div><div>-Andy</div></div><div style="word-wrap:break-word"><div><br></div><div><blockquote type="cite"><div>On Oct 24, 2014, at 11:29 AM, Justin Holewinski <<a href="mailto:jholewinski@nvidia.com" target="_blank">jholewinski@nvidia.com</a>> wrote:</div><br><div><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">On Fri, 24 Oct 2014, Jingyue Wu wrote:</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><blockquote type="cite" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">Hi, <br>I noticed a significant performance regression (up to 40%) on some internal CUDA benchmarks (a reduced 
example presented below). The root cause of this regression seems to be<br>that IndVarSimplify widens induction variables assuming arithmetic on wider integer types is as cheap as arithmetic on narrower ones. However, this assumption is wrong at<br>least for the NVPTX64 target. <br>Although the NVPTX64 target supports 64-bit arithmetic, the actual NVIDIA GPU typically has only 32-bit integer registers, so one 64-bit arithmetic operation typically ends up<br>as two machine instructions handling the low 32 bits and the high 32 bits respectively. I haven't looked at other GPU targets such as R600, but I suspect this<br>problem is not restricted to the NVPTX64 target. <br>Below is a reduced example:<br>__attribute__((global)) void foo(int n, int *output) {<br> <span> </span>for (int i = 0; i < n; i += 3) {<br>   <span> </span>output[i] = i * i;<br> <span> </span>}<br>}<br>Without widening, the loop body in the PTX (a low-level assembly-like language generated by the NVPTX64 backend) is:<br>BB0_2:                                  // =>This Inner Loop Header: Depth=1        <br>       <span> </span>mul.lo.s32      %r5, %r6, %r6;                                              <br>       <span> </span>st.u32  [%rd4], %r5;                                                        <br>       <span> </span>add.s32         %r6, %r6, 3;                                                <br>       <span> </span>add.s64         %rd4, %rd4, 12;                                              <br>       <span> </span>setp.lt.s32     %p2, %r6, %r3;<br>       <span> </span>@%p2 bra        BB0_2;<br>in which %r6 is the induction variable i. 
<br>With widening, the loop body becomes:<br>BB0_2:                                  // =>This Inner Loop Header: Depth=1        <br>       <span> </span>mul.lo.s64      %rd8, %rd10, %rd10;                                         <br>       <span> </span>st.u32  [%rd9], %rd8;                                                         <br>       <span> </span>add.s64         %rd10, %rd10, 3;                                            <br>       <span> </span>add.s64         %rd9, %rd9, 12;                                             <br>       <span> </span>setp.lt.s64     %p2, %rd10, %rd1;                                           <br>       <span> </span>@%p2 bra        BB0_2;<br>Although the number of PTX instructions in both versions is the same, the widened version uses mul.lo.s64, add.s64, and setp.lt.s64 instructions, which are<br>more expensive than their 32-bit counterparts. Indeed, the SASS code (the disassembly of the actual machine code running on the GPU) of the widened version is<br>significantly longer. 
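Much of the extra length comes from emulating the 64-bit multiply with 32-bit multiply and multiply-add instructions. As a rough sketch of what the hardware-level decomposition looks like (illustrative C++ only; the function name and variable names are invented for this sketch, not taken from any compiler source):

```cpp
#include <cstdint>

// Illustrative sketch: one way a 64-bit multiply can be emulated with
// 32-bit multiply/multiply-add steps, similar in spirit to the
// IMUL.U32.U32/IMAD.U32.U32 sequence in the widened SASS below.
uint64_t mul64_via_32bit_halves(uint64_t a, uint64_t b) {
  uint32_t a_lo = static_cast<uint32_t>(a), a_hi = static_cast<uint32_t>(a >> 32);
  uint32_t b_lo = static_cast<uint32_t>(b), b_hi = static_cast<uint32_t>(b >> 32);

  // Full 32x32 -> 64-bit product of the low halves.
  uint64_t lo_prod = static_cast<uint64_t>(a_lo) * b_lo;
  // The cross terms only contribute to the high 32 bits (mod 2^64).
  uint32_t cross = a_lo * b_hi + a_hi * b_lo;

  return lo_prod + (static_cast<uint64_t>(cross) << 32);
}
```

A 32-bit multiply, by contrast, is a single instruction, which is where the instruction-count gap below comes from.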
<br>Without widening (7 instructions): <br>.L_1:                                                                               <br>       <span> </span>/*0048*/                IMUL R2, R0, R0;                                      <br>       <span> </span>/*0050*/                IADD R0, R0, 0x1;                                   <br>       <span> </span>/*0058*/                ST.E [R4], R2;                                      <br>       <span> </span>/*0060*/                ISETP.NE.AND P0, PT, R0, c[0x0][0x140], PT;         <br>       <span> </span>/*0068*/                IADD R4.CC, R4, 0x4;                                <br>       <span> </span>/*0070*/                IADD.X R5, R5, RZ;                                  <br>       <span> </span>/*0078*/            @P0 BRA `(.L_1);<br>With widening (12 instructions):<br>.L_1:                                                                            <br>       <span> </span>/*0050*/                IMUL.U32.U32 R6.CC, R4, R4;                      <br>       <span> </span>/*0058*/                IADD R0, R0, -0x1;                                    <br>       <span> </span>/*0060*/                IMAD.U32.U32.HI.X R8.CC, R4, R4, RZ;             <br>       <span> </span>/*0068*/                IMAD.U32.U32.X R8, R5, R4, R8;                   <br>       <span> </span>/*0070*/                IMAD.U32.U32 R7, R4, R5, R8;                     <br>       <span> </span>/*0078*/                IADD R4.CC, R4, 0x1;                             <br>       <span> </span>/*0088*/                ST.E [R2], R6;                                   <br>       <span> </span>/*0090*/                IADD.X R5, R5, RZ;                               <br>       <span> </span>/*0098*/                ISETP.NE.AND P0, PT, R0, RZ, PT;                 <br>       <span> </span>/*00a0*/                IADD R2.CC, R2, 0x4;                             <br>       <span> </span>/*00a8*/                IADD.X R3, R3, RZ;                                  <br>       <span> 
</span>/*00b0*/            @P0 BRA `(.L_1);<br>I hope the issue is clear up to this point. So what would be a good fix? I am thinking of having IndVarSimplify consult TargetTransformInfo about the cost<br>of integer arithmetic at different bit widths. If operations on wider integer types are more expensive, IndVarSimplify should disable widening. <br></blockquote><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">TargetTransformInfo seems like a good place to put a hook for this. You're right that 64-bit integer math will be slower for NVPTX targets, as the hardware needs to emulate 64-bit integer ops with 32-bit ops.</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">How much is register usage affected by this in your benchmarks?</span><br 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><blockquote type="cite" style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px">Another thing I am concerned about: are there other optimizations that make similar assumptions about integer widening? Those might cause performance regressions too, just<br>as IndVarSimplify does. <br>Jingyue</blockquote></div></blockquote></div><br></div></blockquote></div>
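To make the proposed check concrete, here is a minimal, self-contained sketch of the policy, under stated assumptions: this is NOT actual LLVM code, and every name in it (the cost-model interface, the concrete cost classes, and shouldWidenIV) is invented for illustration. In LLVM, the analogous per-operation cost query would be a hook consulted through TargetTransformInfo.

```cpp
// Hypothetical sketch of the proposed widening check; names are invented.
struct TargetCostModel {
  virtual ~TargetCostModel() = default;
  // Relative cost of one integer arithmetic op at the given bit width.
  virtual unsigned arithmeticCost(unsigned bitWidth) const = 0;
};

// An NVPTX64-like target: 64-bit integer ops are emulated by pairs of
// 32-bit instructions, so they cost roughly twice as much.
struct NVPTXLikeCosts final : TargetCostModel {
  unsigned arithmeticCost(unsigned bitWidth) const override {
    return bitWidth > 32 ? 2 : 1;
  }
};

// A CPU-like target with native 64-bit integer registers: widening is free.
struct NativeInt64Costs final : TargetCostModel {
  unsigned arithmeticCost(unsigned) const override { return 1; }
};

// The proposed policy: widen an induction variable only if arithmetic at
// the wider width is no more expensive than at the narrow width.
bool shouldWidenIV(const TargetCostModel &tcm, unsigned narrowBits,
                   unsigned wideBits) {
  return tcm.arithmeticCost(wideBits) <= tcm.arithmeticCost(narrowBits);
}
```

Under this sketch an NVPTX-like cost model would reject widening an i32 IV to i64, while a target with native 64-bit registers would still allow it, which matches the behavior the thread is asking for.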