<html><head><meta http-equiv="Content-Type" content="text/html charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><div class="">Please see: <a href="http://llvm.org/PR21148" class="">http://llvm.org/PR21148</a></div><div class=""><br class=""></div><div class="">I updated the bug with my suggestion. I hope it works.</div><div class=""><br class=""></div><div class="">-Andy</div><div class=""><br class=""></div><div><blockquote type="cite" class=""><div class="">On Oct 24, 2014, at 11:29 AM, Justin Holewinski <<a href="mailto:jholewinski@nvidia.com" class="">jholewinski@nvidia.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">On Fri, 24 Oct 2014, Jingyue Wu wrote:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; 
orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">Hi, <br class="">I noticed a significant performance regression (up to 40%) on some internal CUDA benchmarks (a reduced example is presented below). The root cause of this regression appears to be<br class="">that IndVarSimplify widens induction variables on the assumption that arithmetic on wider integer types is as cheap as arithmetic on narrower ones. However, this assumption is wrong, at<br class="">least for the NVPTX64 target. <br class="">Although the NVPTX64 target supports 64-bit arithmetic, the actual NVIDIA GPU typically has only 32-bit integer registers, so one 64-bit arithmetic operation typically ends up<br class="">as two machine instructions, one handling the low 32 bits and the other the high 32 bits. I haven't looked at other GPU targets such as R600, but I suspect this<br class="">problem is not restricted to the NVPTX64 target. <br class="">Below is a reduced example:<br class="">__attribute__((global)) void foo(int n, int *output) {<br class=""> <span class="Apple-converted-space"> </span>for (int i = 0; i < n; i += 3) {<br class=""> <span class="Apple-converted-space"> </span>output[i] = i * i;<br class=""> <span class="Apple-converted-space"> </span>}<br class="">}<br class="">Without widening, the loop body in the PTX (a low-level assembly-like language generated by NVPTX64) is:<br class="">BB0_2: // =>This Inner Loop Header: Depth=1 <br class=""> <span class="Apple-converted-space"> </span>mul.lo.s32 %r5, %r6, %r6; <br class=""> <span class="Apple-converted-space"> </span>st.u32 [%rd4], %r5; <br class=""> <span class="Apple-converted-space"> </span>add.s32 %r6, %r6, 3; <br class=""> <span class="Apple-converted-space"> </span>add.s64 %rd4, %rd4, 12; <br class=""> <span class="Apple-converted-space"> </span>setp.lt.s32 %p2, %r6, %r3;<br class=""> <span class="Apple-converted-space"> </span>@%p2 bra BB0_2;<br class="">in which %r6 is the induction variable i. <br class="">With widening, the loop body becomes:<br class="">BB0_2: // =>This Inner Loop Header: Depth=1 <br class=""> <span class="Apple-converted-space"> </span>mul.lo.s64 %rd8, %rd10, %rd10; <br class=""> <span class="Apple-converted-space"> </span>st.u32 [%rd9], %rd8; <br class=""> <span class="Apple-converted-space"> </span>add.s64 %rd10, %rd10, 3; <br class=""> <span class="Apple-converted-space"> </span>add.s64 %rd9, %rd9, 12; <br class=""> <span class="Apple-converted-space"> </span>setp.lt.s64 %p2, %rd10, %rd1; <br class=""> <span class="Apple-converted-space"> </span>@%p2 bra BB0_2;<br class="">Although the number of PTX instructions in the two versions is the same, the version with widening uses mul.lo.s64, add.s64, and setp.lt.s64 instructions, which are<br class="">more expensive than their 32-bit counterparts. Indeed, the SASS code (the disassembly of the actual machine code running on the GPU) of the version with widening is<br class="">significantly longer. 
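To make the transformation concrete before looking at the SASS: at the source level, the widening effectively rewrites the loop above along the following lines (a sketch in plain C; the real transform happens on LLVM IR, and the function name is illustrative):

```c
/* Sketch only: the effect of IndVarSimplify's widening, expressed in C.
 * The 32-bit induction variable becomes 64-bit, so the multiply, the
 * increment, and the loop-exit compare all become 64-bit operations. */
void foo_widened(int n, int *output) {
  for (long long i = 0; i < (long long)n; i += 3) { /* widened i */
    output[i] = (int)(i * i);                       /* 64-bit mul */
  }
}
```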
<br class="">Without widening (7 instructions): <br class="">.L_1: <br class=""> <span class="Apple-converted-space"> </span>/*0048*/ IMUL R2, R0, R0; <br class=""> <span class="Apple-converted-space"> </span>/*0050*/ IADD R0, R0, 0x1; <br class=""> <span class="Apple-converted-space"> </span>/*0058*/ ST.E [R4], R2; <br class=""> <span class="Apple-converted-space"> </span>/*0060*/ ISETP.NE.AND P0, PT, R0, c[0x0][0x140], PT; /*0068*/ IADD R4.CC, R4, 0x4; <br class=""> <span class="Apple-converted-space"> </span>/*0070*/ IADD.X R5, R5, RZ; <br class=""> <span class="Apple-converted-space"> </span>/*0078*/ @P0 BRA `(.L_1);<br class="">With widening (12 instructions):<br class="">.L_1: <br class=""> <span class="Apple-converted-space"> </span>/*0050*/ IMUL.U32.U32 R6.CC, R4, R4; <br class=""> <span class="Apple-converted-space"> </span>/*0058*/ IADD R0, R0, -0x1; <br class=""> <span class="Apple-converted-space"> </span>/*0060*/ IMAD.U32.U32.HI.X R8.CC, R4, R4, RZ; <br class=""> <span class="Apple-converted-space"> </span>/*0068*/ IMAD.U32.U32.X R8, R5, R4, R8; <br class=""> <span class="Apple-converted-space"> </span>/*0070*/ IMAD.U32.U32 R7, R4, R5, R8; <br class=""> <span class="Apple-converted-space"> </span>/*0078*/ IADD R4.CC, R4, 0x1; <br class=""> <span class="Apple-converted-space"> </span>/*0088*/ ST.E [R2], R6; <br class=""> <span class="Apple-converted-space"> </span>/*0090*/ IADD.X R5, R5, RZ; <br class=""> <span class="Apple-converted-space"> </span>/*0098*/ ISETP.NE.AND P0, PT, R0, RZ, PT; <br class=""> <span class="Apple-converted-space"> </span>/*00a0*/ IADD R2.CC, R2, 0x4; <br class=""> <span class="Apple-converted-space"> </span>/*00a8*/ IADD.X R3, R3, RZ; <br class=""> <span class="Apple-converted-space"> </span>/*00b0*/ @P0 BRA `(.L_1);<br class="">I hope the issue is clear up to this point. So what's a good solution to fix this issue? 
I am thinking of having IndVarSimplify consult TargetTransformInfo about the cost<br class="">of integer arithmetic on different types. If operations on wider integer types are more expensive, IndVarSimplify should not perform the widening. <br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">TargetTransformInfo seems like a good place to put a hook for this. 
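For reference, the reason the 64-bit operations cost more on this hardware: with only 32-bit integer registers, each 64-bit add must be split into an add of the low words that produces a carry and an add of the high words that consumes it, which is exactly the IADD R4.CC / IADD.X pair in the SASS above. A minimal C sketch of that decomposition (the function name is illustrative, not an LLVM API):

```c
#include <stdint.h>

/* Sketch only: emulate one 64-bit add with two 32-bit adds, the way
 * 32-bit register hardware does (cf. the IADD.CC / IADD.X pair). */
uint64_t add64_via_32(uint32_t a_lo, uint32_t a_hi,
                      uint32_t b_lo, uint32_t b_hi) {
  uint32_t lo = a_lo + b_lo;         /* low words; may wrap around  */
  uint32_t carry = (lo < a_lo);      /* carry out of the low add    */
  uint32_t hi = a_hi + b_hi + carry; /* high words plus carry in    */
  return ((uint64_t)hi << 32) | lo;
}
```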
You're right that 64-bit integer math will be slower for NVPTX targets, as the hardware needs to emulate 64-bit integer ops with 32-bit ops.</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">How much is register usage affected by this in your benchmarks?</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; font-size: 12px; font-style: 
normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">Another thing I am concerned about: are there other optimizations that make similar assumptions about integer widening? Those might cause performance regressions too, just<br class="">as IndVarSimplify does. <br class="">Jingyue</blockquote></div></blockquote></div><br class=""></body></html>