<div dir="ltr">Thanks for pointing that out. In this case, it seems that vectorization is not actually profitable. Does that mean we always need to sext i32 to i64 so that we can make full use of the lanes in an XMM register? Thanks.</div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jun 26, 2015 at 11:56 AM, suyog sarda <span dir="ltr"><<a href="mailto:sardask01@gmail.com" target="_blank">sardask01@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><p dir="ltr"><br>
> For example, I have the following IR code,<br>
><br>
> for.cond.preheader: ; preds = %if.end18<br>
> %mul = mul i32 %12, %3<br>
> %cmp21128 = icmp sgt i32 %mul, 0<br>
> br i1 %cmp21128, label %for.body.preheader, label %return<br>
><br>
> for.body.preheader: ; preds = %for.cond.preheader<br>
> %19 = mul i32 %12, %3<br>
> %20 = add i32 %19, -1<br>
> %21 = zext i32 %20 to i64<br>
> %22 = add i64 %21, 1<br>
> %end.idx = add i64 %21, 1<br>
> %n.vec = and i64 %22, 8589934584<br>
> %cmp.zero = icmp eq i64 %n.vec, 0<br>
> br i1 %cmp.zero, label %middle.block, label %vector.ph<br>
><br>
> The corresponding assembly code is:<br>
> # BB#3: # %for.cond.preheader <br>
> imull %r9d, %ebx <br>
> testl %ebx, %ebx <br>
> jle .LBB10_63 <br>
> # BB#4: # %for.body.preheader <br>
> leal -1(%rbx), %eax <br>
> incq %rax <br>
> xorl %edx, %edx <br>
> movabsq $8589934584, %rcx # imm = 0x1FFFFFFF8 <br>
> andq %rax, %rcx <br>
> je .LBB10_8 <br>
><br>
> I changed all the scalar operands to <2 x ValueType> ones. The IR becomes the following:<br>
> for.cond.preheader: ; preds = %if.end18<br>
> %mulS44_D = mul <2 x i32> %splatLDS24_D.splat, %splatLDS7_D.splat<br>
> %cmp21128S45_D = icmp sgt <2 x i32> %mulS44_D, zeroinitializer<br>
> %sextS46_D = sext <2 x i1> %cmp21128S45_D to <2 x i64><br>
> %BCS46_D = bitcast <2 x i64> %sextS46_D to i128<br>
> %mskS46_D = icmp ne i128 %BCS46_D, 0<br>
> br i1 %mskS46_D, label %for.body.preheader, label %return<br>
><br>
> for.body.preheader: ; preds = %for.cond.preheader<br>
> %S47_D = mul <2 x i32> %splatLDS24_D.splat, %splatLDS7_D.splat<br>
> %S48_D = add <2 x i32> %S47_D, <i32 -1, i32 -1><br>
> %S49_D = zext <2 x i32> %S48_D to <2 x i64><br>
> %S50_D = add <2 x i64> %S49_D, <i64 1, i64 1><br>
> %end.idxS51_D = add <2 x i64> %S49_D, <i64 1, i64 1><br>
> %n.vecS52_D = and <2 x i64> %S50_D, <i64 8589934584, i64 8589934584><br>
> %cmp.zeroS53_D = icmp eq <2 x i64> %n.vecS52_D, zeroinitializer<br>
> %sextS54_D = sext <2 x i1> %cmp.zeroS53_D to <2 x i64><br>
> %BCS54_D = bitcast <2 x i64> %sextS54_D to i128<br>
> %mskS54_D = icmp ne i128 %BCS54_D, 0<br>
> br i1 %mskS54_D, label %middle.block, label %vector.ph<br>
> <br>
> Now the assembly for the above IR code is:<br>
> # BB#4: # %for.cond.preheader<br>
> vmovdqa 144(%rsp), %xmm0 # 16-byte Reload<br>
> vpmuludq %xmm7, %xmm0, %xmm2<br>
> vpsrlq $32, %xmm7, %xmm4<br>
> vpmuludq %xmm4, %xmm0, %xmm4<br>
> vpsllq $32, %xmm4, %xmm4<br>
> vpaddq %xmm4, %xmm2, %xmm2<br>
> vpsrlq $32, %xmm0, %xmm4<br>
> vpmuludq %xmm7, %xmm4, %xmm4<br>
> vpsllq $32, %xmm4, %xmm4<br>
> vpaddq %xmm4, %xmm2, %xmm2<br>
> vpextrq $1, %xmm2, %rax<br>
> cltq<br>
> vmovq %rax, %xmm4<br>
> vmovq %xmm2, %rax<br>
> cltq<br>
> vmovq %rax, %xmm5<br>
> vpunpcklqdq %xmm4, %xmm5, %xmm4 # xmm4 = xmm5[0],xmm4[0]<br>
> vpcmpgtq %xmm3, %xmm4, %xmm3<br>
> vptest %xmm3, %xmm3<br>
> je .LBB10_66<br>
> # BB#5: # %for.body.preheader<br>
> vpaddq %xmm15, %xmm2, %xmm3<br>
> vpand %xmm15, %xmm3, %xmm3<br>
> vpaddq .LCPI10_1(%rip), %xmm3, %xmm8<br>
> vpand .LCPI10_5(%rip), %xmm8, %xmm5<br>
> vpxor %xmm4, %xmm4, %xmm4<br>
> vpcmpeqq %xmm4, %xmm5, %xmm6<br>
> vptest %xmm6, %xmm6<br>
> jne .LBB10_9<br>
></p>
</div></div><p dir="ltr">As Mats pointed out, this may be the same problem as:<br>
<a href="https://llvm.org/bugs/show_bug.cgi?id=22703" target="_blank">https://llvm.org/bugs/show_bug.cgi?id=22703</a></p>
<p dir="ltr">Basically, the code is generated for AVX2, where XMM registers are 128 bits wide. Some of the ops above are <2 x i32> and involve sext to <2 x i64>, bitcast, etc., so the generated code contains extra vector instructions.</p>
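<p dir="ltr">To make the source of the extra instructions a bit more concrete, here is a rough C sketch (not compiler output) of what the vpmuludq / vpsrlq / vpsllq / vpaddq run in %for.cond.preheader appears to compute in each lane, assuming the <2 x i32> multiply ends up as a multiply on 64-bit lanes. The typedefs and helper names are hypothetical and exist only for this illustration.</p>

```c
/* Hypothetical names; each function models ONE 64-bit lane of the vector code. */
typedef unsigned long long u64; /* assumed to be 64 bits wide */
typedef unsigned int u32;       /* assumed to be 32 bits wide */

/* One vpmuludq lane: multiply the low 32 bits of each input
   and keep the full 64-bit product. */
static u64 pmuludq_lane(u64 a, u64 b) {
    return (a & 0xFFFFFFFFULL) * (b & 0xFFFFFFFFULL);
}

/* A 64-bit multiply assembled from three 32x32->64 multiplies,
   in the same order as the instructions in the listing. */
static u64 mul64_lane(u64 a, u64 b) {
    u64 lo_lo = pmuludq_lane(a, b);             /* vpmuludq                     */
    u64 lo_hi = pmuludq_lane(a, b >> 32) << 32; /* vpsrlq + vpmuludq + vpsllq   */
    u64 hi_lo = pmuludq_lane(a >> 32, b) << 32; /* vpsrlq + vpmuludq + vpsllq   */
    return lo_lo + lo_hi + hi_lo;               /* two vpaddq                   */
}

/* The i32 result is only the low half of the promoted lane,
   so the two cross terms above contribute nothing to it. */
static u32 mul32_via_promoted_lane(u32 a, u32 b) {
    return (u32)mul64_lane(a, b);
}
```

<p dir="ltr">Under that assumption, the scalar imull is replaced by three multiplies plus shifts and adds per vector, which is consistent with the longer instruction sequence shown above.</p>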
<p dir="ltr">Regards,<br>
Suyog Sarda</p>
</blockquote></div><br></div>