<div dir="ltr"><div><div>------------------------------------ IR ------------------------------------------------------------------</div><div>if.then.i.i.i.i.i.i:                              ; preds = %if.then4</div><div>  %S25_D = zext <2 x i32> %splatLDS17_D.splat to <2 x i64></div><div>  %umul_with_overflow.i.iS26_D = shl <2 x i64> %S25_D, <i64 3, i64 3></div><div>  %extumul_with_overflow.i.iS26_D = extractelement <2 x i64> %umul_with_overflow.i.iS26_D, i32 1</div><div>  %call5.i.i = tail call noalias i8* @_Znam(i64 %extumul_with_overflow.i.iS26_D) #22</div><div>  %splatCallS27_D.splatinsert = insertelement <2 x i8*> undef, i8* %call5.i.i, i32 0</div><div>  %splatCallS27_D.splat = shufflevector <2 x i8*> %splatCallS27_D.splatinsert, <2 x i8*> undef, <2 x i32> zeroinitializer</div><div>  %bitcastS28_D = bitcast <2 x i8*> %splatCallS27_D.splat to <2 x double*></div><div>  %extractS29_D = extractelement <2 x double*> %bitcastS28_D, i32 1</div><div>  store double* %extractS29_D, double** %val.i.i, align 8</div><div>  %val.i3.i.i = getelementptr inbounds %class.Vector* %__x, i64 0, i32 3</div><div>  %4 = load double** %val.i3.i.i, align 8, !tbaa !22</div><div>  %splatLDS31_D.splatinsert = insertelement <2 x double*> undef, double* %4, i32 0</div><div>  %splatLDS31_D.splat = shufflevector <2 x double*> %splatLDS31_D.splatinsert, <2 x double*> undef, <2 x i32> zeroinitializer</div><div>  %bitcastS32_D = bitcast <2 x double*> %splatLDS31_D.splat to <2 x i8*></div><div>  %extbitcastS32_D = extractelement <2 x i8*> %bitcastS32_D, i32 1</div><div>  tail call void @llvm.memmove.p0i8.p0i8.i64(i8* %call5.i.i, i8* %extbitcastS32_D, i64 %extumul_with_overflow.i.iS26_D, i32 8, i1 false) #9</div><div>  br label %invoke.cont</div><div><br></div><div>invoke.cont:                                      ; preds = %if.then.i.i.i.i.i.i, %if.then4</div><div>  %sub.ptr.rhs.cast.i = ptrtoint %class.Vector* %__position.coerce to i64</div><div>  %sub.ptr.rhs.cast.iS35_D = ptrtoint <2 x 
%class.Vector*> %splatInsMapS35_D.splat to <2 x i64></div><div>  %sub.ptr.sub.iS36_D = sub <2 x i64> %sub.ptr.rhs.castS8_D, %sub.ptr.rhs.cast.iS35_D</div><div>  %sub.ptr.div.iS37_D = sdiv <2 x i64> %sub.ptr.sub.iS36_D, <i64 24, i64 24></div><div>  %extractS196_D = extractelement <2 x i64> %sub.ptr.div.iS37_D, i32 1</div><div>  %cmp10S38_D = icmp ugt <2 x i64> %sub.ptr.div.iS37_D, %splatInsMapS1_D.splat</div><div>  %zextS39_D = sext <2 x i1> %cmp10S38_D to <2 x i64></div><div>  %BCS39_D = bitcast <2 x i64> %zextS39_D to i128</div><div>  %mskS39_D = icmp ne i128 %BCS39_D, 0</div><div>  br i1 %mskS39_D, label %if.then11, label %if.else</div></div><div><br></div><div>-------------------------------------------- Assembly -----------------------------------------------------------------</div><div><br></div><div><div># BB#3:                                 # %if.then.i.i.i.i.i.i</div><div>    vpsllq  $3, %xmm0, %xmm0</div><div>    vpextrq $1, %xmm0, %rbx</div><div>    movq    %rbx, %rdi</div><div>    vmovaps %xmm2, 96(%rsp)         # 16-byte Spill</div><div>    vmovaps %xmm5, 64(%rsp)         # 16-byte Spill</div><div>    vmovdqa %xmm6, 16(%rsp)         # 16-byte Spill</div><div>    callq   _Znam</div><div>    movq    %rax, 128(%rsp)</div><div>    movq    16(%r12), %rsi</div><div>    movq    %rax, %rdi</div><div>    movq    %rbx, %rdx</div><div>    callq   memmove</div><div>    vmovdqa 16(%rsp), %xmm6         # 16-byte Reload</div><div>    vmovaps 64(%rsp), %xmm5         # 16-byte Reload</div><div>    vmovaps 96(%rsp), %xmm2         # 16-byte Reload</div><div>    vmovdqa .LCPI582_0(%rip), %xmm4</div><div>.LBB582_4:                              # %invoke.cont</div><div>    vmovaps %xmm2, 96(%rsp)         # 16-byte Spill</div><div>    vmovdqa 48(%rsp), %xmm0         # 16-byte Reload</div><div>    vpsubq  %xmm0, %xmm2, %xmm0</div><div>    vpextrq $1, %xmm0, %rax</div><div>    movabsq $3074457345618258603, %rcx # imm = 0x2AAAAAAAAAAAAAAB</div><div>    imulq   %rcx</div><div>  
  movq    %rdx, %rax</div><div>    shrq    $63, %rax</div><div>    sarq    $2, %rdx</div><div>    addq    %rax, %rdx</div><div>    vmovq   %rdx, %xmm1</div><div>    vmovq   %xmm0, %rax</div><div>    imulq   %rcx</div><div>    movq    %rdx, %rax</div><div>    shrq    $63, %rax</div><div>    sarq    $2, %rdx</div><div>    addq    %rax, %rdx</div><div>    vmovq   %rdx, %xmm0</div></div><div><div>    vpunpcklqdq %xmm1, %xmm0, %xmm1 # xmm1 = xmm0[0],xmm1[0]</div><div>    vpxor   %xmm4, %xmm1, %xmm0</div><div>    vpcmpgtq    %xmm6, %xmm0, %xmm0</div><div>    vptest  %xmm0, %xmm0</div><div>    je  .LBB582_49</div></div><div><br></div><div>Thanks,</div><div>Zhi</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 24, 2015 at 10:16 AM, Philip Reames <span dir="ltr"><<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

<br>

On 07/24/2015 03:42 AM, Benjamin Kramer wrote:<div><div class="h5"><br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

On 24.07.2015, at 08:06, zhi chen <<a href="mailto:zchenhn@gmail.com" target="_blank">zchenhn@gmail.com</a>> wrote:<br>

<br>

It seems that it's hard to vectorize int64 operations in LLVM. For example, LLVM 3.4 generates very complicated code for the following IR. I am running on a Haswell processor. Is it because there are no suitable AVX/AVX2 instructions for int64? The same thing also happens with zext <2 x i32> -> <2 x i64> and trunc <2 x i64> -> <2 x i32>. Any ideas on how to optimize these instructions? Thanks.<br>

<br>

%sub.ptr.sub.i6.i.i.i.i = sub <2 x i64> %sub.ptr.lhs.cast.i4.i.i.i.i, %sub.ptr.rhs.cast.i5.i.i.i.i<br>

%sub.ptr.div.i7.i.i.i.i = sdiv <2 x i64> %sub.ptr.sub.i6.i.i.i.i, <i64 24, i64 24><br>

<br>

Assembly:<br>

     vpsubq  %xmm6, %xmm5, %xmm5<br>

     vmovq   %xmm5, %rax<br>

     movabsq $3074457345618258603, %rbx # imm = 0x2AAAAAAAAAAAAAAB<br>

     imulq   %rbx<br>

     movq    %rdx, %rcx<br>

     movq    %rcx, %rax<br>

     shrq    $63, %rax<br>

     shrq    $2, %rcx<br>

     addl    %eax, %ecx<br>

     vpextrq $1, %xmm5, %rax<br>

     imulq   %rbx<br>

     movq    %rdx, %rax<br>

     shrq    $63, %rax<br>

     shrq    $2, %rdx<br>

     addl    %eax, %edx<br>

     movslq  %edx, %rax<br>

     vmovq   %rax, %xmm5<br>

     movslq  %ecx, %rax<br>

     vmovq   %rax, %xmm6<br>

     vpunpcklqdq %xmm5, %xmm6, %xmm5 # xmm5 = xmm6[0],xmm5[0]<br>

</blockquote>

AVX2 doesn't have integer vector division instructions, and LLVM lowers divides by constants into (128-bit) multiplies. However, AVX2 doesn't have a way to get at the upper 64 bits of a 64x64->128-bit multiply either, so LLVM extracts each lane and uses the scalar imulq instruction to do that. There's not much room to optimize here given the limitations of AVX2.<br>
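<br>
As a rough illustration of that lowering, here is a minimal C sketch of the per-lane arithmetic, using the magic constant and shift amounts from the assembly posted at the top of this message. The function name sdiv24_magic is made up for this example, and the code assumes a GCC/Clang-style compiler: __int128 is an extension, and right shifts of negative values are assumed to be arithmetic.<br>

```c
#include <stdint.h>

/* Hypothetical helper: signed division by 24 without a divide instruction.
 * Mirrors the scalar sequence in the posted assembly:
 *   imulq (high half in %rdx), sarq $2, shrq $63, addq.
 * Assumes __int128 (GCC/Clang extension) and arithmetic right shift
 * of negative signed values. */
static int64_t sdiv24_magic(int64_t x) {
    /* ceil(2^64 / 6); combined with the shift by 2 this divides by 24. */
    const int64_t magic = INT64_C(0x2AAAAAAAAAAAAAAB);

    /* High 64 bits of the signed 64x64->128 product. This is the part
     * AVX2 has no vector instruction for. */
    int64_t hi = (int64_t)(((__int128)x * magic) >> 64);

    /* Finish the division by 4 (floor for negative values). */
    int64_t q = hi >> 2;

    /* Add the sign bit so negative results round toward zero. */
    q += (int64_t)((uint64_t)hi >> 63);
    return q;
}
```

For example, sdiv24_magic(48) gives 2 and sdiv24_magic(-25) gives -1, matching C's truncating division.<br>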

<br>

You seem to be subtracting pointers though, so if you can guarantee that the pointers are aligned (so that their difference is always a multiple of 24), you could set the 'exact' flag on your 'sdiv' instruction. That should give better code.<br>
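<br>
To sketch why 'exact' can help: when the dividend is known to be a multiple of 24, the division can be done with a shift and a multiply that needs only the low 64 bits of the product, so no high-half multiply is required. The C below is an illustration under that assumption, not necessarily the exact sequence LLVM emits. The name sdiv24_exact is made up for this example, and the code assumes arithmetic right shift and wrapping unsigned-to-signed conversion, as GCC and Clang provide.<br>

```c
#include <stdint.h>

/* Hypothetical helper: divide by 24 when x is known to be an exact
 * multiple of 24 (the promise that 'sdiv exact' makes).
 * Results are meaningless if x is not a multiple of 24. */
static int64_t sdiv24_exact(int64_t x) {
    /* 24 = 8 * 3. Remove the factor of 8 with an arithmetic shift;
     * no bits are lost because x is a multiple of 8. */
    int64_t t = x >> 3;

    /* Multiplicative inverse of 3 modulo 2^64:
     * 3 * 0xAAAAAAAAAAAAAAAB == 1 (mod 2^64). */
    const uint64_t inv3 = UINT64_C(0xAAAAAAAAAAAAAAAB);

    /* t is a multiple of 3, so t * inv3 (mod 2^64) equals t / 3.
     * Only the low 64 bits of the product are used. */
    return (int64_t)((uint64_t)t * inv3);
}
```

For example, sdiv24_exact(48) gives 2 and sdiv24_exact(-48) gives -2.<br>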

</blockquote></div></div>

Depending on what you're using the result of the divide for, there might be other optimizations that could be applied as well.  Can you give a slightly larger context for your source IR?  (1-2 levels of uses/defs out from the instructions would help.)<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

- Ben<br>

<br>

<br>


</blockquote>

<br>

</blockquote></div><br></div>