<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    This snippet of IR is interesting:<br>

    <div>  %sub.ptr.div.iS37_D = sdiv <2 x i64>

      %sub.ptr.sub.iS36_D, <i64 24, i64 24></div>

    <div>  %cmp10S38_D = icmp ugt <2 x i64> %sub.ptr.div.iS37_D,

      %splatInsMapS1_D.splat</div>

    <div>  %zextS39_D = sext <2 x i1> %cmp10S38_D to <2 x

      i64></div>

    <div>  %BCS39_D = bitcast <2 x i64> %zextS39_D to i128</div>

    <div>  %mskS39_D = icmp ne i128 %BCS39_D, 0</div>

    <div>  br i1 %mskS39_D, label %if.then11, label %if.else<br>

      <br>

      It looks like %mskS39_D is basically a test of whether either of
      the vector elements produced by the divide is ugt the
      corresponding element of %splatInsMapS1_D.splat.  I can't say for
      sure that we can do better here, since I haven't studied our
      vector canonicalization rules enough, but this seems like
      something which could possibly be improved.  <br>
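      <br>
      To make that reading concrete, here is a rough Python model of the
      sequence (icmp ugt, sext to i64 lanes, bitcast to i128, icmp ne 0)
      next to a plain "any lane is ugt" test.  The function names are
      mine, not anything in LLVM, and the lane order in the bitcast
      assumes a little-endian target such as x86.<br>
<pre>
```python
LANE_BITS = 64
LANE_MOD = 2 ** LANE_BITS


def msk_via_i128(quot, splat):
    """Follow the IR step by step for a two-lane vector."""
    # icmp ugt: compare each lane as an unsigned 64-bit value.
    cmp_lanes = [(q % LANE_MOD) > (s % LANE_MOD) for q, s in zip(quot, splat)]
    # sext i1 to i64: true becomes all-ones, false becomes zero.
    sext_lanes = [LANE_MOD - 1 if c else 0 for c in cmp_lanes]
    # bitcast to i128: lane 0 lands in the low 64 bits (little-endian).
    as_i128 = sext_lanes[0] + sext_lanes[1] * LANE_MOD
    # icmp ne i128 ..., 0
    return as_i128 != 0


def msk_direct(quot, splat):
    """The same predicate written as 'any lane is unsigned-greater'."""
    return any((q % LANE_MOD) > (s % LANE_MOD) for q, s in zip(quot, splat))
```
</pre>
      Both functions agree on every input I tried, which is why the
      sext/bitcast pair looks like it is only there to reduce the two
      i1 lanes to a single branch condition.<br>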

      <br>

      This is interesting:<br>

      <div>  %splatCallS27_D.splatinsert = insertelement <2 x i8*>

        undef, i8* %call5.i.i, i32 0</div>

      <div>  %splatCallS27_D.splat = shufflevector <2 x i8*>

        %splatCallS27_D.splatinsert, <2 x i8*> undef, <2 x

        i32> zeroinitializer</div>

      <br>

      Can't that shufflevector be replaced with:<br>
        %splatCallS27_D.splat = insertelement <2 x i8*>
      %splatCallS27_D.splatinsert, i8* %call5.i.i, i32 1<br>

      <br>

      Again, without knowing how we canonicalize such things, this is
      not necessarily a win.  It just looks suspicious.  <br>

      <br>

      The bitcast/extractelement sequence following that shufflevector

      is also interesting.  It looks like it could be rewritten in terms

      of the i8* %call5.i.i and a bitcast.  <br>
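      <br>
      To spell out both observations, here is a toy Python model of the
      three vector operations involved.  The helpers only mirror the IR
      opcode names; they are my own stand-ins, undef is modeled as None,
      and the pointer is a placeholder string.<br>
<pre>
```python
UNDEF = None  # stand-in for an undef lane; an assumption of this toy model


def insertelement(vec, value, index):
    """Return a copy of vec with the lane at index replaced by value."""
    out = list(vec)
    out[index] = value
    return out


def shufflevector(a, b, mask):
    """Pick lanes from the concatenation of a and b, one per mask entry."""
    lanes = list(a) + list(b)
    return [lanes[i] for i in mask]


def extractelement(vec, index):
    """Return the lane at index."""
    return vec[index]


call = "call5.i.i"  # placeholder for the pointer returned by _Znam

splatinsert = insertelement([UNDEF, UNDEF], call, 0)
# Original form: shufflevector with a zeroinitializer mask.
splat_shuffle = shufflevector(splatinsert, [UNDEF, UNDEF], [0, 0])
# Suggested form: a second insertelement into lane 1.
splat_insert = insertelement(splatinsert, call, 1)
# Lane 1 of the splat is the scalar again, so the extract can be bypassed.
lane1 = extractelement(splat_shuffle, 1)
```
</pre>
      In this model both splat forms produce the same two lanes, and the
      extracted lane is the original scalar.  The pointer bitcast in the
      IR only changes the type, not the value, so it is left out here.<br>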

    </div>

    <br>

    <br>

    <div class="moz-cite-prefix">On 07/24/2015 10:52 AM, zhi chen wrote:<br>

    </div>

    <blockquote

cite="mid:CADq53TZELP7O18+68iW+Y2kvzZ2mN5S6roxVFd0U2dZZfRfFvQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>

          <div>------------------------------------ IR

            ------------------------------------------------------------------</div>

          <div>if.then.i.i.i.i.i.i:                              ; preds

            = %if.then4</div>

          <div>  %S25_D = zext <2 x i32> %splatLDS17_D.splat to

            <2 x i64></div>

          <div>  %umul_with_overflow.i.iS26_D = shl <2 x i64>

            %S25_D, <i64 3, i64 3></div>

          <div>  %extumul_with_overflow.i.iS26_D = extractelement <2

            x i64> %umul_with_overflow.i.iS26_D, i32 1</div>

          <div>  %call5.i.i = tail call noalias i8* @_Znam(i64

            %extumul_with_overflow.i.iS26_D) #22</div>

          <div>  %splatCallS27_D.splatinsert = insertelement <2 x

            i8*> undef, i8* %call5.i.i, i32 0</div>

          <div>  %splatCallS27_D.splat = shufflevector <2 x i8*>

            %splatCallS27_D.splatinsert, <2 x i8*> undef, <2 x

            i32> zeroinitializer</div>

          <div>  %bitcastS28_D = bitcast <2 x i8*>

            %splatCallS27_D.splat to <2 x double*></div>

          <div>  %extractS29_D = extractelement <2 x double*>

            %bitcastS28_D, i32 1</div>

          <div>  store double* %extractS29_D, double** %val.i.i, align 8</div>

          <div>  %val.i3.i.i = getelementptr inbounds %class.Vector*

            %__x, i64 0, i32 3</div>

          <div>  %4 = load double** %val.i3.i.i, align 8, !tbaa !22</div>

          <div>  %splatLDS31_D.splatinsert = insertelement <2 x

            double*> undef, double* %4, i32 0</div>

          <div>  %splatLDS31_D.splat = shufflevector <2 x double*>

            %splatLDS31_D.splatinsert, <2 x double*> undef, <2

            x i32> zeroinitializer</div>

          <div>  %bitcastS32_D = bitcast <2 x double*>

            %splatLDS31_D.splat to <2 x i8*></div>

          <div>  %extbitcastS32_D = extractelement <2 x i8*>

            %bitcastS32_D, i32 1</div>

          <div>  tail call void @llvm.memmove.p0i8.p0i8.i64(i8*

            %call5.i.i, i8* %extbitcastS32_D, i64

            %extumul_with_overflow.i.iS26_D, i32 8, i1 false) #9</div>

          <div>  br label %invoke.cont</div>

          <div><br>

          </div>

          <div>invoke.cont:                                      ; preds

            = %if.then.i.i.i.i.i.i, %if.then4</div>

          <div>  %sub.ptr.rhs.cast.i = ptrtoint %class.Vector*

            %__position.coerce to i64</div>

          <div>  %sub.ptr.rhs.cast.iS35_D = ptrtoint <2 x

            %class.Vector*> %splatInsMapS35_D.splat to <2 x

            i64></div>

          <div>  %sub.ptr.sub.iS36_D = sub <2 x i64>

            %sub.ptr.rhs.castS8_D, %sub.ptr.rhs.cast.iS35_D</div>

          <div>  %sub.ptr.div.iS37_D = sdiv <2 x i64>

            %sub.ptr.sub.iS36_D, <i64 24, i64 24></div>

          <div>  %extractS196_D = extractelement <2 x i64>

            %sub.ptr.div.iS37_D, i32 1</div>

          <div>  %cmp10S38_D = icmp ugt <2 x i64>

            %sub.ptr.div.iS37_D, %splatInsMapS1_D.splat</div>

          <div>  %zextS39_D = sext <2 x i1> %cmp10S38_D to <2 x

            i64></div>

          <div>  %BCS39_D = bitcast <2 x i64> %zextS39_D to i128</div>

          <div>  %mskS39_D = icmp ne i128 %BCS39_D, 0</div>

          <div>  br i1 %mskS39_D, label %if.then11, label %if.else</div>

        </div>

        <div><br>

        </div>

        <div>-------------------------------------------- Assembly

          -----------------------------------------------------------------</div>

        <div><br>

        </div>

        <div>

          <div># BB#3:                                 #

            %if.then.i.i.i.i.i.i</div>

          <div>    vpsllq  $3, %xmm0, %xmm0</div>

          <div>    vpextrq $1, %xmm0, %rbx</div>

          <div>    movq    %rbx, %rdi</div>

          <div>    vmovaps %xmm2, 96(%rsp)         # 16-byte Spill</div>

          <div>    vmovaps %xmm5, 64(%rsp)         # 16-byte Spill</div>

          <div>    vmovdqa %xmm6, 16(%rsp)         # 16-byte Spill</div>

          <div>    callq   _Znam</div>

          <div>    movq    %rax, 128(%rsp)</div>

          <div>    movq    16(%r12), %rsi</div>

          <div>    movq    %rax, %rdi</div>

          <div>    movq    %rbx, %rdx</div>

          <div>    callq   memmove</div>

          <div>    vmovdqa 16(%rsp), %xmm6         # 16-byte Reload</div>

          <div>    vmovaps 64(%rsp), %xmm5         # 16-byte Reload</div>

          <div>    vmovaps 96(%rsp), %xmm2         # 16-byte Reload</div>

          <div>    vmovdqa .LCPI582_0(%rip), %xmm4</div>

          <div>.LBB582_4:                              # %invoke.cont</div>

          <div>    vmovaps %xmm2, 96(%rsp)         # 16-byte Spill</div>

          <div>    vmovdqa 48(%rsp), %xmm0         # 16-byte Reload</div>

          <div>    vpsubq  %xmm0, %xmm2, %xmm0</div>

          <div>    vpextrq $1, %xmm0, %rax</div>

          <div>    movabsq $3074457345618258603, %rcx # imm =

            0x2AAAAAAAAAAAAAAB</div>

          <div>    imulq   %rcx</div>

          <div>    movq    %rdx, %rax</div>

          <div>    shrq    $63, %rax</div>

          <div>    sarq    $2, %rdx</div>

          <div>    addq    %rax, %rdx</div>

          <div>    vmovq   %rdx, %xmm1</div>

          <div>    vmovq   %xmm0, %rax</div>

          <div>    imulq   %rcx</div>

          <div>    movq    %rdx, %rax</div>

          <div>    shrq    $63, %rax</div>

          <div>    sarq    $2, %rdx</div>

          <div>    addq    %rax, %rdx</div>

          <div>    vmovq   %rdx, %xmm0</div>

        </div>

        <div>

          <div>    vpunpcklqdq %xmm1, %xmm0, %xmm1 # xmm1 =

            xmm0[0],xmm1[0]</div>

          <div>    vpxor   %xmm4, %xmm1, %xmm0</div>

          <div>    vpcmpgtq    %xmm6, %xmm0, %xmm0</div>

          <div>    vptest  %xmm0, %xmm0</div>

          <div>    je  .LBB582_49</div>

        </div>

        <div><br>

        </div>

        <div>Thanks,</div>

        <div>Zhi</div>

      </div>

      <div class="gmail_extra"><br>

        <div class="gmail_quote">On Fri, Jul 24, 2015 at 10:16 AM,

          Philip Reames <span dir="ltr"><<a moz-do-not-send="true"

              href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span>

          wrote:<br>

          <blockquote class="gmail_quote" style="margin:0 0 0

            .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>

            <br>

            On 07/24/2015 03:42 AM, Benjamin Kramer wrote:

            <div>

              <div class="h5"><br>

                <blockquote class="gmail_quote" style="margin:0 0 0

                  .8ex;border-left:1px #ccc solid;padding-left:1ex">

                  <blockquote class="gmail_quote" style="margin:0 0 0

                    .8ex;border-left:1px #ccc solid;padding-left:1ex">

                    On 24.07.2015, at 08:06, zhi chen <<a

                      moz-do-not-send="true"

                      href="mailto:zchenhn@gmail.com" target="_blank">zchenhn@gmail.com</a>>

                    wrote:<br>

                    <br>

                    It seems that it's hard to vectorize int64 in
                    LLVM. For example, LLVM 3.4 generates very
                    complicated code for the following IR. I am running
                    on a Haswell processor. Is it because there are no
                    alternative AVX/AVX2 instructions for int64? The same
                    thing also happens with zext <2 x i32> ->
                    <2 x i64> and trunc <2 x i64> ->
                    <2 x i32>. Any ideas on how to optimize these
                    instructions? Thanks.<br>

                    <br>

                    %sub.ptr.sub.i6.i.i.i.i = sub <2 x i64>

                    %sub.ptr.lhs.cast.i4.i.i.i.i,

                    %sub.ptr.rhs.cast.i5.i.i.i.i<br>

                    %sub.ptr.div.i7.i.i.i.i = sdiv <2 x i64>

                    %sub.ptr.sub.i6.i.i.i.i, <i64 24, i64 24><br>

                    <br>

                    Assembly:<br>

                         vpsubq  %xmm6, %xmm5, %xmm5<br>

                         vmovq   %xmm5, %rax<br>

                         movabsq $3074457345618258603, %rbx # imm =

                    0x2AAAAAAAAAAAAAAB<br>

                         imulq   %rbx<br>

                         movq    %rdx, %rcx<br>

                         movq    %rcx, %rax<br>

                         shrq    $63, %rax<br>

                         shrq    $2, %rcx<br>

                         addl    %eax, %ecx<br>

                         vpextrq $1, %xmm5, %rax<br>

                         imulq   %rbx<br>

                         movq    %rdx, %rax<br>

                         shrq    $63, %rax<br>

                         shrq    $2, %rdx<br>

                         addl    %eax, %edx<br>

                         movslq  %edx, %rax<br>

                         vmovq   %rax, %xmm5<br>

                         movslq  %ecx, %rax<br>

                         vmovq   %rax, %xmm6<br>

                         vpunpcklqdq %xmm5, %xmm6, %xmm5 # xmm5 =

                    xmm6[0],xmm5[0]<br>

                  </blockquote>

                  AVX2 doesn't have integer vector division instructions

                  and LLVM lowers divides by constants into (128 bit)

                  multiplies. However, AVX2 doesn't have a way to get to

                  the upper 64 bits of a 64x64->128 bit multiply

                  either, so LLVM uses the scalar imulq instruction to

                  do that. There's not much room to optimize here given

                  the limitations of AVX2.<br>
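                  <br>
                  As an illustration of that lowering, the imulq /
                  shrq $63 / sarq $2 / addq sequence seen in the larger
                  assembly listing in this thread can be modeled in
                  Python, whose unbounded integers represent the
                  128-bit product exactly.  This is a sketch only: the
                  function name is made up, and the constant is the one
                  from the movabsq instruction.<br>
<pre>
```python
MAGIC = 0x2AAAAAAAAAAAAAAB  # constant loaded by movabsq in the assembly


def sdiv24_mulhi(x):
    """Divide a signed 64-bit x by 24, truncating toward zero."""
    # imulq: signed 64x64 to 128-bit product; keep the high 64 bits (rdx).
    hi = (x * MAGIC) // 2 ** 64
    # shrq $63: the sign bit of the high half, 1 if negative and 0 otherwise.
    sign = 1 if 0 > hi else 0
    # sarq $2 (arithmetic shift), then addq to round toward zero.
    return hi // 4 + sign
```
</pre>
                  The only step that needs the upper half of the
                  product is the first one, which is the part that has
                  no 64-bit vector equivalent in AVX2.<br>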

                  <br>

                  You seem to be subtracting pointers though, so if you

                  can guarantee that the pointers are aligned you could

                  set the exact bit on your 'sdiv' instruction. That

                  should give better code.<br>
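                  <br>
                  A sketch of why the exact flag can help, assuming an
                  exact divide by 24 is lowered as an arithmetic shift
                  right by 3 followed by a multiply with the inverse of
                  3 modulo 2**64.  That lowering is an assumption and
                  has not been checked against what LLVM 3.4 emits, and
                  the function name is hypothetical.<br>
<pre>
```python
INV3 = 0xAAAAAAAAAAAAAAAB  # multiplicative inverse of 3 modulo 2**64


def sdiv24_exact(x):
    """Divide a signed 64-bit x by 24, valid only if x is a multiple of 24."""
    # Exact arithmetic shift right by 3 removes the factor of 8.
    shifted = x // 8
    # Multiplying by the inverse of 3 removes the factor of 3; only the
    # low 64 bits of the product are needed.
    low = (shifted * INV3) % 2 ** 64
    # Reinterpret the low 64 bits as a signed value.
    return low - 2 ** 64 if low >= 2 ** 63 else low
```
</pre>
                  Unlike the general case, nothing here needs the high
                  half of a 64x64 multiply.<br>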

                </blockquote>

              </div>

            </div>

            Depending on what you're using the result of the divide for,

            there might be optimizations which could be applied as

            well.  Can you give a slightly larger context for your

            source IR?  (1-2 levels of uses/defs out from the
            instructions would help.)<br>

            <blockquote class="gmail_quote" style="margin:0 0 0

              .8ex;border-left:1px #ccc solid;padding-left:1ex">

              <br>

              - Ben<br>

              <br>

              <br>


            </blockquote>

            <br>

          </blockquote>

        </div>

        <br>

      </div>

    </blockquote>

    <br>

  </body>

</html>