<div dir="ltr">Thanks for your help, Philip. The shufflevector could be replaced with insertelement, but I don't think that would improve performance much. I think the shufflevector would generate a broadcast or a movlhps/d instruction. <div>The performance bottleneck here is that AVX2 doesn't have SIMD division for integers. The code needs to unpack and repack the vector register and use many shift instructions. It also might use two int division instructions. </div><div><br></div><div>Thanks,</div><div>Zhi <br><div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Jul 24, 2015 at 11:32 AM, Philip Reames <span dir="ltr"><<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<div bgcolor="#FFFFFF" text="#000000">
This snippet of IR is interesting:<span class=""><br>
<div> %sub.ptr.div.iS37_D = sdiv <2 x i64>
%sub.ptr.sub.iS36_D, <i64 24, i64 24></div>
</span><span class=""><div> %cmp10S38_D = icmp ugt <2 x i64> %sub.ptr.div.iS37_D,
%splatInsMapS1_D.splat</div>
<div> %zextS39_D = sext <2 x i1> %cmp10S38_D to <2 x
i64></div>
<div> %BCS39_D = bitcast <2 x i64> %zextS39_D to i128</div>
<div> %mskS39_D = icmp ne i128 %BCS39_D, 0</div>
</span><div><span class=""> br i1 %mskS39_D, label %if.then11, label %if.else<br>
<br></span>
It looks like %mskS39_D is basically a test of whether either of
the vector elements produced by the divide is ugt the
corresponding element of %splatInsMapS1_D.splat. I can't say for
sure that we can do better here - I haven't studied our vector
canonicalization rules enough - but
this seems like something which could possibly be improved. <br>
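<br>
To make the reading of that sequence concrete, here is a minimal Python sketch of what the icmp ugt / sext / bitcast / icmp ne chain computes. The function name and the list-of-lanes representation are illustrative assumptions, not anything taken from LLVM itself.<br>

```python
MASK64 = (1 << 64) - 1

def any_lane_ugt(div_lanes, splat_lanes):
    """Model of: icmp ugt <2 x i64>, sext to <2 x i64>, bitcast to i128, icmp ne 0."""
    wide = 0
    for i, (a, b) in enumerate(zip(div_lanes, splat_lanes)):
        # icmp ugt compares the lanes as unsigned 64-bit values.
        a_u = a & MASK64
        b_u = b & MASK64
        # sext <2 x i1> to <2 x i64>: true becomes all ones, false becomes zero.
        lane = MASK64 if a_u > b_u else 0
        # bitcast <2 x i64> to i128: place each lane in its own 64-bit slot.
        wide |= lane << (64 * i)
    # icmp ne i128 %BCS39_D, 0: true if at least one lane compared greater.
    return wide != 0
```

Note that a negative quotient counts as a very large unsigned value, so any_lane_ugt([-1, 0], [5, 5]) is true in this model.<br>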
<br>
This is interesting:<span class=""><br>
<div> %splatCallS27_D.splatinsert = insertelement <2 x i8*>
undef, i8* %call5.i.i, i32 0</div>
<div> %splatCallS27_D.splat = shufflevector <2 x i8*>
%splatCallS27_D.splatinsert, <2 x i8*> undef, <2 x
i32> zeroinitializer</div>
<br></span>
Can't that shufflevector be replaced with:<br>
%splatCallS27_D.splat = insertelement <2 x i8*>
%splatCallS27_D.splatinsert , i8* %call5.i.i, i32 1<br>
<br>
Again, without knowledge of how we canonicalize such things, not
necessarily a win. Just suspicious. <br>
<br>
The bitcast/extractelement sequence following that shufflevector
is also interesting. It looks like it could be rewritten in terms
of the i8* %call5.i.i and a bitcast. <br>
</div><div><div class="h5">
<br>
<br>
<div>On 07/24/2015 10:52 AM, zhi chen wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>
<div>------------------------------------ IR
------------------------------------------------------------------</div>
<div>if.then.i.i.i.i.i.i: ; preds
= %if.then4</div>
<div> %S25_D = zext <2 x i32> %splatLDS17_D.splat to
<2 x i64></div>
<div> %umul_with_overflow.i.iS26_D = shl <2 x i64>
%S25_D, <i64 3, i64 3></div>
<div> %extumul_with_overflow.i.iS26_D = extractelement <2
x i64> %umul_with_overflow.i.iS26_D, i32 1</div>
<div> %call5.i.i = tail call noalias i8* @_Znam(i64
%extumul_with_overflow.i.iS26_D) #22</div>
<div> %splatCallS27_D.splatinsert = insertelement <2 x
i8*> undef, i8* %call5.i.i, i32 0</div>
<div> %splatCallS27_D.splat = shufflevector <2 x i8*>
%splatCallS27_D.splatinsert, <2 x i8*> undef, <2 x
i32> zeroinitializer</div>
<div> %bitcastS28_D = bitcast <2 x i8*>
%splatCallS27_D.splat to <2 x double*></div>
<div> %extractS29_D = extractelement <2 x double*>
%bitcastS28_D, i32 1</div>
<div> store double* %extractS29_D, double** %val.i.i, align 8</div>
<div> %val.i3.i.i = getelementptr inbounds %class.Vector*
%__x, i64 0, i32 3</div>
<div> %4 = load double** %val.i3.i.i, align 8, !tbaa !22</div>
<div> %splatLDS31_D.splatinsert = insertelement <2 x
double*> undef, double* %4, i32 0</div>
<div> %splatLDS31_D.splat = shufflevector <2 x double*>
%splatLDS31_D.splatinsert, <2 x double*> undef, <2
x i32> zeroinitializer</div>
<div> %bitcastS32_D = bitcast <2 x double*>
%splatLDS31_D.splat to <2 x i8*></div>
<div> %extbitcastS32_D = extractelement <2 x i8*>
%bitcastS32_D, i32 1</div>
<div> tail call void @llvm.memmove.p0i8.p0i8.i64(i8*
%call5.i.i, i8* %extbitcastS32_D, i64
%extumul_with_overflow.i.iS26_D, i32 8, i1 false) #9</div>
<div> br label %invoke.cont</div>
<div><br>
</div>
<div>invoke.cont: ; preds
= %if.then.i.i.i.i.i.i, %if.then4</div>
<div> %sub.ptr.rhs.cast.i = ptrtoint %class.Vector*
%__position.coerce to i64</div>
<div> %sub.ptr.rhs.cast.iS35_D = ptrtoint <2 x
%class.Vector*> %splatInsMapS35_D.splat to <2 x
i64></div>
<div> %sub.ptr.sub.iS36_D = sub <2 x i64>
%sub.ptr.rhs.castS8_D, %sub.ptr.rhs.cast.iS35_D</div>
<div> %sub.ptr.div.iS37_D = sdiv <2 x i64>
%sub.ptr.sub.iS36_D, <i64 24, i64 24></div>
<div> %extractS196_D = extractelement <2 x i64>
%sub.ptr.div.iS37_D, i32 1</div>
<div> %cmp10S38_D = icmp ugt <2 x i64>
%sub.ptr.div.iS37_D, %splatInsMapS1_D.splat</div>
<div> %zextS39_D = sext <2 x i1> %cmp10S38_D to <2 x
i64></div>
<div> %BCS39_D = bitcast <2 x i64> %zextS39_D to i128</div>
<div> %mskS39_D = icmp ne i128 %BCS39_D, 0</div>
<div> br i1 %mskS39_D, label %if.then11, label %if.else</div>
</div>
<div><br>
</div>
<div>-------------------------------------------- Assembly
-----------------------------------------------------------------</div>
<div><br>
</div>
<div>
<div># BB#3: #
%if.then.i.i.i.i.i.i</div>
<div> vpsllq $3, %xmm0, %xmm0</div>
<div> vpextrq $1, %xmm0, %rbx</div>
<div> movq %rbx, %rdi</div>
<div> vmovaps %xmm2, 96(%rsp) # 16-byte Spill</div>
<div> vmovaps %xmm5, 64(%rsp) # 16-byte Spill</div>
<div> vmovdqa %xmm6, 16(%rsp) # 16-byte Spill</div>
<div> callq _Znam</div>
<div> movq %rax, 128(%rsp)</div>
<div> movq 16(%r12), %rsi</div>
<div> movq %rax, %rdi</div>
<div> movq %rbx, %rdx</div>
<div> callq memmove</div>
<div> vmovdqa 16(%rsp), %xmm6 # 16-byte Reload</div>
<div> vmovaps 64(%rsp), %xmm5 # 16-byte Reload</div>
<div> vmovaps 96(%rsp), %xmm2 # 16-byte Reload</div>
<div> vmovdqa .LCPI582_0(%rip), %xmm4</div>
<div>.LBB582_4: # %invoke.cont</div>
<div> vmovaps %xmm2, 96(%rsp) # 16-byte Spill</div>
<div> vmovdqa 48(%rsp), %xmm0 # 16-byte Reload</div>
<div> vpsubq %xmm0, %xmm2, %xmm0</div>
<div> vpextrq $1, %xmm0, %rax</div>
<div> movabsq $3074457345618258603, %rcx # imm =
0x2AAAAAAAAAAAAAAB</div>
<div> imulq %rcx</div>
<div> movq %rdx, %rax</div>
<div> shrq $63, %rax</div>
<div> sarq $2, %rdx</div>
<div> addq %rax, %rdx</div>
<div> vmovq %rdx, %xmm1</div>
<div> vmovq %xmm0, %rax</div>
<div> imulq %rcx</div>
<div> movq %rdx, %rax</div>
<div> shrq $63, %rax</div>
<div> sarq $2, %rdx</div>
<div> addq %rax, %rdx</div>
<div> vmovq %rdx, %xmm0</div>
</div>
<div>
<div> vpunpcklqdq %xmm1, %xmm0, %xmm1 # xmm1 =
xmm0[0],xmm1[0]</div>
<div> vpxor %xmm4, %xmm1, %xmm0</div>
<div> vpcmpgtq %xmm6, %xmm0, %xmm0</div>
<div> vptest %xmm0, %xmm0</div>
<div> je .LBB582_49</div>
</div>
<div><br>
</div>
<div>Thanks,</div>
<div>Zhi</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On Fri, Jul 24, 2015 at 10:16 AM,
Philip Reames <span dir="ltr"><<a href="mailto:listmail@philipreames.com" target="_blank">listmail@philipreames.com</a>></span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br>
<br>
On 07/24/2015 03:42 AM, Benjamin Kramer wrote:
<div>
<div><br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
On 24.07.2015, at 08:06, zhi chen <<a href="mailto:zchenhn@gmail.com" target="_blank">zchenhn@gmail.com</a>>
wrote:<br>
<br>
It seems that it's hard to vectorize int64 in
LLVM. For example, LLVM 3.4 generates very
complicated code for the following IR. I am running
on a Haswell processor. Is it because there are no
suitable AVX/AVX2 instructions for int64? The same
thing also happens with zext <2 x i32> ->
<2 x i64> and trunc <2 x i64> ->
<2 x i32>. Any ideas for optimizing these
instructions? Thanks.<br>
<br>
%sub.ptr.sub.i6.i.i.i.i = sub <2 x i64>
%sub.ptr.lhs.cast.i4.i.i.i.i,
%sub.ptr.rhs.cast.i5.i.i.i.i<br>
%sub.ptr.div.i7.i.i.i.i = sdiv <2 x i64>
%sub.ptr.sub.i6.i.i.i.i, <i64 24, i64 24><br>
<br>
Assembly:<br>
vpsubq %xmm6, %xmm5, %xmm5<br>
vmovq %xmm5, %rax<br>
movabsq $3074457345618258603, %rbx # imm =
0x2AAAAAAAAAAAAAAB<br>
imulq %rbx<br>
movq %rdx, %rcx<br>
movq %rcx, %rax<br>
shrq $63, %rax<br>
shrq $2, %rcx<br>
addl %eax, %ecx<br>
vpextrq $1, %xmm5, %rax<br>
imulq %rbx<br>
movq %rdx, %rax<br>
shrq $63, %rax<br>
shrq $2, %rdx<br>
addl %eax, %edx<br>
movslq %edx, %rax<br>
vmovq %rax, %xmm5<br>
movslq %ecx, %rax<br>
vmovq %rax, %xmm6<br>
vpunpcklqdq %xmm5, %xmm6, %xmm5 # xmm5 =
xmm6[0],xmm5[0]<br>
</blockquote>
AVX2 doesn't have integer vector division instructions
and LLVM lowers divides by constants into (128 bit)
multiplies. However, AVX2 doesn't have a way to get to
the upper 64 bits of a 64x64->128 bit multiply
either, so LLVM uses the scalar imulq instruction to
do that. There's not much room to optimize here given
the limitations of AVX2.<br>
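<br>
As a rough illustration of the multiply-based lowering, the following Python sketch mirrors the scalar imulq / shrq $63 / sarq $2 / addq sequence from the later listing in this thread for a single 64-bit lane. The constant is the one loaded by movabsq; the function names are made up for this example, and Python's unbounded integers stand in for the 128-bit product.<br>

```python
MAGIC = 0x2AAAAAAAAAAAAAAB  # constant loaded by movabsq in the assembly listing

def to_signed64(v):
    """Interpret the low 64 bits of v as a two's-complement signed value."""
    v &= (1 << 64) - 1
    return v - (1 << 64) if v >> 63 else v

def sdiv24(x):
    """Truncating signed division of a 64-bit value x by 24, without a divide."""
    # imulq: signed 64x64 -> 128 multiply; the upper 64 bits land in rdx.
    hi = (x * MAGIC) >> 64
    # shrq $63: extract the sign bit of the high half (1 if negative).
    sign = (hi >> 63) & 1
    # sarq $2 then addq: arithmetic shift, then round toward zero.
    return to_signed64((hi >> 2) + sign)
```

The high half of the product is the part that has no 64-bit vector equivalent in AVX2, which is why each lane goes through a scalar register.<br>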
<br>
You seem to be subtracting pointers though, so if you
can guarantee that the pointers are aligned you could
set the exact bit on your 'sdiv' instruction. That
should give better code.<br>
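<br>
To sketch why the exact flag can help: when the dividend is known to be a multiple of 24, the division can in principle be done with a shift and a wrapping multiply by the inverse of 3 modulo 2**64, with no high-half multiply. The Python below only illustrates that arithmetic for one lane; the names are invented for the example, and it makes no claim about the exact instructions a particular LLVM version emits.<br>

```python
MASK64 = (1 << 64) - 1
INV3 = 0xAAAAAAAAAAAAAAAB  # multiplicative inverse of 3 modulo 2**64

def sdiv24_exact(x):
    """Divide x by 24; only valid when x is an exact multiple of 24."""
    # 24 = 8 * 3: an arithmetic shift right by 3 removes the factor 8.
    t = x >> 3
    # A wrapping 64-bit multiply by the inverse of 3 removes the factor 3.
    q = (t * INV3) & MASK64
    # Reinterpret the 64-bit result as a signed value.
    return q - (1 << 64) if q >> 63 else q
```

If x is not a multiple of 24 the result is meaningless, which matches the IR semantics where an exact sdiv with a nonzero remainder yields a poison value.<br>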
</blockquote>
</div>
</div>
Depending on what you're using the result of the divide for,
there might be other optimizations which could be applied as
well. Can you give a slightly larger context for your
source IR? (1-2 levels of uses/defs out from the
instructions would help)<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
- Ben<br>
<br>
<br>
</blockquote>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</div></div></div>
</blockquote></div><br></div></div></div></div>