[PATCH] D21560: Relax the clearance calculating for breaking partial register dependency.
Sanjay Patel via llvm-commits
llvm-commits at lists.llvm.org
Thu Jun 23 08:42:46 PDT 2016
spatel added a comment.
In http://reviews.llvm.org/D21560#465070, @danielcdh wrote:
> The performance of the attached testcase running on Haswell (to reproduce, build with -O2 -fno-tree-vectorize):
I missed -fno-tree-vectorize in my earlier experiment. With that setting, I am able to reproduce the Haswell win. This is at a nominal 4GHz:
lessxor: user 0m2.977s
morexor: user 0m2.068s
I would expect all out-of-order (OoO) SSE machines to have the same problem, and testing on AMD Jaguar generally confirms that. In this particular case, however, performance gets worse. This is at a nominal 1.5GHz:
lessxor: user 0m11.916s
morexor: user 0m12.795s
I don't have an explanation for that loss yet. The loop is optimally aligned on a 64-byte boundary. The extra xorps adds 3 bytes, growing the inner loop from an even 80 bytes to 83 bytes. Do the extra bytes require an additional instruction fetch that is somehow slowing down the whole chain? (For example, if the front end fetches in aligned 16-byte windows, 80 bytes fit in exactly 5 windows, while 83 bytes spill into a 6th.)
Note that the partial update problem is limited to SSE codegen. If we use -mavx, we generate the 'v' versions of the conversion instructions. Those take their upper elements from an explicit source register rather than implicitly from the destination, so they do not have partial register update problems, and we don't need any (v)xorps instructions in the loop. Performance of the AVX versions of the program is slightly better than the best SSE case on both CPUs.
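To make the SSE merge semantics concrete, here is a minimal intrinsics sketch (my illustration, not from the attached testcase): _mm_cvtsi32_ss merges the upper lanes from its first operand, so breaking the dependency means zeroing that operand first, which is exactly the xorps + cvtsi2ssl pattern in the loop below.

  #include <xmmintrin.h>

  /* cvtsi2ss writes only the low float lane; the upper lanes are merged
     from the first operand. If that operand is whatever happened to be
     in the register, the convert must wait for it (a false dependency). */
  float convert_merge(__m128 stale, int i) {
      return _mm_cvtss_f32(_mm_cvtsi32_ss(stale, i));
  }

  /* Zeroing the merge source first breaks the dependency; compilers
     typically emit this as xorps + cvtsi2ss. */
  float convert_clean(int i) {
      return _mm_cvtss_f32(_mm_cvtsi32_ss(_mm_setzero_ps(), i));
  }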
Here's the asm code for the inner loop that I'm testing with:
LBB0_4:                               ## %for.body6
                                      ##   Parent Loop BB0_2 Depth=1
                                      ## =>  This Inner Loop Header: Depth=2
        movl      (%rbx), %eax
        movzbl    %ah, %edi           # NOREX
        xorps     %xmm4, %xmm4        <--- this gets generated with the current '16' clearance setting
        cvtsi2ssl %edi, %xmm4
        mulss     %xmm0, %xmm4
        movl      %eax, %edi
        shrl      $16, %edi
        movzbl    %dil, %edi
        xorps     %xmm5, %xmm5        <--- this is the added instruction generated by this patch
        cvtsi2ssl %edi, %xmm5
        mulss     %xmm1, %xmm5
        addss     %xmm4, %xmm5
        shrl      $24, %eax
        xorps     %xmm4, %xmm4        <--- this gets generated with the current '16' clearance setting
        cvtsi2ssl %eax, %xmm4
        mulss     %xmm2, %xmm4
        addss     %xmm5, %xmm4
        cvtss2sd  %xmm4, %xmm4
        addsd     %xmm3, %xmm4
        cvttsd2si %xmm4, %eax
        movb      %al, (%rsi)
        addq      $4, %rbx
        incq      %rsi
        decl      %r15d
        jne       LBB0_4
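For reference, the asm above is consistent with a source loop along these lines. This is a hypothetical reconstruction, not the attached testcase; the function name and the weights are made up, but it shows where the three int-to-float conversions (and hence the xorps instructions) come from. Build with -O2 -fno-tree-vectorize as noted above.

  /* Hypothetical sketch: each of the three byte->float conversions
     becomes a cvtsi2ssl, the float sum is widened to double, offset,
     truncated, and stored back as a byte -- matching the loop above. */
  void convert_row(const unsigned char *src, unsigned char *dst, int n) {
      for (int i = 0; i < n; ++i) {
          float a = src[4*i + 1] * 0.299f;  /* weights are made up */
          float b = src[4*i + 2] * 0.587f;
          float c = src[4*i + 3] * 0.114f;
          dst[i] = (unsigned char)((double)(a + b + c) + 0.5);
      }
  }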
http://reviews.llvm.org/D21560