[PATCH] D21560: Relax the clearance calculating for breaking partial register dependency.

Thu Jun 23 08:42:46 PDT 2016

spatel added a comment.

In http://reviews.llvm.org/D21560#465070, @danielcdh wrote:

> The performance of attached testcase running on haswell: (to reproduce, need to build with  -O2 -fno-tree-vectorize)

I missed -fno-tree-vectorize in my earlier experiment. With that setting, I am able to reproduce the Haswell win. This is at nominal 4GHz:

  lessxor: user 0m2.977s
  morexor: user 0m2.068s

I would expect all OoO SSE machines to have the same problem, and testing on AMD Jaguar generally shows that. But in this particular case, performance gets worse. This is at nominal 1.5GHz:

  lessxor: user 0m11.916s
  morexor: user 0m12.795s

I don't have an explanation for that loss yet. The loop is optimally aligned on a 64-byte boundary. The extra xorps adds 3 bytes, causing the inner loop to grow from an even 80 bytes to 83 bytes. Perhaps the extra bytes require an additional ifetch operation that is somehow slowing down the whole chain?
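
As a rough sketch of the fetch arithmetic behind that guess: the helper below counts how many aligned fetch blocks a code region touches. The name fetch_windows and the 16-byte window used in the note afterwards are assumptions made for illustration, not documented Jaguar parameters.

```c
#include <assert.h>
#include <stddef.h>

/* Count the aligned `window`-byte fetch blocks touched by a code region
 * that starts at byte address `start` and is `size` bytes long.
 * `window` is a hypothetical fetch granularity chosen by the caller. */
size_t fetch_windows(size_t start, size_t size, size_t window)
{
    assert(window > 0 && size > 0);
    size_t first = start / window;              /* block holding the first byte */
    size_t last = (start + size - 1) / window;  /* block holding the last byte */
    return last - first + 1;
}
```

With a 16-byte window and a loop starting on a 64-byte boundary, fetch_windows(64, 80, 16) is 5 and fetch_windows(64, 83, 16) is 6, which is the "additional ifetch" case. With a 32-byte window both sizes need 3 blocks, so whether this explains the slowdown depends on the real fetch width.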

Note that the partial update problem is limited to SSE codegen. If we use -mavx, we generate the 'v' versions of the conversion instructions. Those do not have partial reg update problems, and so we don't need any (v)xorps instructions in the loop. Performance for the AVX versions of the program is slightly better than the best SSE case for both CPUs.

Here's the asm code for the inner loop that I'm testing with:

  LBB0_4:                                 ## %for.body6
                                          ##   Parent Loop BB0_2 Depth=1
                                          ## =>  This Inner Loop Header: Depth=2
    movl  (%rbx), %eax
    movzbl  %ah, %edi  # NOREX
    xorps %xmm4, %xmm4  <--- this gets generated with the current '16' clearance setting
    cvtsi2ssl %edi, %xmm4
    mulss %xmm0, %xmm4
    movl  %eax, %edi
    shrl  $16, %edi
    movzbl  %dil, %edi
    xorps %xmm5, %xmm5  <--- this is the added instruction generated by this patch
    cvtsi2ssl %edi, %xmm5
    mulss %xmm1, %xmm5
    addss %xmm4, %xmm5
    shrl  $24, %eax
    xorps %xmm4, %xmm4  <--- this gets generated with the current '16' clearance setting
    cvtsi2ssl %eax, %xmm4
    mulss %xmm2, %xmm4
    addss %xmm5, %xmm4
    cvtss2sd  %xmm4, %xmm4
    addsd %xmm3, %xmm4
    cvttsd2si %xmm4, %eax
    movb  %al, (%rsi)
    addq  $4, %rbx
    incq  %rsi
    decl  %r15d
    jne LBB0_4
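
For readers who would rather not decode the asm, here is a C sketch of what the loop body appears to compute. It is reconstructed from the instructions above, not taken from the attached testcase. The function names and the parameters c1, c2, c3 and bias (standing in for the loop-invariant values in xmm0, xmm1, xmm2 and xmm3) are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One iteration: weight bytes 1, 2 and 3 of a packed 32-bit value, sum
 * them in single precision, add a double bias, truncate, keep the low byte.
 * Byte 0 is never read by the asm. Hypothetical reconstruction. */
uint8_t convert_pixel(uint32_t px, float c1, float c2, float c3, double bias)
{
    float a = (float)((px >> 8) & 0xff) * c1;   /* movzbl %ah; cvtsi2ssl; mulss %xmm0 */
    float b = (float)((px >> 16) & 0xff) * c2;  /* shrl $16; movzbl; cvtsi2ssl; mulss %xmm1 */
    b = b + a;                                  /* addss %xmm4, %xmm5 */
    float c = (float)(px >> 24) * c3;           /* shrl $24; cvtsi2ssl; mulss %xmm2 */
    c = c + b;                                  /* addss %xmm5, %xmm4 */
    double d = (double)c + bias;                /* cvtss2sd; addsd %xmm3 */
    return (uint8_t)(int)d;                     /* cvttsd2si; movb %al */
}

/* The loop: read 4 bytes per element, write 1 byte per element. */
void convert_pixels(const uint32_t *src, uint8_t *dst, size_t n,
                    float c1, float c2, float c3, double bias)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = convert_pixel(src[i], c1, c2, c3, bias);
}
```

Each of the three int-to-float conversions maps to one cvtsi2ssl that writes only the low lane of its destination register, which is why each one is a candidate for a dependency-breaking xorps in the SSE codegen.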

http://reviews.llvm.org/D21560