[PATCH] D31290: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
Sanjay Patel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Mar 23 15:36:29 PDT 2017
spatel added inline comments.
================
Comment at: test/CodeGen/X86/memcmp.ll:104
+; CHECK-NEXT: pmovmskb %xmm1, %eax
+; CHECK-NEXT: cmpl $65535, %eax # imm = 0xFFFF
; CHECK-NEXT: setne %al
----------------
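For reference, the CHECK lines above cover the 16-byte equality case: a memcmp whose
result is only tested against zero, with a constant length of 16. A hypothetical C
source pattern that would hit this transform (the function name is illustrative, not
taken from the test file) is roughly:

    #include <string.h>

    /* The memcmp result is only compared against zero and the length is a
       constant 16, so codegen can use movdqu + pcmpeqb + pmovmskb in place
       of the libcall, as the CHECK lines above show. */
    int is16eq(const void *x, const void *y) {
      return memcmp(x, y, 16) != 0;
    }
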
spatel wrote:
> spatel wrote:
> > efriedma wrote:
> > > What's the performance of this compared to using integer registers? (movq+xorq+movq+xorq+orq).
> > Hmm... I didn't consider that option, since movmsk has been fast for a long time and the scalar sequence always needs more ops. We'd also need to separate x86-32 from x86-64. I'll try to get some real numbers.
> >
> I benchmarked the 2 sequences shown below and the libcall. On Haswell with macOS, I'm seeing more wobble in these numbers than I can explain, but:
>
> memcmp : 34485936 cycles for 1048576 iterations (32.89 cycles/iter).
> vec cmp : 5245888 cycles for 1048576 iterations (5.00 cycles/iter).
> xor cmp : 5247940 cycles for 1048576 iterations (5.00 cycles/iter).
>
> On Ubuntu with AMD Jaguar:
>
> memcmp : 21150343 cycles for 1048576 iterations (20.17 cycles/iter).
> vec cmp : 9988395 cycles for 1048576 iterations (9.53 cycles/iter).
> xor cmp : 9471849 cycles for 1048576 iterations (9.03 cycles/iter).
>
>
> .align 6, 0x90
> .global _cmp16vec
> _cmp16vec:
>     movdqu (%rsi), %xmm0        # unaligned 16-byte load of the second buffer
>     movdqu (%rdi), %xmm1        # unaligned 16-byte load of the first buffer
>     pcmpeqb %xmm0, %xmm1        # bytewise compare: lanes become 0xFF where equal
>     pmovmskb %xmm1, %eax        # collapse the lane mask into a 16-bit integer
>     cmpl $65535, %eax           # all bytes equal <=> mask == 0xFFFF
>     setne %al                   # return 1 if any byte differs
>     movzbl %al, %eax
>     retq
>
> .align 6, 0x90
> .global _cmp16scalar
> _cmp16scalar:
>     movq (%rsi), %rax           # low 8 bytes of the second buffer
>     movq 8(%rsi), %rcx          # high 8 bytes of the second buffer
>     xorq (%rdi), %rax           # xor against the first buffer: nonzero on mismatch
>     xorq 8(%rdi), %rcx
>     orq %rax, %rcx              # ZF is set only if all 16 bytes match
>     setne %al                   # return 1 if any byte differs
>     movzbl %al, %eax
>     retq
>
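> Roughly the shape of the rdtsc-based timing loop behind the numbers above; the
> details here (the cmp16vec declaration, buffer contents, iteration count, and
> reporting) are simplified sketches for illustration, not the exact harness:
>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <x86intrin.h>              /* __rdtsc */
>
>     /* The _cmp16vec routine above; on macOS the C name drops the leading
>        underscore. Link the assembled .s file with this harness. */
>     extern int cmp16vec(const void *a, const void *b);
>
>     int main(void) {
>       unsigned char a[16], b[16];
>       memset(a, 0xAB, sizeof a);
>       memcpy(b, a, sizeof b);            /* equal buffers */
>
>       const uint64_t iters = 1048576;    /* same iteration count as above */
>       volatile int sink = 0;             /* keep the loop from being discarded */
>
>       uint64_t start = __rdtsc();
>       for (uint64_t i = 0; i < iters; ++i)
>         sink ^= cmp16vec(a, b);
>       uint64_t cycles = __rdtsc() - start;
>
>       printf("vec cmp : %llu cycles for %llu iterations (%.2f cycles/iter).\n",
>              (unsigned long long)cycles, (unsigned long long)iters,
>              (double)cycles / (double)iters);
>       return sink;
>     }
>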
There will be bugs:
https://bugs.llvm.org/show_bug.cgi?id=32401
https://reviews.llvm.org/D31290