[PATCH] D31290: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality

Sanjay Patel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Mar 23 14:32:40 PDT 2017


spatel added inline comments.


================
Comment at: test/CodeGen/X86/memcmp.ll:104
+; CHECK-NEXT:    pmovmskb %xmm1, %eax
+; CHECK-NEXT:    cmpl $65535, %eax # imm = 0xFFFF
 ; CHECK-NEXT:    setne %al
----------------
spatel wrote:
> efriedma wrote:
> > What's the performance of this compared to using integer registers?  (movq+xorq+movq+xorq+orq).
> Hmm... I didn't consider that option, since movmsk has been fast for a long time and the scalar version always needs more ops. We'd also need to separate x86-32 from x86-64. I'll try to get some real numbers.
> 
I benchmarked the two sequences shown below and the memcmp libcall. On Haswell with macOS, I'm seeing more wobble in these numbers than I can explain, but:

memcmp     : 34485936 cycles for 1048576 iterations (32.89 cycles/iter).
vec cmp    : 5245888 cycles for 1048576 iterations (5.00 cycles/iter).
xor cmp    : 5247940 cycles for 1048576 iterations (5.00 cycles/iter).

On Ubuntu with AMD Jaguar:

memcmp     : 21150343 cycles for 1048576 iterations (20.17 cycles/iter).
vec cmp    : 9988395 cycles for 1048576 iterations (9.53 cycles/iter).
xor cmp    : 9471849 cycles for 1048576 iterations (9.03 cycles/iter).
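
The timing harness itself isn't included here; as a rough sketch only (assuming an RDTSC-based loop and hypothetical C prototypes for the two routines listed below), cycles/iteration numbers like these could be gathered along these lines:

  #include <stdint.h>
  #include <x86intrin.h>

  // Hypothetical C prototypes for the asm routines below; on macOS the
  // leading underscore in the asm labels is just the C symbol mangling.
  extern int cmp16vec(const void *a, const void *b);
  extern int cmp16scalar(const void *a, const void *b);

  // Total cycles for 'iters' back-to-back calls; divide by iters to get a
  // cycles/iter figure like the ones reported above.
  static uint64_t time_it(int (*cmp)(const void *, const void *),
                          const void *a, const void *b, int iters) {
    volatile int sink = 0;
    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; ++i)
      sink ^= cmp(a, b);
    return __rdtsc() - start;
  }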


  .align  6, 0x90
  .global _cmp16vec
  _cmp16vec:                # vector 16-byte compare: returns 0 if equal, 1 if not
  movdqu (%rsi), %xmm0
  movdqu (%rdi), %xmm1
  pcmpeqb %xmm0, %xmm1      # 0xFF in each byte position that matches
  pmovmskb %xmm1, %eax      # collapse the byte masks into a 16-bit mask
  cmpl $65535, %eax         # all 16 bytes equal <=> mask == 0xFFFF
  setne %al
  movzbl  %al, %eax
  retq

  .align  6, 0x90
  .global _cmp16scalar
  _cmp16scalar:             # scalar 16-byte compare: returns 0 if equal, 1 if not
  movq  (%rsi), %rax
  movq  8(%rsi), %rcx
  xorq  (%rdi), %rax        # nonzero iff the low 8 bytes differ
  xorq  8(%rdi), %rcx       # nonzero iff the high 8 bytes differ
  orq %rax, %rcx            # ZF set only if both halves match
  setne %al
  movzbl  %al, %eax
  retq
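
For context, the source pattern this lowering targets is an equality-only 16-byte memcmp, e.g. (an illustration, not the patch's actual test case):

  #include <string.h>

  // An equality-only, length-16 memcmp: with this patch it can be lowered
  // to the pcmpeqb+pmovmskb sequence above instead of a libcall.
  int is_equal16(const void *p, const void *q) {
    return memcmp(p, q, 16) == 0;
  }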



https://reviews.llvm.org/D31290