[PATCH] D31290: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
Sanjay Patel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Mar 23 14:32:40 PDT 2017
spatel added inline comments.
================
Comment at: test/CodeGen/X86/memcmp.ll:104
+; CHECK-NEXT: pmovmskb %xmm1, %eax
+; CHECK-NEXT: cmpl $65535, %eax # imm = 0xFFFF
; CHECK-NEXT: setne %al
----------------
spatel wrote:
> efriedma wrote:
> > What's the performance of this compared to using integer registers? (movq+xorq+movq+xorq+orq).
> Hmm...didn't consider that option since movmsk has been fast for a long time and scalar always needs more ops. We'd need to separate x86-32 from x86-64 too. I'll try to get some real numbers.
>
I benchmarked the two sequences shown below and the libcall. On Haswell with macOS, I'm seeing more wobble in these numbers than I can explain, but:
  memcmp  : 34485936 cycles for 1048576 iterations (32.89 cycles/iter).
  vec cmp :  5245888 cycles for 1048576 iterations (5.00 cycles/iter).
  xor cmp :  5247940 cycles for 1048576 iterations (5.00 cycles/iter).
On Ubuntu with an AMD Jaguar:
  memcmp  : 21150343 cycles for 1048576 iterations (20.17 cycles/iter).
  vec cmp :  9988395 cycles for 1048576 iterations (9.53 cycles/iter).
  xor cmp :  9471849 cycles for 1048576 iterations (9.03 cycles/iter).
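The timing harness isn't included in this message; a minimal rdtsc-based sketch of that kind of loop (calling the two routines listed below plus the libcall; names and buffer setup here are illustrative, not the exact code used) would look like:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtsc */

/* The two routines listed below; the leading underscore in the asm labels is
   Mach-O name mangling, so plain C names are assumed here. */
extern int cmp16vec(const void *a, const void *b);
extern int cmp16scalar(const void *a, const void *b);

/* Hypothetical wrapper so the libcall goes through the same loop. */
static int cmp16memcmp(const void *a, const void *b) {
  return memcmp(a, b, 16) != 0;
}

enum { ITERS = 1048576 };

static void bench(const char *name, int (*cmp)(const void *, const void *),
                  const void *a, const void *b) {
  volatile int sink = 0;                 /* keep the calls from being optimized away */
  uint64_t start = __rdtsc();
  for (int i = 0; i < ITERS; ++i)
    sink += cmp(a, b);
  uint64_t cycles = __rdtsc() - start;
  printf("%s : %llu cycles for %d iterations (%.2f cycles/iter).\n", name,
         (unsigned long long)cycles, ITERS, (double)cycles / ITERS);
  (void)sink;
}

int main(void) {
  char a[16], b[16];
  memset(a, 0x5a, sizeof(a));
  memcpy(b, a, sizeof(b));               /* equal buffers, so no early-out helps memcmp */
  bench("memcmp ", cmp16memcmp, a, b);
  bench("vec cmp", cmp16vec, a, b);
  bench("xor cmp", cmp16scalar, a, b);
  return 0;
}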
        .align  6, 0x90
        .global _cmp16vec
_cmp16vec:
        movdqu  (%rsi), %xmm0
        movdqu  (%rdi), %xmm1
        pcmpeqb %xmm0, %xmm1
        pmovmskb %xmm1, %eax
        cmpl    $65535, %eax
        setne   %al
        movzbl  %al, %eax
        retq
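
For reference, a rough C equivalent of that vector sequence using SSE2 intrinsics (a sketch for readability, not what the patch emits directly):

#include <emmintrin.h>   /* SSE2 */

/* Rough intrinsics equivalent of _cmp16vec above. */
int cmp16vec_c(const void *a, const void *b) {
  __m128i va = _mm_loadu_si128((const __m128i *)a);  /* movdqu */
  __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* movdqu */
  __m128i eq = _mm_cmpeq_epi8(va, vb);               /* pcmpeqb */
  return _mm_movemask_epi8(eq) != 0xFFFF;            /* pmovmskb + cmpl + setne */
}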
        .align  6, 0x90
        .global _cmp16scalar
_cmp16scalar:
        movq    (%rsi), %rax
        movq    8(%rsi), %rcx
        xorq    (%rdi), %rax
        xorq    8(%rdi), %rcx
        orq     %rax, %rcx
        setne   %al
        movzbl  %al, %eax
        retq
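
And a rough C equivalent of the scalar sequence (two 8-byte loads per side, xor, or; again just a sketch):

#include <stdint.h>
#include <string.h>

/* Rough C equivalent of _cmp16scalar above; memcpy models the
   potentially unaligned movq loads. */
int cmp16scalar_c(const void *a, const void *b) {
  uint64_t a0, a1, b0, b1;
  memcpy(&a0, a, 8);
  memcpy(&a1, (const char *)a + 8, 8);
  memcpy(&b0, b, 8);
  memcpy(&b1, (const char *)b + 8, 8);
  return ((a0 ^ b0) | (a1 ^ b1)) != 0;   /* xorq + xorq + orq + setne */
}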
https://reviews.llvm.org/D31290