[PATCH] D31290: [x86] use PMOVMSK to replace memcmp libcalls for 16-byte equality
Sanjay Patel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Thu Mar 23 15:36:29 PDT 2017
spatel added inline comments.
================
Comment at: test/CodeGen/X86/memcmp.ll:104
+; CHECK-NEXT: pmovmskb %xmm1, %eax
+; CHECK-NEXT: cmpl $65535, %eax # imm = 0xFFFF
; CHECK-NEXT: setne %al
----------------
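For reference, the CHECK lines above cover the 16-byte equality case: a memcmp whose
result is only tested against zero, with a constant length of 16. A hypothetical C
source pattern that would hit this transform (the function name is illustrative, not
taken from the test file) is roughly:

    #include <string.h>

    /* The memcmp result is only compared against zero and the length is a
       constant 16, so codegen can use movdqu + pcmpeqb + pmovmskb in place
       of the libcall, as the CHECK lines above show. */
    int is16eq(const void *x, const void *y) {
      return memcmp(x, y, 16) != 0;
    }
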
spatel wrote:
> spatel wrote:
> > efriedma wrote:
> > > What's the performance of this compared to using integer registers? (movq+xorq+movq+xorq+orq).
> > Hmm... I didn't consider that option, since movmsk has been fast for a long time and the scalar sequence always needs more ops. We'd also need to separate x86-32 from x86-64. I'll try to get some real numbers.
> >
> I benchmarked the 2 sequences shown below and the libcall. On Haswell with macOS, I'm seeing more wobble in these numbers than I can explain, but:
>
> memcmp : 34485936 cycles for 1048576 iterations (32.89 cycles/iter).
> vec cmp : 5245888 cycles for 1048576 iterations (5.00 cycles/iter).
> xor cmp : 5247940 cycles for 1048576 iterations (5.00 cycles/iter).
>
> On Ubuntu with AMD Jaguar:
>
> memcmp : 21150343 cycles for 1048576 iterations (20.17 cycles/iter).
> vec cmp : 9988395 cycles for 1048576 iterations (9.53 cycles/iter).
> xor cmp : 9471849 cycles for 1048576 iterations (9.03 cycles/iter).
>
>
> .align 6, 0x90
> .global _cmp16vec
> _cmp16vec:
>     movdqu (%rsi), %xmm0        # unaligned 16-byte load of the second buffer
>     movdqu (%rdi), %xmm1        # unaligned 16-byte load of the first buffer
>     pcmpeqb %xmm0, %xmm1        # bytewise compare: lanes become 0xFF where equal
>     pmovmskb %xmm1, %eax        # collapse the lane mask into a 16-bit integer
>     cmpl $65535, %eax           # all bytes equal <=> mask == 0xFFFF
>     setne %al                   # return 1 if any byte differs
>     movzbl %al, %eax
>     retq
>
> .align 6, 0x90
> .global _cmp16scalar
> _cmp16scalar:
>     movq (%rsi), %rax           # low 8 bytes of the second buffer
>     movq 8(%rsi), %rcx          # high 8 bytes of the second buffer
>     xorq (%rdi), %rax           # xor against the first buffer: nonzero on mismatch
>     xorq 8(%rdi), %rcx
>     orq %rax, %rcx              # ZF is set only if all 16 bytes match
>     setne %al                   # return 1 if any byte differs
>     movzbl %al, %eax
>     retq
>
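> Roughly the shape of the rdtsc-based timing loop behind the numbers above; the
> details here (the cmp16vec declaration, buffer contents, iteration count, and
> reporting) are simplified sketches for illustration, not the exact harness:
>
>     #include <stdint.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <x86intrin.h>              /* __rdtsc */
>
>     /* The _cmp16vec routine above; on macOS the C name drops the leading
>        underscore. Link the assembled .s file with this harness. */
>     extern int cmp16vec(const void *a, const void *b);
>
>     int main(void) {
>       unsigned char a[16], b[16];
>       memset(a, 0xAB, sizeof a);
>       memcpy(b, a, sizeof b);            /* equal buffers */
>
>       const uint64_t iters = 1048576;    /* same iteration count as above */
>       volatile int sink = 0;             /* keep the loop from being discarded */
>
>       uint64_t start = __rdtsc();
>       for (uint64_t i = 0; i < iters; ++i)
>         sink ^= cmp16vec(a, b);
>       uint64_t cycles = __rdtsc() - start;
>
>       printf("vec cmp : %llu cycles for %llu iterations (%.2f cycles/iter).\n",
>              (unsigned long long)cycles, (unsigned long long)iters,
>              (double)cycles / (double)iters);
>       return sink;
>     }
>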
There will be bugs:
https://bugs.llvm.org/show_bug.cgi?id=32401
https://reviews.llvm.org/D31290