            Bug ID: 33325
           Summary: [X86][SSE] Improve equality memcmp support
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: Backend: X86
          Assignee: unassignedbugs at nondot.org
          Reporter: llvm-dev at redking.me.uk
                CC: filcab at gmail.com, llvm-bugs at lists.llvm.org,
                    spatel+llvm at rotateright.com

int cmpeq16(const char *a, const char *b) {
        return 0 == __builtin_memcmp(a, b, 16);
}
int cmpeq32(const char *a, const char *b) {
        return 0 == __builtin_memcmp(a, b, 32);
}

On SSE2 through AVX1 targets, an equality memcmp of 16 bytes is lowered to
SIMD code, but a 32-byte compare is still scalarized:

cmpeq16(char const*, char const*):                       # @cmpeq16(char const*, char const*)
        vmovdqu (%rdi), %xmm0
        xorl    %eax, %eax
        vpcmpeqb        (%rsi), %xmm0, %xmm0
        vpmovmskb       %xmm0, %ecx
        cmpl    $65535, %ecx            # imm = 0xFFFF
        sete    %al

cmpeq32(char const*, char const*):                       # @cmpeq32(char const*, char const*)
        movq    16(%rdi), %rax
        movq    (%rdi), %rcx
        movq    8(%rdi), %rdx
        movq    24(%rdi), %rdi
        xorq    24(%rsi), %rdi
        xorq    8(%rsi), %rdx
        xorq    16(%rsi), %rax
        xorq    (%rsi), %rcx
        orq     %rax, %rcx
        orq     %rdi, %rdx
        xorl    %eax, %eax
        orq     %rdx, %rcx
        sete    %al

cmpeq32 is even worse on 32-bit targets.

Ideally it'd be something like:

cmpeq32(char const*, char const*):
        vmovdqu (%rdi), %xmm0
        vmovdqu 16(%rdi), %xmm1
        xorl    %eax, %eax
        vpcmpeqb        (%rsi), %xmm0, %xmm0
        vpcmpeqb        16(%rsi), %xmm1, %xmm1
        vpand   %xmm1, %xmm0, %xmm0
        vpmovmskb       %xmm0, %ecx
        cmpl    $65535, %ecx            # imm = 0xFFFF
        sete    %al

I'm not sure what the upper limit should be, but 32 bytes on SSE2 through AVX1
and 64 bytes on AVX2 should definitely be fine (I have no idea what the best
solution is on AVX512).
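
For reference, the proposed 32-byte sequence can be sketched by hand with SSE2
intrinsics. This is only an illustration of the desired lowering, not how the
backend would implement it; the function name cmpeq32_sse2 is hypothetical, and
the sketch assumes an x86 target with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper (not part of LLVM): returns 1 if the 32 bytes at a
   and b are equal, 0 otherwise. Mirrors the proposed asm: two unaligned
   16-byte loads per operand, two byte-wise compares, one AND, one movemask. */
static int cmpeq32_sse2(const char *a, const char *b) {
        __m128i a0 = _mm_loadu_si128((const __m128i *)a);
        __m128i a1 = _mm_loadu_si128((const __m128i *)(a + 16));
        __m128i b0 = _mm_loadu_si128((const __m128i *)b);
        __m128i b1 = _mm_loadu_si128((const __m128i *)(b + 16));

        /* Each byte lane becomes 0xFF where the inputs match, else 0x00. */
        __m128i eq0 = _mm_cmpeq_epi8(a0, b0);
        __m128i eq1 = _mm_cmpeq_epi8(a1, b1);

        /* A lane survives the AND only if it matched in both halves. */
        __m128i both = _mm_and_si128(eq0, eq1);

        /* All 16 mask bits set means all 32 bytes compared equal. */
        return _mm_movemask_epi8(both) == 0xFFFF;
}
```

Combining the two compare results with an AND before the movemask keeps the
final test identical to the 16-byte case (a single compare against 0xFFFF).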

