[cfe-dev] __builtin_parity assembly without popcnt

Fri Jul 20 11:02:11 PDT 2018

Hello all!

I've been mucking around in an old codebase at work looking for easy
performance wins. One avenue involves replacing a switch-based variable
assignment with something derived from the parity of an input variable. I
was pretty surprised when I saw the generated assembly, and I'm wondering
about the reasoning behind it.

In short, it boils down to the assembly __builtin_parity() produces. Clang
6.0.1 (and trunk on Godbolt) produces:

parity(int):                             # @parity(int)
        mov     eax, edi
        shr     eax
        and     eax, 1431655765
        sub     edi, eax
        mov     eax, edi
        and     eax, 858993459
        shr     edi, 2
        and     edi, 858993459
        add     edi, eax
        mov     eax, edi
        shr     eax, 4
        add     eax, edi
        and     eax, 252645135
        imul    eax, eax, 16843009
        shr     eax, 24
        and     eax, 1
        ret
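
For readers who would rather not decode the immediates: the sequence above appears to be the usual bit-parallel population count followed by a mask of the low bit. Below is a rough C rendering, assuming the source function was simply `return __builtin_parity(x);`. The name parity_via_popcount is made up for illustration; the hex masks 0x55555555, 0x33333333 and 0x01010101 are the decimal immediates 1431655765, 858993459 and 16843009.

```c
#include <stdint.h>

/* Hypothetical helper: a portable sketch of "popcount, then keep bit 0". */
static unsigned parity_via_popcount(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* per-2-bit counts */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* per-4-bit counts */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* per-byte counts */
    x = (x * 0x01010101u) >> 24;                      /* sum the four bytes into the top byte */
    return x & 1u;                                    /* parity is the low bit of the count */
}
```

So the full bit count is computed and then all but one bit of it is thrown away, which is why the sequence is so long.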

GCC 8.1.0 (and trunk on Godbolt), by contrast, produces:

parity(int):
        mov     eax, edi
        shr     edi, 16
        xor     eax, edi
        xor     al, ah
        setnp   al
        movzx   eax, al
        ret
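
As far as I can tell, the GCC version folds the 32-bit value down to a single byte with XORs and then reads the x86 parity flag, which reflects only the low byte of the last result. Below is a sketch of the same idea in portable C. The name parity_via_xor_fold is made up for illustration, and since C has no parity flag, the last three folds stand in for what `setnp` gets from the hardware for free.

```c
#include <stdint.h>

/* Hypothetical helper: XOR-fold the word until bit 0 holds the parity. */
static unsigned parity_via_xor_fold(uint32_t x)
{
    x ^= x >> 16;  /* mov / shr 16 / xor: fold the upper half into the lower half */
    x ^= x >> 8;   /* xor al, ah: fold the remaining two bytes into one */
    /* From here on the hardware parity flag does the work in the GCC output;
       in portable C we keep folding by hand instead. */
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u; /* 1 for an odd number of set bits, matching setnp */
}
```

XOR preserves parity at every fold, so no bit count is ever needed.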

I know a popcnt followed by an and would be better, but unfortunately some
of my users don't have computers that support the popcnt instruction, so I
can't use a newer -march flag.

Could someone explain why Clang and GCC differ here, and whether the
difference should matter in practice? The call sits in a hot loop, so I'd
imagine the size difference could affect unrolling (and icache behavior
too), but I haven't finished poking around with benchmarks.

Thanks,

Alex 
