[cfe-dev] __builtin_parity assembly without popcnt

via cfe-dev cfe-dev at lists.llvm.org
Fri Jul 20 11:02:11 PDT 2018

Hello all!


I've been mucking around in an old codebase at work looking for easy
performance wins. One avenue involves replacing a switch-based variable
assignment with something derived from the parity of an input variable. I
was pretty surprised when I saw the generated assembly, and I'm wondering
about the reasoning behind it.


In short, it boils down to the assembly __builtin_parity() produces. Clang
6.0.1 (and trunk on Godbolt) produces:


parity(int):                             # @parity(int)
        mov     eax, edi
        shr     eax
        and     eax, 1431655765
        sub     edi, eax
        mov     eax, edi
        and     eax, 858993459
        shr     edi, 2
        and     edi, 858993459
        add     edi, eax
        mov     eax, edi
        shr     eax, 4
        add     eax, edi
        and     eax, 17764111
        imul    eax, eax, 16843009
        shr     eax, 24
        and     eax, 1
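
For readers following the listing above, here is a rough C sketch of what that instruction sequence appears to compute: a bit-parallel population count whose result is then masked down to its low bit. The function name is hypothetical, and the constants are simply the decimal immediates from the listing rewritten in hex. This is a reading of the assembly, not a claim about how Clang implements the builtin internally.

```c
#include <stdint.h>

/* Hypothetical helper: mirrors the Clang listing step by step. */
static unsigned parity_via_swar_popcount(uint32_t x)
{
    /* 1431655765 == 0x55555555: count bits within each 2-bit field. */
    x = x - ((x >> 1) & 0x55555555u);
    /* 858993459 == 0x33333333: sum adjacent 2-bit counts into 4-bit fields. */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    /* 17764111 == 0x010F0F0F: per-byte counts; the top byte keeps only its
       low bit, which is all the final "& 1" depends on. */
    x = (x + (x >> 4)) & 0x010F0F0Fu;
    /* 16843009 == 0x01010101: the multiply adds the four bytes into the top
       byte; the shift extracts that sum and the mask keeps its parity. */
    return ((x * 0x01010101u) >> 24) & 1u;
}
```

The result is 1 when the input has an odd number of set bits, matching the convention of __builtin_parity().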



GCC 8.1.0 (and trunk on Godbolt), on the other hand, produces:



        mov     eax, edi
        shr     edi, 16
        xor     eax, edi
        xor     al, ah
        setnp   al
        movzx   eax, al
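
As a rough C-level reading of the GCC listing above: it XOR-folds the 32-bit value down to a single byte, then lets the x86 parity flag (read by setnp) finish the job for that byte. Portable C has no access to the parity flag, so the sketch below keeps folding down to one bit instead. The function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: XOR-fold parity, following the shape of the GCC listing. */
static unsigned parity_via_xor_fold(uint32_t x)
{
    x ^= x >> 16;   /* mov / shr edi, 16 / xor eax, edi: fold the upper half in */
    x ^= x >> 8;    /* xor al, ah: fold the remaining two bytes into one */
    /* The listing stops folding here and uses setnp on the low byte.
       In plain C, continue folding the byte down to a single bit. */
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;  /* 1 if the input had an odd number of set bits */
}
```

XOR preserves parity at every fold, which is why the low bit of the final value is the parity of the whole word.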



I know a popcnt followed by an and would be better, but unfortunately some
of my users don't have computers that support the popcnt instruction, so I
can't use a newer -march flag.
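
For reference, a minimal sketch of the popcnt-based form mentioned above, assuming a GCC- or Clang-compatible compiler that provides __builtin_popcount(). The function name is hypothetical. Whether this actually compiles to a single popcnt instruction followed by an and depends on the target flags in use; without popcnt enabled for the target, the compiler has to fall back to a software sequence.

```c
/* Hypothetical helper: parity as the low bit of the population count.
   Relies on the GCC/Clang builtin __builtin_popcount(). */
static unsigned parity_via_popcount_builtin(unsigned x)
{
    return (unsigned)__builtin_popcount(x) & 1u;
}
```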


Could someone explain why Clang and GCC generate such different code here, and
whether it should make a difference? The code in question is in a hot loop
in my code, so I'd imagine the size difference could impact unrolling (and
result in icache differences too), but I haven't finished poking around with




