[cfe-dev] __builtin_parity assembly without popcnt

via cfe-dev cfe-dev at lists.llvm.org
Mon Jul 23 07:21:15 PDT 2018

This looks (not closely checked) like it's implementing the first part of pop1() combined with the last part of pop3(). Which is suggested in the comment just after pop1().




And then anding with 1, obviously.


I wonder if division is good enough on modern machines to make pop2() faster.


XORing down to a byte and then using the x86 built in parity flag is obviously better if you are on an x86, of course. Other machines don't usually have that.


Oh, using an x86-specific feature would explain a lot. Actually gives me some ideas as to some more possible microoptimizations…


Thanks to you (and Eli) for the explanation!


On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> > wrote:

Hello all!


I’ve been mucking around in an old codebase at work looking for easy performance wins. One avenue involves replacing a switch-based variable assignment with something derived from the parity of an input variable. I was pretty surprised when I saw the generated assembly, and I’m wondering about the reasoning behind it.


In short, it boils down to the assembly __builtin_parity() produces. Clang 6.0.1 (and trunk on Godbolt) produces:


parity(int):                             # @parity(int)

        mov     eax, edi

        shr     eax

        and     eax, 1431655765

        sub     edi, eax

        mov     eax, edi

        and     eax, 858993459

        shr     edi, 2

        and     edi, 858993459

        add     edi, eax

        mov     eax, edi

        shr     eax, 4

        add     eax, edi

        and     eax, 17764111

        imul    eax, eax, 16843009

        shr     eax, 24

        and     eax, 1



While GCC 8.1.0 (and trunk on Godbolt) produces



        mov     eax, edi

        shr     edi, 16

        xor     eax, edi

        xor     al, ah

        setnp   al

        movzx   eax, al



I know a popcnt followed by an and would be better, but unfortunately some of my users don’t have computers that support the popcnt instruction, so I can’t use a newer -march flag.


Could someone explain why the difference between Clang and GCC here, and whether it should make a difference? The code in question is in a hot loop in my code, so I’d imagine the size difference could impact unrolling (and result in icache differences too), but I haven’t finished poking around with benchmarks.





cfe-dev mailing list
cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> 


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180723/b9f24b80/attachment.html>

More information about the cfe-dev mailing list