[cfe-dev] __builtin_parity assembly without popcnt

Fri Jul 20 11:36:58 PDT 2018

This looks (not closely checked) like it's implementing the first part of
pop1() combined with the last part of pop3(), which is what the comment
just after pop1() suggests.

http://www.hackersdelight.org/hdcodetxt/pop.c.txt

And then anding with 1, obviously.
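
For reference, here is a minimal C sketch of what the Clang sequence appears to compute, read off the constants in the quoted assembly (1431655765 = 0x55555555, 858993459 = 0x33333333, 16843009 = 0x01010101). The function names are mine, not from pop.c.txt, and I use the usual full 0x0F0F0F0F mask where Clang emits a narrower constant; that does not change the low bit of the result.

```c
#include <stdint.h>

/* Bit-parallel population count of a 32-bit word (hypothetical name). */
static unsigned popcount32_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit field sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit field sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit field sums  */
    return (x * 0x01010101u) >> 24;                   /* add the four bytes */
}

/* Parity is the low bit of the population count. */
static unsigned parity32_via_popcount(uint32_t x)
{
    return popcount32_swar(x) & 1u;
}
```

So parity32_via_popcount(7) would be 1 (three bits set) and parity32_via_popcount(3) would be 0.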

I wonder if division is good enough on modern machines to make pop2()
faster.
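
To make the question concrete, here is a sketch of the HAKMEM-style count that finishes with a remainder by 63. I am assuming that is what pop2() does and have not re-checked it against the linked file; the function names are hypothetical.

```c
#include <stdint.h>

/* HAKMEM-style population count: sum bits within 3-bit fields, pair the
   fields into 6-bit fields, then add those up with a remainder by 63
   (64 is congruent to 1 mod 63, and the total never exceeds 32). */
static unsigned popcount32_mod63(uint32_t x)
{
    uint32_t n;

    n = (x >> 1) & 033333333333u;       /* octal masks: 3-bit fields */
    x = x - n;
    n = (n >> 1) & 033333333333u;
    x = x - n;                          /* each field now holds its bit count */
    x = (x + (x >> 3)) & 030707070707u; /* 6-bit field sums */
    return x % 63u;                     /* the division in question */
}

static unsigned parity32_mod63(uint32_t x)
{
    return popcount32_mod63(x) & 1u;
}
```

Whether the single remainder beats the multiply-and-shift ending is exactly the part that would need measuring.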

XORing down to a byte and then using the x86 built-in parity flag is
obviously better if you are on x86. Other machines don't usually have
that.
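
A portable C sketch of the XOR-folding idea, for machines without a parity flag: instead of stopping at a byte and reading the flag as the GCC output does, it keeps folding down to a single bit. The function name is hypothetical.

```c
#include <stdint.h>

/* Fold the word onto itself with XOR; after each step the parity of the
   low half equals the parity of the original value. */
static unsigned parity32_fold(uint32_t x)
{
    x ^= x >> 16;   /* same first step as GCC's shr 16 / xor        */
    x ^= x >> 8;    /* corresponds to GCC's xor al, ah              */
    x ^= x >> 4;    /* from here x86 would just read the parity flag */
    x ^= x >> 2;
    x ^= x >> 1;
    return x & 1u;  /* 1 if an odd number of bits were set */
}
```

A compiler may or may not recognize this pattern and turn it into the flag-based sequence, so the generated assembly is worth checking.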

On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org>
wrote:

> Hello all!
>
> I’ve been mucking around in an old codebase at work looking for easy
> performance wins. One avenue involves replacing a switch-based variable
> assignment with something derived from the parity of an input variable. I
> was pretty surprised when I saw the generated assembly, and I’m wondering
> about the reasoning behind it.
>
> In short, it boils down to the assembly __builtin_parity() produces.
> Clang 6.0.1 (and trunk on Godbolt) produces:
>
> parity(int):                             # @parity(int)
>         mov     eax, edi
>         shr     eax
>         and     eax, 1431655765
>         sub     edi, eax
>         mov     eax, edi
>         and     eax, 858993459
>         shr     edi, 2
>         and     edi, 858993459
>         add     edi, eax
>         mov     eax, edi
>         shr     eax, 4
>         add     eax, edi
>         and     eax, 17764111
>         imul    eax, eax, 16843009
>         shr     eax, 24
>         and     eax, 1
>         ret
>
> While GCC 8.1.0 (and trunk on Godbolt) produces:
>
> parity(int):
>         mov     eax, edi
>         shr     edi, 16
>         xor     eax, edi
>         xor     al, ah
>         setnp   al
>         movzx   eax, al
>         ret
>
> I know a popcnt followed by an and would be better, but unfortunately some
> of my users don’t have computers that support the popcnt instruction, so I
> can’t use a newer -march flag.
>
> Could someone explain why Clang and GCC differ here, and whether it
> should make a difference? The code in question is in a hot loop, so I’d
> imagine the size difference could impact unrolling (and result in icache
> differences too), but I haven’t finished poking around with benchmarks.
>
> Thanks,
>
> Alex