[cfe-dev] __builtin_parity assembly without popcnt
Bruce Hoult via cfe-dev
cfe-dev at lists.llvm.org
Fri Jul 20 11:36:58 PDT 2018
This looks (not closely checked) like it's implementing the first part of
pop1() combined with the last part of pop3(). Which is suggested in the
comment just after pop1().
http://www.hackersdelight.org/hdcodetxt/pop.c.txt
And then anding with 1, obviously.
I wonder if division is good enough on modern machines to make pop2()
faster.
XORing down to a byte and then using the x86 built in parity flag is
obviously better if you are on an x86, of course. Other machines don't
usually have that.
On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org>
wrote:
> Hello all!
>
>
>
> I’ve been mucking around in an old codebase at work looking for easy
> performance wins. One avenue involves replacing a switch-based variable
> assignment with something derived from the parity of an input variable. I
> was pretty surprised when I saw the generated assembly, and I’m wondering
> about the reasoning behind it.
>
>
>
> In short, it boils down to the assembly __builtin_parity() produces.
> Clang 6.0.1 (and trunk on Godbolt) produces:
>
>
>
> parity(int): # @parity(int)
>
> mov eax, edi
>
> shr eax
>
> and eax, 1431655765
>
> sub edi, eax
>
> mov eax, edi
>
> and eax, 858993459
>
> shr edi, 2
>
> and edi, 858993459
>
> add edi, eax
>
> mov eax, edi
>
> shr eax, 4
>
> add eax, edi
>
> and eax, 17764111
>
> imul eax, eax, 16843009
>
> shr eax, 24
>
> and eax, 1
>
> ret
>
>
>
> While GCC 8.1.0 (and trunk on Godbolt) produces
>
>
>
> parity(int):
>
> mov eax, edi
>
> shr edi, 16
>
> xor eax, edi
>
> xor al, ah
>
> setnp al
>
> movzx eax, al
>
> ret
>
>
>
> I know a popcnt followed by an and would be better, but unfortunately some
> of my users don’t have computers that support the popcnt instruction, so I
> can’t use a newer -march flag.
>
>
>
> Could someone explain why the difference between Clang and GCC here, and
> whether it should make a difference? The code in question is in a hot loop
> in my code, so I’d imagine the size difference could impact unrolling (and
> result in icache differences too), but I haven’t finished poking around
> with benchmarks.
>
>
>
> Thanks,
>
>
>
> Alex
>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180720/7999aea9/attachment.html>
More information about the cfe-dev
mailing list