[cfe-dev] __builtin_parity assembly without popcnt

Mon Jul 23 09:31:15 PDT 2018

Why does gcc xor down to a byte? Can't you get the parity flag for the
whole value by just oring with itself and grabbing the flag?

~Craig

On Fri, Jul 20, 2018 at 11:37 AM Bruce Hoult via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> This looks (not closely checked) like it's implementing the first part of
> pop1() combined with the last part of pop3(). Which is suggested in the
> comment just after pop1().
>
> http://www.hackersdelight.org/hdcodetxt/pop.c.txt
>
> And then anding with 1, obviously.
>
> I wonder if division is good enough on modern machines to make pop2()
> faster.
>
> XORing down to a byte and then using the x86 built in parity flag is
> obviously better if you are on an x86, of course. Other machines don't
> usually have that.
>
>
> On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org>
> wrote:
>
>> Hello all!
>>
>>
>>
>> I’ve been mucking around in an old codebase at work looking for easy
>> performance wins. One avenue involves replacing a switch-based variable
>> assignment with something derived from the parity of an input variable. I
>> was pretty surprised when I saw the generated assembly, and I’m wondering
>> about the reasoning behind it.
>>
>>
>>
>> In short, it boils down to the assembly __builtin_parity() produces.
>> Clang 6.0.1 (and trunk on Godbolt) produces:
>>
>>
>>
>> parity(int):                             # @parity(int)
>>
>>         mov     eax, edi
>>
>>         shr     eax
>>
>>         and     eax, 1431655765
>>
>>         sub     edi, eax
>>
>>         mov     eax, edi
>>
>>         and     eax, 858993459
>>
>>         shr     edi, 2
>>
>>         and     edi, 858993459
>>
>>         add     edi, eax
>>
>>         mov     eax, edi
>>
>>         shr     eax, 4
>>
>>         add     eax, edi
>>
>>         and     eax, 17764111
>>
>>         imul    eax, eax, 16843009
>>
>>         shr     eax, 24
>>
>>         and     eax, 1
>>
>>         ret
>>
>>
>>
>> While GCC 8.1.0 (and trunk on Godbolt) produces
>>
>>
>>
>> parity(int):
>>
>>         mov     eax, edi
>>
>>         shr     edi, 16
>>
>>         xor     eax, edi
>>
>>         xor     al, ah
>>
>>         setnp   al
>>
>>         movzx   eax, al
>>
>>         ret
>>
>>
>>
>> I know a popcnt followed by an and would be better, but unfortunately
>> some of my users don’t have computers that support the popcnt instruction,
>> so I can’t use a newer -march flag.
>>
>>
>>
>> Could someone explain why the difference between Clang and GCC here, and
>> whether it should make a difference? The code in question is in a hot loop
>> in my code, so I’d imagine the size difference could impact unrolling (and
>> result in icache differences too), but I haven’t finished poking around
>> with benchmarks.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Alex
>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180723/761767fd/attachment.html>