[cfe-dev] __builtin_parity assembly without popcnt

Mon Jul 23 09:57:39 PDT 2018

I thought the parity flag only affects the last byte, meaning GCC's
lowering is optimal.

On Mon, Jul 23, 2018 at 6:31 PM Craig Topper via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> Why does gcc xor down to a byte? Can't you get the parity flag for the
> whole value by just oring with itself and grabbing the flag?
>
> ~Craig
>
>
> On Fri, Jul 20, 2018 at 11:37 AM Bruce Hoult via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> This looks (not closely checked) like it's implementing the first part of
>> pop1() combined with the last part of pop3(). Which is suggested in the
>> comment just after pop1().
>>
>> http://www.hackersdelight.org/hdcodetxt/pop.c.txt
>>
>> And then anding with 1, obviously.
>>
>> I wonder if division is good enough on modern machines to make pop2()
>> faster.
>>
>> XORing down to a byte and then using the x86 built in parity flag is
>> obviously better if you are on an x86, of course. Other machines don't
>> usually have that.
>>
>>
>> On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org>
>> wrote:
>>
>>> Hello all!
>>>
>>>
>>>
>>> I’ve been mucking around in an old codebase at work looking for easy
>>> performance wins. One avenue involves replacing a switch-based variable
>>> assignment with something derived from the parity of an input variable. I
>>> was pretty surprised when I saw the generated assembly, and I’m wondering
>>> about the reasoning behind it.
>>>
>>>
>>>
>>> In short, it boils down to the assembly __builtin_parity() produces.
>>> Clang 6.0.1 (and trunk on Godbolt) produces:
>>>
>>>
>>>
>>> parity(int):                             # @parity(int)
>>>
>>>         mov     eax, edi
>>>
>>>         shr     eax
>>>
>>>         and     eax, 1431655765
>>>
>>>         sub     edi, eax
>>>
>>>         mov     eax, edi
>>>
>>>         and     eax, 858993459
>>>
>>>         shr     edi, 2
>>>
>>>         and     edi, 858993459
>>>
>>>         add     edi, eax
>>>
>>>         mov     eax, edi
>>>
>>>         shr     eax, 4
>>>
>>>         add     eax, edi
>>>
>>>         and     eax, 17764111
>>>
>>>         imul    eax, eax, 16843009
>>>
>>>         shr     eax, 24
>>>
>>>         and     eax, 1
>>>
>>>         ret
>>>
>>>
>>>
>>> While GCC 8.1.0 (and trunk on Godbolt) produces
>>>
>>>
>>>
>>> parity(int):
>>>
>>>         mov     eax, edi
>>>
>>>         shr     edi, 16
>>>
>>>         xor     eax, edi
>>>
>>>         xor     al, ah
>>>
>>>         setnp   al
>>>
>>>         movzx   eax, al
>>>
>>>         ret
>>>
>>>
>>>
>>> I know a popcnt followed by an and would be better, but unfortunately
>>> some of my users don’t have computers that support the popcnt instruction,
>>> so I can’t use a newer -march flag.
>>>
>>>
>>>
>>> Could someone explain why the difference between Clang and GCC here, and
>>> whether it should make a difference? The code in question is in a hot loop
>>> in my code, so I’d imagine the size difference could impact unrolling (and
>>> result in icache differences too), but I haven’t finished poking around
>>> with benchmarks.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Alex
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180723/67ccd5a7/attachment.html>