[cfe-dev] __builtin_parity assembly without popcnt
Benjamin Kramer via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 23 09:57:39 PDT 2018
I thought the parity flag only affects the last byte, meaning GCC's
lowering is optimal.
On Mon, Jul 23, 2018 at 6:31 PM Craig Topper via cfe-dev <
cfe-dev at lists.llvm.org> wrote:
> Why does gcc xor down to a byte? Can't you get the parity flag for the
> whole value by just oring with itself and grabbing the flag?
>
> ~Craig
>
>
> On Fri, Jul 20, 2018 at 11:37 AM Bruce Hoult via cfe-dev <
> cfe-dev at lists.llvm.org> wrote:
>
>> This looks (not closely checked) like it's implementing the first part of
>> pop1() combined with the last part of pop3(). Which is suggested in the
>> comment just after pop1().
>>
>> http://www.hackersdelight.org/hdcodetxt/pop.c.txt
>>
>> And then anding with 1, obviously.
>>
>> I wonder if division is good enough on modern machines to make pop2()
>> faster.
>>
>> XORing down to a byte and then using the x86 built in parity flag is
>> obviously better if you are on an x86, of course. Other machines don't
>> usually have that.
>>
>>
>> On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org>
>> wrote:
>>
>>> Hello all!
>>>
>>>
>>>
>>> I’ve been mucking around in an old codebase at work looking for easy
>>> performance wins. One avenue involves replacing a switch-based variable
>>> assignment with something derived from the parity of an input variable. I
>>> was pretty surprised when I saw the generated assembly, and I’m wondering
>>> about the reasoning behind it.
>>>
>>>
>>>
>>> In short, it boils down to the assembly __builtin_parity() produces.
>>> Clang 6.0.1 (and trunk on Godbolt) produces:
>>>
>>>
>>>
>>> parity(int): # @parity(int)
>>>
>>> mov eax, edi
>>>
>>> shr eax
>>>
>>> and eax, 1431655765
>>>
>>> sub edi, eax
>>>
>>> mov eax, edi
>>>
>>> and eax, 858993459
>>>
>>> shr edi, 2
>>>
>>> and edi, 858993459
>>>
>>> add edi, eax
>>>
>>> mov eax, edi
>>>
>>> shr eax, 4
>>>
>>> add eax, edi
>>>
>>> and eax, 17764111
>>>
>>> imul eax, eax, 16843009
>>>
>>> shr eax, 24
>>>
>>> and eax, 1
>>>
>>> ret
>>>
>>>
>>>
>>> While GCC 8.1.0 (and trunk on Godbolt) produces
>>>
>>>
>>>
>>> parity(int):
>>>
>>> mov eax, edi
>>>
>>> shr edi, 16
>>>
>>> xor eax, edi
>>>
>>> xor al, ah
>>>
>>> setnp al
>>>
>>> movzx eax, al
>>>
>>> ret
>>>
>>>
>>>
>>> I know a popcnt followed by an and would be better, but unfortunately
>>> some of my users don’t have computers that support the popcnt instruction,
>>> so I can’t use a newer -march flag.
>>>
>>>
>>>
>>> Could someone explain why the difference between Clang and GCC here, and
>>> whether it should make a difference? The code in question is in a hot loop
>>> in my code, so I’d imagine the size difference could impact unrolling (and
>>> result in icache differences too), but I haven’t finished poking around
>>> with benchmarks.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Alex
>>>
>>> _______________________________________________
>>> cfe-dev mailing list
>>> cfe-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>>
>>>
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180723/67ccd5a7/attachment.html>
More information about the cfe-dev
mailing list