[cfe-dev] __builtin_parity assembly without popcnt

Friedman, Eli via cfe-dev cfe-dev at lists.llvm.org
Fri Jul 20 11:36:10 PDT 2018


On 7/20/2018 11:02 AM, via cfe-dev wrote:
>
> Hello all!
>
> I’ve been mucking around in an old codebase at work looking for easy 
> performance wins. One avenue involves replacing a switch-based 
> variable assignment with something derived from the parity of an input 
> variable. I was pretty surprised when I saw the generated assembly, 
> and I’m wondering about the reasoning behind it.
>
> In short, it boils down to the assembly __builtin_parity() produces. 
> Clang 6.0.1 (and trunk on Godbolt) produces:
>
> parity(int): # @parity(int)
>
> mov     eax, edi
>
> shr     eax
>
> and     eax, 1431655765
>
> sub     edi, eax
>
> mov     eax, edi
>
> and     eax, 858993459
>
> shr     edi, 2
>
> and     edi, 858993459
>
> add     edi, eax
>
> mov     eax, edi
>
> shr     eax, 4
>
> add     eax, edi
>
> and     eax, 17764111
>
> imul    eax, eax, 16843009
>
> shr     eax, 24
>
> and     eax, 1
>
> ret
>
> While GCC 8.1.0 (and trunk on Godbolt) produces
>
> parity(int):
>
> mov     eax, edi
>
> shr     edi, 16
>
> xor     eax, edi
>
> xor     al, ah
>
> setnp   al
>
> movzx   eax, al
>
> ret
>
> I know a popcnt followed by an and would be better, but unfortunately 
> some of my users don’t have computers that support the popcnt 
> instruction, so I can’t use a newer -march flag.
>
> Could someone explain why the difference between Clang and GCC here, 
> and whether it should make a difference? The code in question is in a 
> hot loop in my code, so I’d imagine the size difference could impact 
> unrolling (and result in icache differences too), but I haven’t 
> finished poking around with benchmarks.
>

LLVM doesn't have any special support for computing parity, so it's just 
getting lowered to "popcount(x)&1"; if your target doesn't have a 
popcount instruction, it uses the generic expansion (something like 
https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel 
).  This is obviously not the fastest lowering, but computing parity is 
not a common operation, so nobody has spent any time optimizing it.

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180720/1f0f0ffa/attachment-0001.html>


More information about the cfe-dev mailing list