[cfe-dev] __builtin_parity assembly without popcnt
via cfe-dev
cfe-dev at lists.llvm.org
Fri Jul 20 11:02:11 PDT 2018
Hello all!
I've been mucking around in an old codebase at work looking for easy
performance wins. One avenue involves replacing a switch-based variable
assignment with something derived from the parity of an input variable. I
was pretty surprised when I saw the generated assembly, and I'm wondering
about the reasoning behind it.
In short, it boils down to the assembly __builtin_parity() produces. Clang
6.0.1 (and trunk on Godbolt) produces:
parity(int):                            # @parity(int)
        mov     eax, edi
        shr     eax
        and     eax, 1431655765
        sub     edi, eax
        mov     eax, edi
        and     eax, 858993459
        shr     edi, 2
        and     edi, 858993459
        add     edi, eax
        mov     eax, edi
        shr     eax, 4
        add     eax, edi
        and     eax, 17764111
        imul    eax, eax, 16843009
        shr     eax, 24
        and     eax, 1
        ret
While GCC 8.1.0 (and trunk on Godbolt) produces:
parity(int):
        mov     eax, edi
        shr     edi, 16
        xor     eax, edi
        xor     al, ah
        setnp   al
        movzx   eax, al
        ret
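For readers who prefer C to assembly, here is a rough sketch of what the two listings compute. The function names (parity_swar, parity_fold, parity_selfcheck) are made up for illustration and are not taken from either compiler. The first mirrors the Clang sequence: a bit-twiddling population count with the low bit kept at the end. The second mirrors the GCC sequence: XOR-folding down to one byte, with the last folds standing in for the x86 parity flag that setnp reads.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical C rendering of the Clang listing: SWAR popcount, keep bit 0. */
int parity_swar(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 1431655765: 2-bit counts */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 858993459: 4-bit counts */
    /* 17764111 is 0x010F0F0F: the usual 0x0F0F0F0F byte mask with the top
       byte narrowed, which is enough because only bit 0 of the sum is used. */
    x = (x + (x >> 4)) & 0x010F0F0Fu;
    x = (x * 0x01010101u) >> 24;                      /* 16843009: sum the bytes */
    return (int)(x & 1u);
}

/* Hypothetical C rendering of the GCC listing: XOR-fold 32 -> 16 -> 8 bits.
   In the assembly, the parity flag set by "xor al, ah" finishes the job;
   plain C has no parity flag, so the remaining folds are written out. */
int parity_fold(uint32_t x)
{
    x ^= x >> 16;   /* shr edi, 16 ; xor eax, edi */
    x ^= x >> 8;    /* xor al, ah */
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return (int)(x & 1u);
}

/* Compare both sketches against the builtin on a spread of inputs. */
void parity_selfcheck(void)
{
    for (uint32_t i = 0; i < 1000000u; ++i) {
        uint32_t x = i * 2654435761u;  /* arbitrary multiplier to scatter bits */
        assert(parity_swar(x) == __builtin_parity(x));
        assert(parity_fold(x) == __builtin_parity(x));
    }
}
```

Both functions should agree with __builtin_parity() for every 32-bit input; calling parity_selfcheck() exercises a sample of them.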
I know a popcnt followed by an and would be better, but unfortunately some
of my users don't have computers that support the popcnt instruction, so I
can't use a newer -march flag.
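One possible way around the -march restriction, sketched here under the assumption of GCC or Clang on x86, is to enable popcnt for a single function and select it at run time. The names parity_generic, parity_popcnt and parity_dispatch are hypothetical; the target attribute and __builtin_cpu_supports() are the GCC/Clang extensions this sketch relies on.

```c
#include <stdint.h>

/* Baseline version: compiled for whatever -march the build already uses. */
int parity_generic(uint32_t x)
{
    return __builtin_parity(x);
}

#if defined(__x86_64__) || defined(__i386__)
/* popcnt is enabled for this one function only, not for the whole build. */
__attribute__((target("popcnt")))
int parity_popcnt(uint32_t x)
{
    return __builtin_popcount(x) & 1;   /* popcnt followed by an and */
}

/* Pick an implementation based on what the running CPU reports. */
int parity_dispatch(uint32_t x)
{
    return __builtin_cpu_supports("popcnt") ? parity_popcnt(x)
                                            : parity_generic(x);
}
#else
/* Non-x86 targets: no popcnt variant in this sketch. */
int parity_dispatch(uint32_t x)
{
    return parity_generic(x);
}
#endif
```

In a hot loop the CPU check would presumably be hoisted out, for example by choosing a function pointer (or a whole loop variant) once at startup, since a per-call test and an out-of-line call could easily cost more than the instructions saved.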
Could someone explain why Clang and GCC differ here, and whether the
difference should matter in practice? The code in question is in a hot
loop, so I'd imagine the size difference could affect unrolling (and
icache behavior too), but I haven't finished benchmarking yet.
Thanks,
Alex