[cfe-dev] __builtin_parity assembly without popcnt
via cfe-dev
cfe-dev at lists.llvm.org
Mon Jul 23 07:21:15 PDT 2018
This looks (not closely checked) like it's implementing the first part of pop1() combined with the last part of pop3(). Which is suggested in the comment just after pop1().
http://www.hackersdelight.org/hdcodetxt/pop.c.txt
And then anding with 1, obviously.
I wonder if division is good enough on modern machines to make pop2() faster.
XORing down to a byte and then using the x86 built in parity flag is obviously better if you are on an x86, of course. Other machines don't usually have that.
Oh, using an x86-specific feature would explain a lot. Actually gives me some ideas as to some more possible microoptimizations…
Thanks to you (and Eli) for the explanation!
On Fri, Jul 20, 2018 at 11:02 AM, via cfe-dev <cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org> > wrote:
Hello all!
I’ve been mucking around in an old codebase at work looking for easy performance wins. One avenue involves replacing a switch-based variable assignment with something derived from the parity of an input variable. I was pretty surprised when I saw the generated assembly, and I’m wondering about the reasoning behind it.
In short, it boils down to the assembly __builtin_parity() produces. Clang 6.0.1 (and trunk on Godbolt) produces:
parity(int): # @parity(int)
mov eax, edi
shr eax
and eax, 1431655765
sub edi, eax
mov eax, edi
and eax, 858993459
shr edi, 2
and edi, 858993459
add edi, eax
mov eax, edi
shr eax, 4
add eax, edi
and eax, 17764111
imul eax, eax, 16843009
shr eax, 24
and eax, 1
ret
While GCC 8.1.0 (and trunk on Godbolt) produces
parity(int):
mov eax, edi
shr edi, 16
xor eax, edi
xor al, ah
setnp al
movzx eax, al
ret
I know a popcnt followed by an and would be better, but unfortunately some of my users don’t have computers that support the popcnt instruction, so I can’t use a newer -march flag.
Could someone explain why the difference between Clang and GCC here, and whether it should make a difference? The code in question is in a hot loop in my code, so I’d imagine the size difference could impact unrolling (and result in icache differences too), but I haven’t finished poking around with benchmarks.
Thanks,
Alex
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org <mailto:cfe-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20180723/b9f24b80/attachment.html>
More information about the cfe-dev
mailing list