[cfe-dev] "Optimized implementations"?

Wed Sep 9 10:01:55 PDT 2020

I guess you were probably asking about __builtin_parity not the builtins
library. For __builtin_parity, the frontend emits a llvm.ctpop+ and w/ 1.
The backend has a peephole to pattern match this to the parity sequence.
But the peephole is part of a larger optimization pass that isn't
running in -O0 to save compile time. So it doesn't get pattern matched so
we emit the AND and an expand popcount. To fix this correctly for -O0 we
need to add a llvm.parity intrinsic that the frontend can emit directly
instead of the ctpop+and. Then we wouldn't need the peephole pattern match.

~Craig

On Wed, Sep 9, 2020 at 9:25 AM Craig Topper <craig.topper at gmail.com> wrote:

> The code in the builtins library is written in C and can't use
> __builtin_parity since it needs to compile with compilers that don't have
> __builtin_parity like MSVC. So to get the optimized code, we would have to
> have a version written in assembly for X86. Probably two assembly versions
> since MASM and gcc use different assembly syntax. So we'd have 3 different
> versions of parity for 3 different bit widths. Maybe some macros could
> allow us to share some bit widths or something. But ultimately it was a
> question of where to spend effort.
>
> ~Craig
>
>
> On Wed, Sep 9, 2020 at 9:08 AM Stefan Kanthak <stefan.kanthak at nexgo.de>
> wrote:
>
>> "Craig Topper" <craig.topper at gmail.com> wrote:
>>
>> > Turn on optimizations.
>>
>> What's the (deeper) reason NOT always to generate the optimised code for
>> __builtins_*?
>> Alternative question: do you also ship a library with unoptimised code
>> which
>> is linked when the -O option is not given?
>>
>> Just wondering about surprising behaviour
>> Stefan
>>
>> > On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <stefan.kanthak at nexgo.de>
>> > wrote:
>> >
>> >> "Craig Topper" <craig.topper at gmail.com> wrote:
>> >>
>> >> > __builtin_parity uses setnp on older x86 and popcnt with sse4.2
>> >>
>> >> Reality check, PLEASE:
>> >>
>> >> --- bug.c ---
>> >> int main(int argc, char *argv[]) {
>> >>     return __builtin_parity(argc);
>> >> }
>> >> --- EOF ---
>> >>
>> >> clang -o- -target i386-pc-linux -S bug.c
>> >> clang version 10.0.0
>> >> Target: i386-pc-linux
>> >>
>> >>         pushl   %ebp
>> >>         movl    %esp, %ebp
>> >>         subl    $8, %esp
>> >>         movl    8(%ebp), %eax
>> >>         movl    $0, -4(%ebp)
>> >>         movl    8(%ebp), %ecx
>> >>         movl    %ecx, %edx
>> >>         shrl    %edx
>> >>         andl    $1431655765, %edx       # imm = 0x55555555
>> >>         subl    %edx, %ecx
>> >>         movl    %ecx, %edx
>> >>         andl    $858993459, %edx        # imm = 0x33333333
>> >>         shrl    $2, %ecx
>> >>         andl    $858993459, %ecx        # imm = 0x33333333
>> >>         addl    %ecx, %edx
>> >>         movl    %edx, %ecx
>> >>         shrl    $4, %ecx
>> >>         addl    %ecx, %edx
>> >>         andl    $252645135, %edx        # imm = 0xF0F0F0F
>> >>         imull   $16843009, %edx, %ecx   # imm = 0x1010101
>> >>         shrl    $24, %ecx
>> >>         andl    $1, %ecx
>> >>         movl    %eax, -8(%ebp)          # 4-byte Spill
>> >>         movl    %ecx, %eax
>> >>         addl    $8, %esp
>> >>         popl    %ebp
>> >>         retl
>> >>
>> >>
>> >> clang -o- -target amd64-pc-linux -S bug.c
>> >>
>> >>         pushq   %rbp
>> >>         .cfi_def_cfa_offset 16
>> >>         .cfi_offset %rbp, -16
>> >>         movq    %rsp, %rbp
>> >>         .cfi_def_cfa_register %rbp
>> >>         movl    $0, -4(%rbp)
>> >>         movl    %edi, -8(%rbp)
>> >>         movl    -8(%rbp), %eax
>> >>         movl    %eax, %ecx
>> >>         shrl    %ecx
>> >>         andl    $1431655765, %ecx       # imm = 0x55555555
>> >>         subl    %ecx, %eax
>> >>         movl    %eax, %ecx
>> >>         andl    $858993459, %ecx        # imm = 0x33333333
>> >>         shrl    $2, %eax
>> >>         andl    $858993459, %eax        # imm = 0x33333333
>> >>         addl    %eax, %ecx
>> >>         movl    %ecx, %eax
>> >>         shrl    $4, %eax
>> >>         addl    %eax, %ecx
>> >>         andl    $252645135, %ecx        # imm = 0xF0F0F0F
>> >>         imull   $16843009, %ecx, %eax   # imm = 0x1010101
>> >>         shrl    $24, %eax
>> >>         andl    $1, %eax
>> >>         popq    %rbp
>> >>         .cfi_def_cfa %rsp, 8
>> >>         retq
>> >>
>> >> JFTR: this is the same unoptimised code as shipped in the builtins
>> library!
>> >>
>> >> Stefan
>> >>
>> >> > On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <
>> stefan.kanthak at nexgo.de>
>> >> > wrote:
>> >> >
>> >> >> "Craig Topper" <craig.topper at gmail.com> wrote;
>> >> >>
>> >> >>
>> >> >>
>> >> >> > Clang never generates calls to ___paritysi2, ___paritydi2,
>> ___cmpdi2,
>> >> or
>> >> >>
>> >> >> > ___ucmpdi2 on X86 so its not clear the performance of this
>> matters at
>> >> >> all.
>> >> >>
>> >> >>
>> >> >>
>> >> >> So you can safely remove them for X86 and all the other targets
>> where
>> >> such
>> >> >>
>> >> >> unoptimized code is never called!
>> >> >>
>> >> >> But fix these routines for targets where they are called.
>> >> >>
>> >> >>
>> >> >>
>> >> >> The statement does NOT make any exceptions, and it does not say
>> >> >>
>> >> >> | ships unoptimized routines the compiler never calls
>> >> >>
>> >> >> but
>> >> >>
>> >> >> | optimized target-independent implementations
>> >> >>
>> >> >>
>> >> >>
>> >> >> Stefan
>> >> >>
>> >> >>
>> >> >>
>> >> >> BTW: do builtins like __builtin_*parity* exist?
>> >> >>
>> >> >>      If yes: do they generate the same bad code?
>> >> >>
>> >> >>
>> >> >>
>> >> >> > On Sun, Sep 6, 2020 at 12:31 PM Stefan Kanthak via cfe-dev <
>> >> >>
>> >> >> > cfe-dev at lists.llvm.org> wrote:
>> >> >>
>> >> >> >
>> >> >>
>> >> >> >> <https://compiler-rt.llvm.org/index.html> boasts:
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> | The builtins library provides optimized implementations of this
>> >> >>
>> >> >> >> | and other low-level routines, either in target-independent C
>> form,
>> >> >>
>> >> >> >> | or as a heavily-optimized assembly.
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> Really?
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> Left: inperformant code shipped in    # Right: slightly improved
>> >> code,
>> >> >>
>> >> >> >>       clang_rt.builtins-*             #        which the
>> optimiser
>> >> >> REALLY
>> >> >>
>> >> >> >>                                       #        should have
>> generated
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> ___cmpdi2:
>> >> >>
>> >> >> >>         mov     ecx, [esp+16]         #       mov     ecx,
>> [esp+16]
>> >> >>
>> >> >> >>         xor     eax, eax              #       xor     eax, eax
>> >> >>
>> >> >> >>         cmp     [esp+8], ecx          #       cmp     ecx,
>> [esp+8]
>> >> >>
>> >> >> >>         jl      @f                    #       jg      @f
>> >> >>
>> >> >> >>         mov     eax, 2                #       mov     eax, 2
>> >> >>
>> >> >> >>         jg      @f                    #       jl      @f
>> >> >>
>> >> >> >>         mov     ecx, [esp+4]          #
>> >> >>
>> >> >> >>         mov     edx, [esp+12]         #       mov     ecx,
>> [esp+12]
>> >> >>
>> >> >> >>         mov     eax, 0                #       xor     eax, eax
>> >> >>
>> >> >> >>         cmp     ecx, edx              #       cmp     ecx,
>> [esp+4]
>> >> >>
>> >> >> >>         jb      @f                    #       ja      @f
>> >> >>
>> >> >> >>         cmp     edx, ecx              #
>> >> >>
>> >> >> >>         mov     eax, 1                #
>> >> >>
>> >> >> >>         adc     eax, 0                #       adc     eax, 1
>> >> >>
>> >> >> >> @@:                                   # @@:
>> >> >>
>> >> >> >>         ret                           #       ret
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >>                                       # 3 instructions less, 10
>> bytes
>> >> >> saved
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> ___ucmpdi2:
>> >> >>
>> >> >> >>         mov     ecx, [esp+16]         #       mov     ecx,
>> [esp+16]
>> >> >>
>> >> >> >>         xor     eax, eax              #       xor     eax, eax
>> >> >>
>> >> >> >>         cmp     [esp+8], ecx          #       cmp     ecx,
>> [esp+8]
>> >> >>
>> >> >> >>         jb      @f                    #       ja      @f
>> >> >>
>> >> >> >>         mov     eax, 2                #       mov     eax, 2
>> >> >>
>> >> >> >>         ja      @f                    #       jb      @f
>> >> >>
>> >> >> >>         mov     ecx, [esp+4]          #
>> >> >>
>> >> >> >>         mov     edx, [esp+12]         #       mov     ecx,
>> [esp+12]
>> >> >>
>> >> >> >>         mov     eax, 0                #       xor     eax, eax
>> >> >>
>> >> >> >>         cmp     ecx, edx              #       cmp     ecx,
>> [esp+4]
>> >> >>
>> >> >> >>         jb      @f                    #       ja      @f
>> >> >>
>> >> >> >>         cmp     edx, ecx              #
>> >> >>
>> >> >> >>         mov     eax, 1                #
>> >> >>
>> >> >> >>         adc     eax, 0                #       adc     eax, 1
>> >> >>
>> >> >> >> @@:                                   # @@:
>> >> >>
>> >> >> >>         ret                           #       ret
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >>                                       # 3 instructions less, 10
>> bytes
>> >> >> saved
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> Now properly written code, of course branch-free, faster and
>> shorter:
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> # Copyright (C) 2004-2020, Stefan Kanthak <
>> stefan.kanthak at nexgo.de>
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> ___cmpdi2:
>> >> >>
>> >> >> >>         mov     ecx, [esp+4]
>> >> >>
>> >> >> >>         mov     edx, [esp+12]
>> >> >>
>> >> >> >>         cmp     ecx, edx
>> >> >>
>> >> >> >>         mov     eax, [esp+8]
>> >> >>
>> >> >> >>         sbb     eax, [esp+16]
>> >> >>
>> >> >> >>         setl    ah
>> >> >>
>> >> >> >>         cmp     edx, ecx
>> >> >>
>> >> >> >>         mov     edx, [esp+16]
>> >> >>
>> >> >> >>         sbb     edx, [esp+8]
>> >> >>
>> >> >> >>         setl    al
>> >> >>
>> >> >> >>         sub     al, ah
>> >> >>
>> >> >> >>         movsx   eax, al
>> >> >>
>> >> >> >>         inc     eax
>> >> >>
>> >> >> >>         ret
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> ___ucmpdi2:
>> >> >>
>> >> >> >>         mov     ecx, [esp+4]
>> >> >>
>> >> >> >>         mov     edx, [esp+12]
>> >> >>
>> >> >> >>         cmp     ecx, edx
>> >> >>
>> >> >> >>         mov     eax, [esp+8]
>> >> >>
>> >> >> >>         sbb     eax, [esp+16]
>> >> >>
>> >> >> >>         sbb     eax, eax
>> >> >>
>> >> >> >>         cmp     edx, ecx
>> >> >>
>> >> >> >>         mov     edx, [esp+16]
>> >> >>
>> >> >> >>         sbb     edx, [esp+8]
>> >> >>
>> >> >> >>         adc     eax, 1
>> >> >>
>> >> >> >>         ret
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> AGAIN:
>> >> >>
>> >> >> >> Remove every occurance of the word "optimized" on the above web
>> page.
>> >> >>
>> >> >> >>
>> >> >>
>> >> >> >> 'nuff said
>> >> >>
>> >> >> >> Stefan
>> >> >>
>> >> >> >> _______________________________________________
>> >> >>
>> >> >> >> cfe-dev mailing list
>> >> >>
>> >> >> >> cfe-dev at lists.llvm.org
>> >> >>
>> >> >> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>> >> >>
>> >> >> >>
>> >> >>
>> >> >>
>> >> >
>> >> > --
>> >> > ~Craig
>> >> >
>> >>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200909/90691d3a/attachment-0001.html>