[cfe-dev] "Optimized implementations"?

Wed Sep 9 11:03:02 PDT 2020

Those two lines of code are slightly different. The first asumes sr to be
0-63. The second allows sr to be 0-127. So we might end up with a runtime
check on bit 6 and a cmove to handle 127-64 and 63-0 differently. Maybe the
compiler figures out from the surrounding code that bit 6 is 0, but I'd
have to check.

   foo.high = (foo.high << sr) | (foo.low >> (64 - sr));
instead of just
    foo.all <<= sr,

~Craig

On Wed, Sep 9, 2020 at 10:53 AM Stefan Kanthak <stefan.kanthak at nexgo.de>
wrote:

> "Craig Topper" <craig.topper at gmail.com> wrote:
>
> > Currently we skip the build of __udivmodti4(), __multi3(), __modti3()
> etc.
> > when compiling compiler-rt with MSVC. The problem is more than just the
> > ABI. A bunch of the C code also assumes for example that you can do >> on
> > __uint128_t.
>
> Correct.
> OTOH, at least udivmodti4.c contains a whole lot of rather ugly "compound"
> shifts
>     foo.high = (foo.high << sr) | (foo.low >> (64 - sr));
> instead of just
>     foo.all <<= sr,
>
> The same holds of course for udivmoddi4.c too!
> JFTR: ANSI/ISO C introduced 64-bit data types in the last millennium.
>
> Again my question: who reviews such code?
> Let alone its poor^Wdisastrous performance...
>
> > But if the __uint128_t is a struct as would be needed for MSVC
> > that code won't compile. Similar for addition, subtraction, etc.
>
> I know (not that I tried, reading alone is sufficient to notice that).
>
> > Relevant code from the compiler-rt where we disable 128-bit.
> >
> > // MSVC doesn't have a working 128bit integer type. Users should really
> > compile
> > // compiler-rt with clang, but if they happen to be doing a standalone
> > build for
> > // asan or something else, disable the 128 bit parts so things sort of
> work.
> > #if defined(_MSC_VER) && !defined(__clang__)
> > #undef CRT_HAS_128BIT
> > #endif
>
> I looked only at COMPILER_RT_ABI and didn't notice that #undef
>
> Stefan
>
> > On Wed, Sep 9, 2020 at 10:27 AM Stefan Kanthak <stefan.kanthak at nexgo.de>
> > wrote:
> >
> >> "Craig Topper" <craig.topper at gmail.com> wrote:
> >>
> >> I should have added "rhetorical" to my alternative question. Sorry.
> >> The main question is but still open: is there a (deeper) reason to not
> >> always generate the optimised code for __builtins_*?
> >>
> >> > The code in the builtins library is written in C and can't use
> >> > __builtin_parity since it needs to compile with compilers that don't
> have
> >> > __builtin_parity like MSVC.
> >>
> >> Aahhhhh: how do you build the __udivmodti4(), __multi3(), __modti3()
> etc.
> >> routines for clang_rt.builtins-x86_64.lib with MSVC?
> >> As you know by now, there are some fidgety but strictly necessary
> details
> >> missing in the declaration of functions returning __uint128_t or
> >> __int128_t!
> >>
> >> regards
> >> Stefan
> >>
> >> > So to get the optimized code, we would have to
> >> > have a version written in assembly for X86. Probably two assembly
> >> versions
> >> > since MASM and gcc use different assembly syntax. So we'd have 3
> >> different
> >> > versions of parity for 3 different bit widths. Maybe some macros could
> >> > allow us to share some bit widths or something. But ultimately it was
> a
> >> > question of where to spend effort.
> >> >
> >> > ~Craig
> >> >
> >> >
> >> > On Wed, Sep 9, 2020 at 9:08 AM Stefan Kanthak <
> stefan.kanthak at nexgo.de>
> >> > wrote:
> >> >
> >> >> "Craig Topper" <craig.topper at gmail.com> wrote:
> >> >>
> >> >> > Turn on optimizations.
> >> >>
> >> >> What's the (deeper) reason NOT always to generate the optimised code
> for
> >> >> __builtins_*?
> >> >> Alternative question: do you also ship a library with unoptimised
> code
> >> >> which
> >> >> is linked when the -O option is not given?
> >> >>
> >> >> Just wondering about surprising behaviour
> >> >> Stefan
> >> >>
> >> >> > On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <
> >> stefan.kanthak at nexgo.de>
> >> >> > wrote:
> >> >> >
> >> >> >> "Craig Topper" <craig.topper at gmail.com> wrote:
> >> >> >>
> >> >> >> > __builtin_parity uses setnp on older x86 and popcnt with sse4.2
> >> >> >>
> >> >> >> Reality check, PLEASE:
> >> >> >>
> >> >> >> --- bug.c ---
> >> >> >> int main(int argc, char *argv[]) {
> >> >> >>     return __builtin_parity(argc);
> >> >> >> }
> >> >> >> --- EOF ---
> >> >> >>
> >> >> >> clang -o- -target i386-pc-linux -S bug.c
> >> >> >> clang version 10.0.0
> >> >> >> Target: i386-pc-linux
> >> >> >>
> >> >> >>         pushl   %ebp
> >> >> >>         movl    %esp, %ebp
> >> >> >>         subl    $8, %esp
> >> >> >>         movl    8(%ebp), %eax
> >> >> >>         movl    $0, -4(%ebp)
> >> >> >>         movl    8(%ebp), %ecx
> >> >> >>         movl    %ecx, %edx
> >> >> >>         shrl    %edx
> >> >> >>         andl    $1431655765, %edx       # imm = 0x55555555
> >> >> >>         subl    %edx, %ecx
> >> >> >>         movl    %ecx, %edx
> >> >> >>         andl    $858993459, %edx        # imm = 0x33333333
> >> >> >>         shrl    $2, %ecx
> >> >> >>         andl    $858993459, %ecx        # imm = 0x33333333
> >> >> >>         addl    %ecx, %edx
> >> >> >>         movl    %edx, %ecx
> >> >> >>         shrl    $4, %ecx
> >> >> >>         addl    %ecx, %edx
> >> >> >>         andl    $252645135, %edx        # imm = 0xF0F0F0F
> >> >> >>         imull   $16843009, %edx, %ecx   # imm = 0x1010101
> >> >> >>         shrl    $24, %ecx
> >> >> >>         andl    $1, %ecx
> >> >> >>         movl    %eax, -8(%ebp)          # 4-byte Spill
> >> >> >>         movl    %ecx, %eax
> >> >> >>         addl    $8, %esp
> >> >> >>         popl    %ebp
> >> >> >>         retl
> >> >> >>
> >> >> >>
> >> >> >> clang -o- -target amd64-pc-linux -S bug.c
> >> >> >>
> >> >> >>         pushq   %rbp
> >> >> >>         .cfi_def_cfa_offset 16
> >> >> >>         .cfi_offset %rbp, -16
> >> >> >>         movq    %rsp, %rbp
> >> >> >>         .cfi_def_cfa_register %rbp
> >> >> >>         movl    $0, -4(%rbp)
> >> >> >>         movl    %edi, -8(%rbp)
> >> >> >>         movl    -8(%rbp), %eax
> >> >> >>         movl    %eax, %ecx
> >> >> >>         shrl    %ecx
> >> >> >>         andl    $1431655765, %ecx       # imm = 0x55555555
> >> >> >>         subl    %ecx, %eax
> >> >> >>         movl    %eax, %ecx
> >> >> >>         andl    $858993459, %ecx        # imm = 0x33333333
> >> >> >>         shrl    $2, %eax
> >> >> >>         andl    $858993459, %eax        # imm = 0x33333333
> >> >> >>         addl    %eax, %ecx
> >> >> >>         movl    %ecx, %eax
> >> >> >>         shrl    $4, %eax
> >> >> >>         addl    %eax, %ecx
> >> >> >>         andl    $252645135, %ecx        # imm = 0xF0F0F0F
> >> >> >>         imull   $16843009, %ecx, %eax   # imm = 0x1010101
> >> >> >>         shrl    $24, %eax
> >> >> >>         andl    $1, %eax
> >> >> >>         popq    %rbp
> >> >> >>         .cfi_def_cfa %rsp, 8
> >> >> >>         retq
> >> >> >>
> >> >> >> JFTR: this is the same unoptimised code as shipped in the builtins
> >> >> library!
> >> >> >>
> >> >> >> Stefan
> >> >> >>
> >> >> >> > On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <
> >> >> stefan.kanthak at nexgo.de>
> >> >> >> > wrote:
> >> >> >> >
> >> >> >> >> "Craig Topper" <craig.topper at gmail.com> wrote;
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> > Clang never generates calls to ___paritysi2, ___paritydi2,
> >> >> ___cmpdi2,
> >> >> >> or
> >> >> >> >>
> >> >> >> >> > ___ucmpdi2 on X86 so its not clear the performance of this
> >> matters
> >> >> at
> >> >> >> >> all.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> So you can safely remove them for X86 and all the other targets
> >> where
> >> >> >> such
> >> >> >> >>
> >> >> >> >> unoptimized code is never called!
> >> >> >> >>
> >> >> >> >> But fix these routines for targets where they are called.
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> The statement does NOT make any exceptions, and it does not say
> >> >> >> >>
> >> >> >> >> | ships unoptimized routines the compiler never calls
> >> >> >> >>
> >> >> >> >> but
> >> >> >> >>
> >> >> >> >> | optimized target-independent implementations
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> Stefan
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> BTW: do builtins like __builtin_*parity* exist?
> >> >> >> >>
> >> >> >> >>      If yes: do they generate the same bad code?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> > On Sun, Sep 6, 2020 at 12:31 PM Stefan Kanthak via cfe-dev <
> >> >> >> >>
> >> >> >> >> > cfe-dev at lists.llvm.org> wrote:
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> >> <https://compiler-rt.llvm.org/index.html> boasts:
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> | The builtins library provides optimized implementations of
> >> this
> >> >> >> >>
> >> >> >> >> >> | and other low-level routines, either in
> target-independent C
> >> >> form,
> >> >> >> >>
> >> >> >> >> >> | or as a heavily-optimized assembly.
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> Really?
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> Left: inperformant code shipped in    # Right: slightly
> >> improved
> >> >> >> code,
> >> >> >> >>
> >> >> >> >> >>       clang_rt.builtins-*             #        which the
> >> optimiser
> >> >> >> >> REALLY
> >> >> >> >>
> >> >> >> >> >>                                       #        should have
> >> >> generated
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> ___cmpdi2:
> >> >> >> >>
> >> >> >> >> >>         mov     ecx, [esp+16]         #       mov     ecx,
> >> >> [esp+16]
> >> >> >> >>
> >> >> >> >> >>         xor     eax, eax              #       xor     eax,
> eax
> >> >> >> >>
> >> >> >> >> >>         cmp     [esp+8], ecx          #       cmp     ecx,
> >> [esp+8]
> >> >> >> >>
> >> >> >> >> >>         jl      @f                    #       jg      @f
> >> >> >> >>
> >> >> >> >> >>         mov     eax, 2                #       mov     eax, 2
> >> >> >> >>
> >> >> >> >> >>         jg      @f                    #       jl      @f
> >> >> >> >>
> >> >> >> >> >>         mov     ecx, [esp+4]          #
> >> >> >> >>
> >> >> >> >> >>         mov     edx, [esp+12]         #       mov     ecx,
> >> >> [esp+12]
> >> >> >> >>
> >> >> >> >> >>         mov     eax, 0                #       xor     eax,
> eax
> >> >> >> >>
> >> >> >> >> >>         cmp     ecx, edx              #       cmp     ecx,
> >> [esp+4]
> >> >> >> >>
> >> >> >> >> >>         jb      @f                    #       ja      @f
> >> >> >> >>
> >> >> >> >> >>         cmp     edx, ecx              #
> >> >> >> >>
> >> >> >> >> >>         mov     eax, 1                #
> >> >> >> >>
> >> >> >> >> >>         adc     eax, 0                #       adc     eax, 1
> >> >> >> >>
> >> >> >> >> >> @@:                                   # @@:
> >> >> >> >>
> >> >> >> >> >>         ret                           #       ret
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >>                                       # 3 instructions
> less, 10
> >> >> bytes
> >> >> >> >> saved
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> ___ucmpdi2:
> >> >> >> >>
> >> >> >> >> >>         mov     ecx, [esp+16]         #       mov     ecx,
> >> >> [esp+16]
> >> >> >> >>
> >> >> >> >> >>         xor     eax, eax              #       xor     eax,
> eax
> >> >> >> >>
> >> >> >> >> >>         cmp     [esp+8], ecx          #       cmp     ecx,
> >> [esp+8]
> >> >> >> >>
> >> >> >> >> >>         jb      @f                    #       ja      @f
> >> >> >> >>
> >> >> >> >> >>         mov     eax, 2                #       mov     eax, 2
> >> >> >> >>
> >> >> >> >> >>         ja      @f                    #       jb      @f
> >> >> >> >>
> >> >> >> >> >>         mov     ecx, [esp+4]          #
> >> >> >> >>
> >> >> >> >> >>         mov     edx, [esp+12]         #       mov     ecx,
> >> >> [esp+12]
> >> >> >> >>
> >> >> >> >> >>         mov     eax, 0                #       xor     eax,
> eax
> >> >> >> >>
> >> >> >> >> >>         cmp     ecx, edx              #       cmp     ecx,
> >> [esp+4]
> >> >> >> >>
> >> >> >> >> >>         jb      @f                    #       ja      @f
> >> >> >> >>
> >> >> >> >> >>         cmp     edx, ecx              #
> >> >> >> >>
> >> >> >> >> >>         mov     eax, 1                #
> >> >> >> >>
> >> >> >> >> >>         adc     eax, 0                #       adc     eax, 1
> >> >> >> >>
> >> >> >> >> >> @@:                                   # @@:
> >> >> >> >>
> >> >> >> >> >>         ret                           #       ret
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >>                                       # 3 instructions
> less, 10
> >> >> bytes
> >> >> >> >> saved
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> Now properly written code, of course branch-free, faster and
> >> >> shorter:
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> # Copyright (C) 2004-2020, Stefan Kanthak <
> >> >> stefan.kanthak at nexgo.de>
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> ___cmpdi2:
> >> >> >> >>
> >> >> >> >> >>         mov     ecx, [esp+4]
> >> >> >> >>
> >> >> >> >> >>         mov     edx, [esp+12]
> >> >> >> >>
> >> >> >> >> >>         cmp     ecx, edx
> >> >> >> >>
> >> >> >> >> >>         mov     eax, [esp+8]
> >> >> >> >>
> >> >> >> >> >>         sbb     eax, [esp+16]
> >> >> >> >>
> >> >> >> >> >>         setl    ah
> >> >> >> >>
> >> >> >> >> >>         cmp     edx, ecx
> >> >> >> >>
> >> >> >> >> >>         mov     edx, [esp+16]
> >> >> >> >>
> >> >> >> >> >>         sbb     edx, [esp+8]
> >> >> >> >>
> >> >> >> >> >>         setl    al
> >> >> >> >>
> >> >> >> >> >>         sub     al, ah
> >> >> >> >>
> >> >> >> >> >>         movsx   eax, al
> >> >> >> >>
> >> >> >> >> >>         inc     eax
> >> >> >> >>
> >> >> >> >> >>         ret
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> ___ucmpdi2:
> >> >> >> >>
> >> >> >> >> >>         mov     ecx, [esp+4]
> >> >> >> >>
> >> >> >> >> >>         mov     edx, [esp+12]
> >> >> >> >>
> >> >> >> >> >>         cmp     ecx, edx
> >> >> >> >>
> >> >> >> >> >>         mov     eax, [esp+8]
> >> >> >> >>
> >> >> >> >> >>         sbb     eax, [esp+16]
> >> >> >> >>
> >> >> >> >> >>         sbb     eax, eax
> >> >> >> >>
> >> >> >> >> >>         cmp     edx, ecx
> >> >> >> >>
> >> >> >> >> >>         mov     edx, [esp+16]
> >> >> >> >>
> >> >> >> >> >>         sbb     edx, [esp+8]
> >> >> >> >>
> >> >> >> >> >>         adc     eax, 1
> >> >> >> >>
> >> >> >> >> >>         ret
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> AGAIN:
> >> >> >> >>
> >> >> >> >> >> Remove every occurance of the word "optimized" on the above
> web
> >> >> page.
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >> 'nuff said
> >> >> >> >>
> >> >> >> >> >> Stefan
> >> >> >> >>
> >> >> >> >> >> _______________________________________________
> >> >> >> >>
> >> >> >> >> >> cfe-dev mailing list
> >> >> >> >>
> >> >> >> >> >> cfe-dev at lists.llvm.org
> >> >> >> >>
> >> >> >> >> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> > --
> >> >> >> > ~Craig
> >> >> >> >
> >> >> >>
> >> >>
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200909/e9d81134/attachment-0001.html>