[cfe-dev] "Optimized implementations"?
Craig Topper via cfe-dev
cfe-dev at lists.llvm.org
Wed Sep 9 11:03:40 PDT 2020
Sorry the first allows sr to be 1-63. The 64-sr makes 0 not allowed.
~Craig
On Wed, Sep 9, 2020 at 11:03 AM Craig Topper <craig.topper at gmail.com> wrote:
> Those two lines of code are slightly different. The first asumes sr to be
> 0-63. The second allows sr to be 0-127. So we might end up with a runtime
> check on bit 6 and a cmove to handle 127-64 and 63-0 differently. Maybe the
> compiler figures out from the surrounding code that bit 6 is 0, but I'd
> have to check.
>
> foo.high = (foo.high << sr) | (foo.low >> (64 - sr));
> instead of just
> foo.all <<= sr,
>
>
> ~Craig
>
>
> On Wed, Sep 9, 2020 at 10:53 AM Stefan Kanthak <stefan.kanthak at nexgo.de>
> wrote:
>
>> "Craig Topper" <craig.topper at gmail.com> wrote:
>>
>> > Currently we skip the build of __udivmodti4(), __multi3(), __modti3()
>> etc.
>> > when compiling compiler-rt with MSVC. The problem is more than just the
>> > ABI. A bunch of the C code also assumes for example that you can do >>
>> on
>> > __uint128_t.
>>
>> Correct.
>> OTOH, at least udivmodti4.c contains a whole lot of rather ugly "compound"
>> shifts
>> foo.high = (foo.high << sr) | (foo.low >> (64 - sr));
>> instead of just
>> foo.all <<= sr,
>>
>> The same holds of course for udivmoddi4.c too!
>> JFTR: ANSI/ISO C introduced 64-bit data types in the last millennium.
>>
>> Again my question: who reviews such code?
>> Let alone its poor^Wdisastrous performance...
>>
>> > But if the __uint128_t is a struct as would be needed for MSVC
>> > that code won't compile. Similar for addition, subtraction, etc.
>>
>> I know (not that I tried, reading alone is sufficient to notice that).
>>
>> > Relevant code from the compiler-rt where we disable 128-bit.
>> >
>> > // MSVC doesn't have a working 128bit integer type. Users should really
>> > compile
>> > // compiler-rt with clang, but if they happen to be doing a standalone
>> > build for
>> > // asan or something else, disable the 128 bit parts so things sort of
>> work.
>> > #if defined(_MSC_VER) && !defined(__clang__)
>> > #undef CRT_HAS_128BIT
>> > #endif
>>
>> I looked only at COMPILER_RT_ABI and didn't notice that #undef
>>
>> Stefan
>>
>> > On Wed, Sep 9, 2020 at 10:27 AM Stefan Kanthak <stefan.kanthak at nexgo.de
>> >
>> > wrote:
>> >
>> >> "Craig Topper" <craig.topper at gmail.com> wrote:
>> >>
>> >> I should have added "rhetorical" to my alternative question. Sorry.
>> >> The main question is but still open: is there a (deeper) reason to not
>> >> always generate the optimised code for __builtins_*?
>> >>
>> >> > The code in the builtins library is written in C and can't use
>> >> > __builtin_parity since it needs to compile with compilers that don't
>> have
>> >> > __builtin_parity like MSVC.
>> >>
>> >> Aahhhhh: how do you build the __udivmodti4(), __multi3(), __modti3()
>> etc.
>> >> routines for clang_rt.builtins-x86_64.lib with MSVC?
>> >> As you know by now, there are some fidgety but strictly necessary
>> details
>> >> missing in the declaration of functions returning __uint128_t or
>> >> __int128_t!
>> >>
>> >> regards
>> >> Stefan
>> >>
>> >> > So to get the optimized code, we would have to
>> >> > have a version written in assembly for X86. Probably two assembly
>> >> versions
>> >> > since MASM and gcc use different assembly syntax. So we'd have 3
>> >> different
>> >> > versions of parity for 3 different bit widths. Maybe some macros
>> could
>> >> > allow us to share some bit widths or something. But ultimately it
>> was a
>> >> > question of where to spend effort.
>> >> >
>> >> > ~Craig
>> >> >
>> >> >
>> >> > On Wed, Sep 9, 2020 at 9:08 AM Stefan Kanthak <
>> stefan.kanthak at nexgo.de>
>> >> > wrote:
>> >> >
>> >> >> "Craig Topper" <craig.topper at gmail.com> wrote:
>> >> >>
>> >> >> > Turn on optimizations.
>> >> >>
>> >> >> What's the (deeper) reason NOT always to generate the optimised
>> code for
>> >> >> __builtins_*?
>> >> >> Alternative question: do you also ship a library with unoptimised
>> code
>> >> >> which
>> >> >> is linked when the -O option is not given?
>> >> >>
>> >> >> Just wondering about surprising behaviour
>> >> >> Stefan
>> >> >>
>> >> >> > On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <
>> >> stefan.kanthak at nexgo.de>
>> >> >> > wrote:
>> >> >> >
>> >> >> >> "Craig Topper" <craig.topper at gmail.com> wrote:
>> >> >> >>
>> >> >> >> > __builtin_parity uses setnp on older x86 and popcnt with sse4.2
>> >> >> >>
>> >> >> >> Reality check, PLEASE:
>> >> >> >>
>> >> >> >> --- bug.c ---
>> >> >> >> int main(int argc, char *argv[]) {
>> >> >> >> return __builtin_parity(argc);
>> >> >> >> }
>> >> >> >> --- EOF ---
>> >> >> >>
>> >> >> >> clang -o- -target i386-pc-linux -S bug.c
>> >> >> >> clang version 10.0.0
>> >> >> >> Target: i386-pc-linux
>> >> >> >>
>> >> >> >> pushl %ebp
>> >> >> >> movl %esp, %ebp
>> >> >> >> subl $8, %esp
>> >> >> >> movl 8(%ebp), %eax
>> >> >> >> movl $0, -4(%ebp)
>> >> >> >> movl 8(%ebp), %ecx
>> >> >> >> movl %ecx, %edx
>> >> >> >> shrl %edx
>> >> >> >> andl $1431655765, %edx # imm = 0x55555555
>> >> >> >> subl %edx, %ecx
>> >> >> >> movl %ecx, %edx
>> >> >> >> andl $858993459, %edx # imm = 0x33333333
>> >> >> >> shrl $2, %ecx
>> >> >> >> andl $858993459, %ecx # imm = 0x33333333
>> >> >> >> addl %ecx, %edx
>> >> >> >> movl %edx, %ecx
>> >> >> >> shrl $4, %ecx
>> >> >> >> addl %ecx, %edx
>> >> >> >> andl $252645135, %edx # imm = 0xF0F0F0F
>> >> >> >> imull $16843009, %edx, %ecx # imm = 0x1010101
>> >> >> >> shrl $24, %ecx
>> >> >> >> andl $1, %ecx
>> >> >> >> movl %eax, -8(%ebp) # 4-byte Spill
>> >> >> >> movl %ecx, %eax
>> >> >> >> addl $8, %esp
>> >> >> >> popl %ebp
>> >> >> >> retl
>> >> >> >>
>> >> >> >>
>> >> >> >> clang -o- -target amd64-pc-linux -S bug.c
>> >> >> >>
>> >> >> >> pushq %rbp
>> >> >> >> .cfi_def_cfa_offset 16
>> >> >> >> .cfi_offset %rbp, -16
>> >> >> >> movq %rsp, %rbp
>> >> >> >> .cfi_def_cfa_register %rbp
>> >> >> >> movl $0, -4(%rbp)
>> >> >> >> movl %edi, -8(%rbp)
>> >> >> >> movl -8(%rbp), %eax
>> >> >> >> movl %eax, %ecx
>> >> >> >> shrl %ecx
>> >> >> >> andl $1431655765, %ecx # imm = 0x55555555
>> >> >> >> subl %ecx, %eax
>> >> >> >> movl %eax, %ecx
>> >> >> >> andl $858993459, %ecx # imm = 0x33333333
>> >> >> >> shrl $2, %eax
>> >> >> >> andl $858993459, %eax # imm = 0x33333333
>> >> >> >> addl %eax, %ecx
>> >> >> >> movl %ecx, %eax
>> >> >> >> shrl $4, %eax
>> >> >> >> addl %eax, %ecx
>> >> >> >> andl $252645135, %ecx # imm = 0xF0F0F0F
>> >> >> >> imull $16843009, %ecx, %eax # imm = 0x1010101
>> >> >> >> shrl $24, %eax
>> >> >> >> andl $1, %eax
>> >> >> >> popq %rbp
>> >> >> >> .cfi_def_cfa %rsp, 8
>> >> >> >> retq
>> >> >> >>
>> >> >> >> JFTR: this is the same unoptimised code as shipped in the
>> builtins
>> >> >> library!
>> >> >> >>
>> >> >> >> Stefan
>> >> >> >>
>> >> >> >> > On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <
>> >> >> stefan.kanthak at nexgo.de>
>> >> >> >> > wrote:
>> >> >> >> >
>> >> >> >> >> "Craig Topper" <craig.topper at gmail.com> wrote;
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > Clang never generates calls to ___paritysi2, ___paritydi2,
>> >> >> ___cmpdi2,
>> >> >> >> or
>> >> >> >> >>
>> >> >> >> >> > ___ucmpdi2 on X86 so its not clear the performance of this
>> >> matters
>> >> >> at
>> >> >> >> >> all.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> So you can safely remove them for X86 and all the other
>> targets
>> >> where
>> >> >> >> such
>> >> >> >> >>
>> >> >> >> >> unoptimized code is never called!
>> >> >> >> >>
>> >> >> >> >> But fix these routines for targets where they are called.
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> The statement does NOT make any exceptions, and it does not
>> say
>> >> >> >> >>
>> >> >> >> >> | ships unoptimized routines the compiler never calls
>> >> >> >> >>
>> >> >> >> >> but
>> >> >> >> >>
>> >> >> >> >> | optimized target-independent implementations
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> Stefan
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> BTW: do builtins like __builtin_*parity* exist?
>> >> >> >> >>
>> >> >> >> >> If yes: do they generate the same bad code?
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> > On Sun, Sep 6, 2020 at 12:31 PM Stefan Kanthak via cfe-dev <
>> >> >> >> >>
>> >> >> >> >> > cfe-dev at lists.llvm.org> wrote:
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >>
>> >> >> >> >> >> <https://compiler-rt.llvm.org/index.html> boasts:
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> | The builtins library provides optimized implementations
>> of
>> >> this
>> >> >> >> >>
>> >> >> >> >> >> | and other low-level routines, either in
>> target-independent C
>> >> >> form,
>> >> >> >> >>
>> >> >> >> >> >> | or as a heavily-optimized assembly.
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> Really?
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> Left: inperformant code shipped in # Right: slightly
>> >> improved
>> >> >> >> code,
>> >> >> >> >>
>> >> >> >> >> >> clang_rt.builtins-* # which the
>> >> optimiser
>> >> >> >> >> REALLY
>> >> >> >> >>
>> >> >> >> >> >> # should have
>> >> >> generated
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> ___cmpdi2:
>> >> >> >> >>
>> >> >> >> >> >> mov ecx, [esp+16] # mov ecx,
>> >> >> [esp+16]
>> >> >> >> >>
>> >> >> >> >> >> xor eax, eax # xor eax,
>> eax
>> >> >> >> >>
>> >> >> >> >> >> cmp [esp+8], ecx # cmp ecx,
>> >> [esp+8]
>> >> >> >> >>
>> >> >> >> >> >> jl @f # jg @f
>> >> >> >> >>
>> >> >> >> >> >> mov eax, 2 # mov eax,
>> 2
>> >> >> >> >>
>> >> >> >> >> >> jg @f # jl @f
>> >> >> >> >>
>> >> >> >> >> >> mov ecx, [esp+4] #
>> >> >> >> >>
>> >> >> >> >> >> mov edx, [esp+12] # mov ecx,
>> >> >> [esp+12]
>> >> >> >> >>
>> >> >> >> >> >> mov eax, 0 # xor eax,
>> eax
>> >> >> >> >>
>> >> >> >> >> >> cmp ecx, edx # cmp ecx,
>> >> [esp+4]
>> >> >> >> >>
>> >> >> >> >> >> jb @f # ja @f
>> >> >> >> >>
>> >> >> >> >> >> cmp edx, ecx #
>> >> >> >> >>
>> >> >> >> >> >> mov eax, 1 #
>> >> >> >> >>
>> >> >> >> >> >> adc eax, 0 # adc eax,
>> 1
>> >> >> >> >>
>> >> >> >> >> >> @@: # @@:
>> >> >> >> >>
>> >> >> >> >> >> ret # ret
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> # 3 instructions
>> less, 10
>> >> >> bytes
>> >> >> >> >> saved
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> ___ucmpdi2:
>> >> >> >> >>
>> >> >> >> >> >> mov ecx, [esp+16] # mov ecx,
>> >> >> [esp+16]
>> >> >> >> >>
>> >> >> >> >> >> xor eax, eax # xor eax,
>> eax
>> >> >> >> >>
>> >> >> >> >> >> cmp [esp+8], ecx # cmp ecx,
>> >> [esp+8]
>> >> >> >> >>
>> >> >> >> >> >> jb @f # ja @f
>> >> >> >> >>
>> >> >> >> >> >> mov eax, 2 # mov eax,
>> 2
>> >> >> >> >>
>> >> >> >> >> >> ja @f # jb @f
>> >> >> >> >>
>> >> >> >> >> >> mov ecx, [esp+4] #
>> >> >> >> >>
>> >> >> >> >> >> mov edx, [esp+12] # mov ecx,
>> >> >> [esp+12]
>> >> >> >> >>
>> >> >> >> >> >> mov eax, 0 # xor eax,
>> eax
>> >> >> >> >>
>> >> >> >> >> >> cmp ecx, edx # cmp ecx,
>> >> [esp+4]
>> >> >> >> >>
>> >> >> >> >> >> jb @f # ja @f
>> >> >> >> >>
>> >> >> >> >> >> cmp edx, ecx #
>> >> >> >> >>
>> >> >> >> >> >> mov eax, 1 #
>> >> >> >> >>
>> >> >> >> >> >> adc eax, 0 # adc eax,
>> 1
>> >> >> >> >>
>> >> >> >> >> >> @@: # @@:
>> >> >> >> >>
>> >> >> >> >> >> ret # ret
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> # 3 instructions
>> less, 10
>> >> >> bytes
>> >> >> >> >> saved
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> Now properly written code, of course branch-free, faster
>> and
>> >> >> shorter:
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> # Copyright (C) 2004-2020, Stefan Kanthak <
>> >> >> stefan.kanthak at nexgo.de>
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> ___cmpdi2:
>> >> >> >> >>
>> >> >> >> >> >> mov ecx, [esp+4]
>> >> >> >> >>
>> >> >> >> >> >> mov edx, [esp+12]
>> >> >> >> >>
>> >> >> >> >> >> cmp ecx, edx
>> >> >> >> >>
>> >> >> >> >> >> mov eax, [esp+8]
>> >> >> >> >>
>> >> >> >> >> >> sbb eax, [esp+16]
>> >> >> >> >>
>> >> >> >> >> >> setl ah
>> >> >> >> >>
>> >> >> >> >> >> cmp edx, ecx
>> >> >> >> >>
>> >> >> >> >> >> mov edx, [esp+16]
>> >> >> >> >>
>> >> >> >> >> >> sbb edx, [esp+8]
>> >> >> >> >>
>> >> >> >> >> >> setl al
>> >> >> >> >>
>> >> >> >> >> >> sub al, ah
>> >> >> >> >>
>> >> >> >> >> >> movsx eax, al
>> >> >> >> >>
>> >> >> >> >> >> inc eax
>> >> >> >> >>
>> >> >> >> >> >> ret
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> ___ucmpdi2:
>> >> >> >> >>
>> >> >> >> >> >> mov ecx, [esp+4]
>> >> >> >> >>
>> >> >> >> >> >> mov edx, [esp+12]
>> >> >> >> >>
>> >> >> >> >> >> cmp ecx, edx
>> >> >> >> >>
>> >> >> >> >> >> mov eax, [esp+8]
>> >> >> >> >>
>> >> >> >> >> >> sbb eax, [esp+16]
>> >> >> >> >>
>> >> >> >> >> >> sbb eax, eax
>> >> >> >> >>
>> >> >> >> >> >> cmp edx, ecx
>> >> >> >> >>
>> >> >> >> >> >> mov edx, [esp+16]
>> >> >> >> >>
>> >> >> >> >> >> sbb edx, [esp+8]
>> >> >> >> >>
>> >> >> >> >> >> adc eax, 1
>> >> >> >> >>
>> >> >> >> >> >> ret
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> AGAIN:
>> >> >> >> >>
>> >> >> >> >> >> Remove every occurance of the word "optimized" on the
>> above web
>> >> >> page.
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >> 'nuff said
>> >> >> >> >>
>> >> >> >> >> >> Stefan
>> >> >> >> >>
>> >> >> >> >> >> _______________________________________________
>> >> >> >> >>
>> >> >> >> >> >> cfe-dev mailing list
>> >> >> >> >>
>> >> >> >> >> >> cfe-dev at lists.llvm.org
>> >> >> >> >>
>> >> >> >> >> >> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > --
>> >> >> >> > ~Craig
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20200909/3d24cd5b/attachment-0001.html>
More information about the cfe-dev
mailing list