<div dir="ltr">Those two lines of code are slightly different. The first assumes sr to be 0-63. The second allows sr to be 0-127. So we might end up with a runtime check on bit 6 and a cmove to handle 64-127 and 0-63 differently. Maybe the compiler figures out from the surrounding code that bit 6 is 0, but I'd have to check.<div><br></div><div> foo.high = (foo.high << sr) | (foo.low >> (64 - sr));<br>instead of just<br> foo.all <<= sr, </div><div><br></div><div><br clear="all"><div><div dir="ltr" class="gmail_signature" data-smartmail="gmail_signature">~Craig</div></div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Sep 9, 2020 at 10:53 AM Stefan Kanthak <<a href="mailto:stefan.kanthak@nexgo.de">stefan.kanthak@nexgo.de</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">"Craig Topper" <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>> wrote:<br>
<br>
> Currently we skip the build of __udivmodti4(), __multi3(), __modti3() etc.<br>
> when compiling compiler-rt with MSVC. The problem is more than just the<br>
> ABI. A bunch of the C code also assumes for example that you can do >> on<br>
> __uint128_t.<br>
<br>
Correct.<br>
OTOH, at least udivmodti4.c contains a whole lot of rather ugly "compound"<br>
shifts<br>
foo.high = (foo.high << sr) | (foo.low >> (64 - sr));<br>
instead of just<br>
foo.all <<= sr,<br>
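[For illustration, a sketch of the full 0-127 shift with the runtime check on bit 6 of sr that Craig describes. The names u128parts and shl128 are hypothetical, not taken from the compiler-rt sources. Note also why the quoted one-liner assumes sr in 1-63: when sr == 0, (64 - sr) == 64, and a 64-bit shift by 64 is undefined behaviour in C.]<br>

```c
#include <stdint.h>

/* Hypothetical two-word stand-in for a 128-bit integer. */
typedef struct { uint64_t low, high; } u128parts;

/* Full 128-bit left shift for sr in [0, 127]; the test of bit 6
   separates shifts that cross the 64-bit word boundary. */
u128parts shl128(u128parts x, unsigned sr) {
    u128parts r;
    if (sr & 64) {                 /* sr in [64, 127]: low word vanishes */
        r.low  = 0;
        r.high = x.low << (sr & 63);
    } else if (sr) {               /* sr in [1, 63]: compound shift */
        r.low  = x.low << sr;
        r.high = (x.high << sr) | (x.low >> (64 - sr));
    } else {                       /* sr == 0: avoid the UB of >> 64 */
        r = x;
    }
    return r;
}
```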
<br>
The same holds of course for udivmoddi4.c too!<br>
JFTR: ANSI/ISO C introduced 64-bit data types in the last millennium.<br>
<br>
Again my question: who reviews such code?<br>
Let alone its poor^Wdisastrous performance...<br>
<br>
> But if the __uint128_t is a struct as would be needed for MSVC<br>
> that code won't compile. Similar for addition, subtraction, etc.<br>
<br>
I know (not that I tried, reading alone is sufficient to notice that).<br>
<br>
> Relevant code from the compiler-rt where we disable 128-bit.<br>
> <br>
> // MSVC doesn't have a working 128bit integer type. Users should really<br>
> compile<br>
> // compiler-rt with clang, but if they happen to be doing a standalone<br>
> build for<br>
> // asan or something else, disable the 128 bit parts so things sort of work.<br>
> #if defined(_MSC_VER) && !defined(__clang__)<br>
> #undef CRT_HAS_128BIT<br>
> #endif<br>
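[To make the struct problem concrete: a sketch of a two-word representation in the style of compiler-rt's udwords/utwords unions, with field and type names that are illustrative rather than the exact upstream ones. Once __uint128_t becomes a struct, even addition needs explicit carry handling, which is why the C sources that apply >> and + directly cannot compile as-is under MSVC.]<br>

```c
#include <stdint.h>

/* Illustrative two-word stand-in for __uint128_t when the
   compiler lacks a native 128-bit type. */
typedef struct { uint64_t low, high; } udwords128;

/* With a struct, addition must propagate the carry by hand:
   an unsigned overflow of the low word is detected by the
   wrapped sum being smaller than an operand. */
udwords128 add128(udwords128 a, udwords128 b) {
    udwords128 r;
    r.low  = a.low + b.low;
    r.high = a.high + b.high + (r.low < a.low); /* carry out of low word */
    return r;
}
```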
<br>
I looked only at COMPILER_RT_ABI and didn't notice that #undef.<br>
<br>
Stefan<br>
<br>
> On Wed, Sep 9, 2020 at 10:27 AM Stefan Kanthak <<a href="mailto:stefan.kanthak@nexgo.de" target="_blank">stefan.kanthak@nexgo.de</a>><br>
> wrote:<br>
> <br>
>> "Craig Topper" <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>> wrote:<br>
>><br>
>> I should have added "rhetorical" to my alternative question. Sorry.<br>
>> The main question, though, is still open: is there a (deeper) reason to not<br>
>> always generate the optimised code for __builtin_*?<br>
>><br>
>> > The code in the builtins library is written in C and can't use<br>
>> > __builtin_parity since it needs to compile with compilers that don't have<br>
>> > __builtin_parity like MSVC.<br>
>><br>
>> Aahhhhh: how do you build the __udivmodti4(), __multi3(), __modti3() etc.<br>
>> routines for clang_rt.builtins-x86_64.lib with MSVC?<br>
>> As you know by now, there are some fidgety but strictly necessary details<br>
>> missing in the declaration of functions returning __uint128_t or<br>
>> __int128_t!<br>
>><br>
>> regards<br>
>> Stefan<br>
>><br>
>> > So to get the optimized code, we would have to<br>
>> > have a version written in assembly for X86. Probably two assembly<br>
>> versions<br>
>> > since MASM and gcc use different assembly syntax. So we'd have 3<br>
>> different<br>
>> > versions of parity for 3 different bit widths. Maybe some macros could<br>
>> > allow us to share some bit widths or something. But ultimately it was a<br>
>> > question of where to spend effort.<br>
>> ><br>
>> > ~Craig<br>
>> ><br>
>> ><br>
>> > On Wed, Sep 9, 2020 at 9:08 AM Stefan Kanthak <<a href="mailto:stefan.kanthak@nexgo.de" target="_blank">stefan.kanthak@nexgo.de</a>><br>
>> > wrote:<br>
>> ><br>
>> >> "Craig Topper" <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>> wrote:<br>
>> >><br>
>> >> > Turn on optimizations.<br>
>> >><br>
>> >> What's the (deeper) reason NOT to always generate the optimised code for<br>
>> >> __builtin_*?<br>
>> >> Alternative question: do you also ship a library with unoptimised code<br>
>> >> which<br>
>> >> is linked when the -O option is not given?<br>
>> >><br>
>> >> Just wondering about surprising behaviour<br>
>> >> Stefan<br>
>> >><br>
>> >> > On Wed, Sep 9, 2020 at 4:28 AM Stefan Kanthak <<br>
>> <a href="mailto:stefan.kanthak@nexgo.de" target="_blank">stefan.kanthak@nexgo.de</a>><br>
>> >> > wrote:<br>
>> >> ><br>
>> >> >> "Craig Topper" <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>> wrote:<br>
>> >> >><br>
>> >> >> > __builtin_parity uses setnp on older x86 and popcnt with sse4.2<br>
>> >> >><br>
>> >> >> Reality check, PLEASE:<br>
>> >> >><br>
>> >> >> --- bug.c ---<br>
>> >> >> int main(int argc, char *argv[]) {<br>
>> >> >> return __builtin_parity(argc);<br>
>> >> >> }<br>
>> >> >> --- EOF ---<br>
>> >> >><br>
>> >> >> clang -o- -target i386-pc-linux -S bug.c<br>
>> >> >> clang version 10.0.0<br>
>> >> >> Target: i386-pc-linux<br>
>> >> >><br>
>> >> >> pushl %ebp<br>
>> >> >> movl %esp, %ebp<br>
>> >> >> subl $8, %esp<br>
>> >> >> movl 8(%ebp), %eax<br>
>> >> >> movl $0, -4(%ebp)<br>
>> >> >> movl 8(%ebp), %ecx<br>
>> >> >> movl %ecx, %edx<br>
>> >> >> shrl %edx<br>
>> >> >> andl $1431655765, %edx # imm = 0x55555555<br>
>> >> >> subl %edx, %ecx<br>
>> >> >> movl %ecx, %edx<br>
>> >> >> andl $858993459, %edx # imm = 0x33333333<br>
>> >> >> shrl $2, %ecx<br>
>> >> >> andl $858993459, %ecx # imm = 0x33333333<br>
>> >> >> addl %ecx, %edx<br>
>> >> >> movl %edx, %ecx<br>
>> >> >> shrl $4, %ecx<br>
>> >> >> addl %ecx, %edx<br>
>> >> >> andl $252645135, %edx # imm = 0xF0F0F0F<br>
>> >> >> imull $16843009, %edx, %ecx # imm = 0x1010101<br>
>> >> >> shrl $24, %ecx<br>
>> >> >> andl $1, %ecx<br>
>> >> >> movl %eax, -8(%ebp) # 4-byte Spill<br>
>> >> >> movl %ecx, %eax<br>
>> >> >> addl $8, %esp<br>
>> >> >> popl %ebp<br>
>> >> >> retl<br>
>> >> >><br>
>> >> >><br>
>> >> >> clang -o- -target amd64-pc-linux -S bug.c<br>
>> >> >><br>
>> >> >> pushq %rbp<br>
>> >> >> .cfi_def_cfa_offset 16<br>
>> >> >> .cfi_offset %rbp, -16<br>
>> >> >> movq %rsp, %rbp<br>
>> >> >> .cfi_def_cfa_register %rbp<br>
>> >> >> movl $0, -4(%rbp)<br>
>> >> >> movl %edi, -8(%rbp)<br>
>> >> >> movl -8(%rbp), %eax<br>
>> >> >> movl %eax, %ecx<br>
>> >> >> shrl %ecx<br>
>> >> >> andl $1431655765, %ecx # imm = 0x55555555<br>
>> >> >> subl %ecx, %eax<br>
>> >> >> movl %eax, %ecx<br>
>> >> >> andl $858993459, %ecx # imm = 0x33333333<br>
>> >> >> shrl $2, %eax<br>
>> >> >> andl $858993459, %eax # imm = 0x33333333<br>
>> >> >> addl %eax, %ecx<br>
>> >> >> movl %ecx, %eax<br>
>> >> >> shrl $4, %eax<br>
>> >> >> addl %eax, %ecx<br>
>> >> >> andl $252645135, %ecx # imm = 0xF0F0F0F<br>
>> >> >> imull $16843009, %ecx, %eax # imm = 0x1010101<br>
>> >> >> shrl $24, %eax<br>
>> >> >> andl $1, %eax<br>
>> >> >> popq %rbp<br>
>> >> >> .cfi_def_cfa %rsp, 8<br>
>> >> >> retq<br>
>> >> >><br>
>> >> >> JFTR: this is the same unoptimised code as shipped in the builtins<br>
>> >> library!<br>
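>> >> [The listings above are the literal, unoptimised translation of the portable population-count idiom whose low bit is the parity. As a sketch, C along these lines produces them; the helper name parity32 is hypothetical. With optimizations enabled, clang can instead collapse the whole sequence to the setnp or popcnt forms mentioned earlier in the thread.]<br>

```c
#include <stdint.h>

/* SWAR population count; the parity is its least significant bit.
   The masks 0x55555555, 0x33333333, 0x0F0F0F0F and the multiplier
   0x01010101 are exactly the immediates in the assembly listings. */
unsigned parity32(uint32_t x) {
    x -= (x >> 1) & 0x55555555u;                       /* 2-bit counts */
    x  = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit counts */
    x  = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit counts */
    return ((x * 0x01010101u) >> 24) & 1u;  /* sum bytes, take low bit */
}
```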
>> >> >><br>
>> >> >> Stefan<br>
>> >> >><br>
>> >> >> > On Sun, Sep 6, 2020 at 1:32 PM Stefan Kanthak <<br>
>> >> <a href="mailto:stefan.kanthak@nexgo.de" target="_blank">stefan.kanthak@nexgo.de</a>><br>
>> >> >> > wrote:<br>
>> >> >> ><br>
>> >> >> >> "Craig Topper" <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>> wrote:<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> > Clang never generates calls to ___paritysi2, ___paritydi2,<br>
>> >> ___cmpdi2,<br>
>> >> >> or<br>
>> >> >> >><br>
>> >> >> >> > ___ucmpdi2 on X86 so its not clear the performance of this<br>
>> matters<br>
>> >> at<br>
>> >> >> >> all.<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> So you can safely remove them for X86 and all the other targets<br>
>> where<br>
>> >> >> such<br>
>> >> >> >><br>
>> >> >> >> unoptimized code is never called!<br>
>> >> >> >><br>
>> >> >> >> But fix these routines for targets where they are called.<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> The statement does NOT make any exceptions, and it does not say<br>
>> >> >> >><br>
>> >> >> >> | ships unoptimized routines the compiler never calls<br>
>> >> >> >><br>
>> >> >> >> but<br>
>> >> >> >><br>
>> >> >> >> | optimized target-independent implementations<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> Stefan<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> BTW: do builtins like __builtin_*parity* exist?<br>
>> >> >> >><br>
>> >> >> >> If yes: do they generate the same bad code?<br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> >> > On Sun, Sep 6, 2020 at 12:31 PM Stefan Kanthak via cfe-dev <<br>
>> >> >> >><br>
>> >> >> >> > <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<br>
>> >> >> >><br>
>> >> >> >> ><br>
>> >> >> >><br>
>> >> >> >> >> <<a href="https://compiler-rt.llvm.org/index.html" rel="noreferrer" target="_blank">https://compiler-rt.llvm.org/index.html</a>> boasts:<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> | The builtins library provides optimized implementations of<br>
>> this<br>
>> >> >> >><br>
>> >> >> >> >> | and other low-level routines, either in target-independent C<br>
>> >> form,<br>
>> >> >> >><br>
>> >> >> >> >> | or as a heavily-optimized assembly.<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> Really?<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> Left: inefficient code shipped in # Right: slightly<br>
>> improved<br>
>> >> >> code,<br>
>> >> >> >><br>
>> >> >> >> >> clang_rt.builtins-* # which the<br>
>> optimiser<br>
>> >> >> >> REALLY<br>
>> >> >> >><br>
>> >> >> >> >> # should have<br>
>> >> generated<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> ___cmpdi2:<br>
>> >> >> >><br>
>> >> >> >> >> mov ecx, [esp+16] # mov ecx,<br>
>> >> [esp+16]<br>
>> >> >> >><br>
>> >> >> >> >> xor eax, eax # xor eax, eax<br>
>> >> >> >><br>
>> >> >> >> >> cmp [esp+8], ecx # cmp ecx,<br>
>> [esp+8]<br>
>> >> >> >><br>
>> >> >> >> >> jl @f # jg @f<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, 2 # mov eax, 2<br>
>> >> >> >><br>
>> >> >> >> >> jg @f # jl @f<br>
>> >> >> >><br>
>> >> >> >> >> mov ecx, [esp+4] #<br>
>> >> >> >><br>
>> >> >> >> >> mov edx, [esp+12] # mov ecx,<br>
>> >> [esp+12]<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, 0 # xor eax, eax<br>
>> >> >> >><br>
>> >> >> >> >> cmp ecx, edx # cmp ecx,<br>
>> [esp+4]<br>
>> >> >> >><br>
>> >> >> >> >> jb @f # ja @f<br>
>> >> >> >><br>
>> >> >> >> >> cmp edx, ecx #<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, 1 #<br>
>> >> >> >><br>
>> >> >> >> >> adc eax, 0 # adc eax, 1<br>
>> >> >> >><br>
>> >> >> >> >> @@: # @@:<br>
>> >> >> >><br>
>> >> >> >> >> ret # ret<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> # 3 instructions less, 10<br>
>> >> bytes<br>
>> >> >> >> saved<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> ___ucmpdi2:<br>
>> >> >> >><br>
>> >> >> >> >> mov ecx, [esp+16] # mov ecx,<br>
>> >> [esp+16]<br>
>> >> >> >><br>
>> >> >> >> >> xor eax, eax # xor eax, eax<br>
>> >> >> >><br>
>> >> >> >> >> cmp [esp+8], ecx # cmp ecx,<br>
>> [esp+8]<br>
>> >> >> >><br>
>> >> >> >> >> jb @f # ja @f<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, 2 # mov eax, 2<br>
>> >> >> >><br>
>> >> >> >> >> ja @f # jb @f<br>
>> >> >> >><br>
>> >> >> >> >> mov ecx, [esp+4] #<br>
>> >> >> >><br>
>> >> >> >> >> mov edx, [esp+12] # mov ecx,<br>
>> >> [esp+12]<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, 0 # xor eax, eax<br>
>> >> >> >><br>
>> >> >> >> >> cmp ecx, edx # cmp ecx,<br>
>> [esp+4]<br>
>> >> >> >><br>
>> >> >> >> >> jb @f # ja @f<br>
>> >> >> >><br>
>> >> >> >> >> cmp edx, ecx #<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, 1 #<br>
>> >> >> >><br>
>> >> >> >> >> adc eax, 0 # adc eax, 1<br>
>> >> >> >><br>
>> >> >> >> >> @@: # @@:<br>
>> >> >> >><br>
>> >> >> >> >> ret # ret<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> # 3 instructions less, 10<br>
>> >> bytes<br>
>> >> >> >> saved<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> Now properly written code, of course branch-free, faster and<br>
>> >> shorter:<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> # Copyright (C) 2004-2020, Stefan Kanthak <<br>
>> >> <a href="mailto:stefan.kanthak@nexgo.de" target="_blank">stefan.kanthak@nexgo.de</a>><br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> ___cmpdi2:<br>
>> >> >> >><br>
>> >> >> >> >> mov ecx, [esp+4]<br>
>> >> >> >><br>
>> >> >> >> >> mov edx, [esp+12]<br>
>> >> >> >><br>
>> >> >> >> >> cmp ecx, edx<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, [esp+8]<br>
>> >> >> >><br>
>> >> >> >> >> sbb eax, [esp+16]<br>
>> >> >> >><br>
>> >> >> >> >> setl ah<br>
>> >> >> >><br>
>> >> >> >> >> cmp edx, ecx<br>
>> >> >> >><br>
>> >> >> >> >> mov edx, [esp+16]<br>
>> >> >> >><br>
>> >> >> >> >> sbb edx, [esp+8]<br>
>> >> >> >><br>
>> >> >> >> >> setl al<br>
>> >> >> >><br>
>> >> >> >> >> sub al, ah<br>
>> >> >> >><br>
>> >> >> >> >> movsx eax, al<br>
>> >> >> >><br>
>> >> >> >> >> inc eax<br>
>> >> >> >><br>
>> >> >> >> >> ret<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> ___ucmpdi2:<br>
>> >> >> >><br>
>> >> >> >> >> mov ecx, [esp+4]<br>
>> >> >> >><br>
>> >> >> >> >> mov edx, [esp+12]<br>
>> >> >> >><br>
>> >> >> >> >> cmp ecx, edx<br>
>> >> >> >><br>
>> >> >> >> >> mov eax, [esp+8]<br>
>> >> >> >><br>
>> >> >> >> >> sbb eax, [esp+16]<br>
>> >> >> >><br>
>> >> >> >> >> sbb eax, eax<br>
>> >> >> >><br>
>> >> >> >> >> cmp edx, ecx<br>
>> >> >> >><br>
>> >> >> >> >> mov edx, [esp+16]<br>
>> >> >> >><br>
>> >> >> >> >> sbb edx, [esp+8]<br>
>> >> >> >><br>
>> >> >> >> >> adc eax, 1<br>
>> >> >> >><br>
>> >> >> >> >> ret<br>
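>> >> >> >> >> [For reference, the contract both routines above implement, written as a minimal C sketch: return 0 if a < b, 1 if a == b, 2 if a > b, signed and unsigned respectively. The helper names cmpdi2/ucmpdi2 are illustrative stand-ins for the library symbols.]<br>

```c
#include <stdint.h>

/* (a > b) - (a < b) yields -1/0/+1; adding 1 maps it onto the
   0 = less, 1 = equal, 2 = greater convention of the routines. */
int cmpdi2(int64_t a, int64_t b)    { return (a > b) - (a < b) + 1; }
int ucmpdi2(uint64_t a, uint64_t b) { return (a > b) - (a < b) + 1; }
```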
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> AGAIN:<br>
>> >> >> >><br>
>> >> >> >> >> Remove every occurrence of the word "optimized" on the above web<br>
>> >> page.<br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >> >> 'nuff said<br>
>> >> >> >><br>
>> >> >> >> >> Stefan<br>
>> >> >> >><br>
>> >> >> >> >> _______________________________________________<br>
>> >> >> >><br>
>> >> >> >> >> cfe-dev mailing list<br>
>> >> >> >><br>
>> >> >> >> >> <a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
>> >> >> >><br>
>> >> >> >> >> <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
>> >> >> >><br>
>> >> >> >> >><br>
>> >> >> >><br>
>> >> >> >><br>
>> >> >> ><br>
>> >> >> > --<br>
>> >> >> > ~Craig<br>
>> >> >> ><br>
>> >> >><br>
>> >><br>
>><br>
</blockquote></div>