[llvm-dev] The builtins library of compiler-rt is a performance HOG^WKILLER

Wed Dec 5 03:17:54 PST 2018

"Chris Bieneman" <chris.bieneman at me.com> wrote:

>> On Dec 3, 2018, at 10:50 AM, Stefan Kanthak via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>> 
>> "Craig Topper" <craig.topper at gmail.com> wrote:
>> 
>>> None of the "si" division routines will be used by x86.
>> 
>> That was my expectation too.
>> 
>>> They exist for targets that don't support the operations natively.
>>> X86 supports them natively so will never use the library functions.
>> 
>> So they SHOULD not be built (or at least not shipped) with the
>> builtins library for x86.
>
> I think you will find that down this path lies madness. Apple has
> tried for many years to limit which builtins get shipped in compiler-rt
> to just the smallest correct set to reduce the distribution size of
> clang. Over the years we've taken several different approaches, and
> they are all error prone and result in bugs. This problem stems from
> the fact that generation of builtin calls can be triggered by
> optimization settings, architecture, ABI, or the lunar cycle.
>
> Initially we (Apple) maintained per-architecture lists of builtins,
> but those lists wouldn't get updated when new builtins got added,
> and we'd get bugs often after we shipped. Then I moved to an inverted
> system where we maintained lists to exclude, allowing that all new
> builtins always got added, but that has turned out to be a mess
> because it is really hard to know if it is safe to exclude something,
> and *oh wait the compiler changed and now it isn't safe anymore*.
>
> IMO, and coming from some painful experience, I think including all
> builtin functions is the easiest way to make a less buggy release,
> until someone comes along and comes up with a definitive way for us
> to always know if a given builtin is possible to generate with a given
> compiler.

Thanks for this background information.
Now if such design decisions, and/or their outcome, like (for example)

| compiler-rt ships with all routines available for the supported
| architectures

were published on the compiler-rt web pages, I wouldn't need to ask.

regards
Stefan

>> X86 has its own assembly implementation of __muldi3 that uses 32-bit
>> pieces.
> 
> I know; that's why I placed this ABOVE my "JFTR:"
> 
>> We should be using the assembly versions of the "di" division routines on
>> i386. Except when compiler-rt is built with MSVC because MSVC can't parse
>> the AT&T assembly syntax.
> 
> Again: my offer to provide these routines still stands!
> 
> I have OPTIMISED __divdi3, __moddi3, __udivdi3 and __umoddi3 in
> Intel syntax, wrapped as inline files into an NMakefile, for use
> with ML.EXE.
> For the optimisations see the patch I sent last week.
> 
> Since Howard Hinnant is NO LONGER with LLVM: who is the CURRENT
> code owner and reviewer for the builtins library, especially for
> x86?
> 
> I'm asking this SIMPLE question now for the 3rd time!
> 
> I also have __udivmoddi3: adding the pointer to the remainder as
> argument and 4 more instructions will turn it into __udivmoddi4.
> 
> Compiling them with MSVC is of course easy to achieve: remove the
> MASM/ML statements, put the assembler source inside an __asm block,
> and add a function definition with __declspec(naked)
> 
> But then someone will have to find new filenames; I'd prefer to
> leave them as *.ASM, so they can be added to YOUR source tree
> without clobbering existing files.
> 
> The same holds for __alldiv, __alldvrm, __allrem, __aulldiv,
> __aulldvrm and __aullrem, plus __allmul, __allshl, __allshr and
> __aullshr.
> 
> If you name a reviewer I'll send them to llvm-commits!
> 
> regards
> Stefan
> 
>> On Mon, Dec 3, 2018 at 5:51 AM Stefan Kanthak via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>> 
>>> Hi @ll,
>>> 
>>> LLVM-7.0.0-win32.exe contains and installs
>>> lib\clang\7.0.0\lib\windows\clang_rt.builtins-i386.lib
>>> 
>>> The implementation of (at least) the multiplication and division
>>> routines __[u]{div,mod,divmod,mul}[sdt]i[34] shipped with this
>>> library SUCKS: they are factors SLOWER than even Microsoft's
>>> NOTORIOUSLY POOR implementation of 64-bit division shipped with
>>> MSVC and Windows!
>>> 
>>> The reasons: 1. subroutine matryoshka, 2. "C" implementation!
>>> 
>>> JFTR: the target processor "i386" (introduced October 1985) is
>>>      a 32-bit processor, it has instructions to divide 64-bit
>>>      integers by 32-bit integers, and to multiply two 32-bit
>>>      integers giving a 64-bit product!
>>>      I expect that a library written 20+ years later takes
>>>      advantage of these instructions!
>>> 
>>> __divsi3 (18 instructions) performs a DIV after 2 calls of abs(),
>>>                           plus a final negation, instead of just
>>>                           a single IDIV
>>> __modsi3 (14 instructions) calls __divsi3 (18 instructions)
>>> __divmodsi4 (17 instructions) calls __divsi3 (18 instructions)
>>> 
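To make the complaint about __divsi3 concrete, here is a minimal sketch in C of the two shapes being contrasted. It is not the compiler-rt source; the names divsi3_generic and divsi3_native are hypothetical, and the sketch assumes a two's-complement target where converting an out-of-range unsigned value to int32_t wraps.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-handling shape: take absolute values, divide unsigned,
 * then negate the quotient if the operand signs differ. */
int32_t divsi3_generic(int32_t a, int32_t b)
{
    assert(b != 0);
    uint32_t ua = a < 0 ? 0u - (uint32_t)a : (uint32_t)a;
    uint32_t ub = b < 0 ? 0u - (uint32_t)b : (uint32_t)b;
    uint32_t q = ua / ub;                 /* unsigned divide (DIV on x86) */
    /* Assumes the usual wrapping conversion back to a signed value. */
    return ((a < 0) != (b < 0)) ? (int32_t)(0u - q) : (int32_t)q;
}

/* Direct shape: a compiler targeting x86 can typically lower this
 * to a sign extension plus one signed divide (IDIV). */
int32_t divsi3_native(int32_t a, int32_t b)
{
    assert(b != 0);
    return a / b;
}
```

Both shapes truncate toward zero, so they agree for every input except INT32_MIN / -1, which overflows in either form.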
>>> __udivsi3 (52 instructions) does NOT use DIV, but performs BITWISE
>>>                            division using shifts and additions!
>>> __umodsi3 (14 instructions) calls __udivsi3 (52 instructions)
>>> __udivmodsi4 (17 instructions) calls __udivsi3 (52 instructions)
>>> 
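For comparison, a bitwise division of the kind described for __udivsi3 can be sketched as a restoring shift-and-subtract loop. This is an illustration only, not the compiler-rt code, and udiv32_bitwise is a hypothetical name. It needs 32 iterations where the hardware needs one DIV.

```c
#include <assert.h>
#include <stdint.h>

/* Restoring division: produce one quotient bit per iteration. */
uint32_t udiv32_bitwise(uint32_t n, uint32_t d)
{
    assert(d != 0);
    uint32_t q = 0;
    uint32_t r = 0;                       /* running remainder, always < d */
    for (int i = 31; i >= 0; --i) {
        /* Bring down the next dividend bit, most significant first. */
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;                       /* divisor fits: subtract it ... */
            q |= (uint32_t)1 << i;        /* ... and set this quotient bit */
        }
    }
    return q;
}
```

The shift cannot lose a bit: before the step for bit i the remainder has been built from only 31 - i dividend bits, so its top bit is still clear.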
>>> __muldi3 (41 instructions) performs a "long" multiplication on
>>>                           16-bit "digits"
>>> 
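The alternative to 16-bit "digits" is to build the 64-bit product from 32-bit pieces, which matches the widening 32 x 32 -> 64 multiply mentioned above. A minimal sketch in C follows; mul64_from_32bit_pieces is a hypothetical name, and this is not the assembly shipped in compiler-rt.

```c
#include <stdint.h>

/* 64 x 64 -> 64 multiplication from 32-bit halves.
 * a * b mod 2^64 = a_lo*b_lo + 2^32 * (a_lo*b_hi + a_hi*b_lo);
 * the a_hi*b_hi term lies entirely above bit 63 and is dropped. */
uint64_t mul64_from_32bit_pieces(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    /* One widening multiply for the full low partial product. */
    uint64_t low = (uint64_t)a_lo * b_lo;

    /* Only the low 32 bits of the cross terms survive the shift. */
    uint32_t cross = (uint32_t)((uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo);

    return low + ((uint64_t)cross << 32);
}
```

That is three multiplications in total, and only the first one needs its full 64-bit result.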
>>> JFTR: I haven't checked whether clang actually calls these
>>>      SUPERFLUOUS routines listed above.
>>>      IT HAD BETTER NOT, EVER!
>>> 
>>> __divdi3 (37 instructions) calls __udivmoddi4 (254 instructions)
>>> __moddi3 (51 instructions) calls __udivmoddi4 (254 instructions)
>>> __divmoddi4 (36 instructions) calls __divdi3 (37 instructions) which
>>>                              calls __udivmoddi4 (254 instructions)
>>> __udivdi3 (8 instructions) calls __udivmoddi4 (254 instructions)
>>> __umoddi3 (33 instructions) calls __udivmoddi4 (254 instructions)
>>> 
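The layering listed above, where each "di" routine is a thin wrapper around one combined divide-and-remainder routine, can be sketched roughly as follows. All names here (udivmod64, udiv64, umod64, sdiv64) are hypothetical stand-ins, and the core simply uses the C operators in place of the real 254-instruction routine, so the sketch shows only the call structure and the extra call plus sign fix-up that each wrapper adds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the combined core: returns the quotient and optionally
 * stores the remainder through a pointer. */
uint64_t udivmod64(uint64_t n, uint64_t d, uint64_t *rem)
{
    assert(d != 0);
    if (rem != NULL)
        *rem = n % d;
    return n / d;
}

/* Unsigned quotient: forwards to the core and ignores the remainder. */
uint64_t udiv64(uint64_t n, uint64_t d)
{
    return udivmod64(n, d, NULL);
}

/* Unsigned remainder: forwards to the core and ignores the quotient. */
uint64_t umod64(uint64_t n, uint64_t d)
{
    uint64_t r;
    (void)udivmod64(n, d, &r);
    return r;
}

/* Signed quotient: strip signs, call the core, restore the sign.
 * Assumes a two's-complement target with wrapping conversions. */
int64_t sdiv64(int64_t a, int64_t b)
{
    uint64_t ua = a < 0 ? (uint64_t)0 - (uint64_t)a : (uint64_t)a;
    uint64_t ub = b < 0 ? (uint64_t)0 - (uint64_t)b : (uint64_t)b;
    uint64_t q = udivmod64(ua, ub, NULL);
    return ((a < 0) != (b < 0)) ? (int64_t)((uint64_t)0 - q) : (int64_t)q;
}
```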
>>> JFTR: the subdirectory compiler-rt/lib/builtins/i386/ contains FAR
>>>      better (although suboptimal) __divdi3, __moddi3, __udivdi3 and
>>>      __umoddi3 routines written in assembler, which SHOULD be
>>>      shipped with clang_rt.builtins-i386.lib instead of the above
>>>      listed POOR and NOT optimised implementations!
>>> 
>>> NOT AMUSED
>>> Stefan Kanthak
>>> 
>>> PS: <https://lists.llvm.org/pipermail/llvm-dev/2018-November/128094.html>
>>>    has patches for the assembler routines!
>>> 
>>> PPS: please remove the blatant lie
>>>     | The builtins library provides optimized implementations of
>>>     | this and other low-level routines, either in target-independent
>>>     | C form, or as a heavily-optimized assembly.
>>>     seen on <https://compiler-rt.llvm.org/>
>>>     These routines are NOT optimized, and for sure NOT heavily-
>>>     optimized!
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> 
>> 