[llvm-dev] RFC [ThinLTO]: Promoting more aggressively in order to reduce incremental link time and allow sharing between linkage units
Peter Collingbourne via llvm-dev
llvm-dev at lists.llvm.org
Thu Apr 7 12:32:58 PDT 2016
On Wed, Apr 6, 2016 at 9:53 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
> On Apr 6, 2016, at 9:40 PM, Teresa Johnson <tejohnson at google.com> wrote:
>> On Wed, Apr 6, 2016 at 5:13 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>> On Wed, Apr 6, 2016 at 4:53 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>>> On Apr 6, 2016, at 4:41 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>>>> Hi all,
>>>>> I'd like to propose changes to how we do promotion of global values in
>>>>> ThinLTO. The goal here is to make it possible to pre-compile parts of the
>>>>> translation unit to native code at compile time. For example, if we know
>>>>> 1) A function is a leaf function, so it will never import any other
>>>>> functions, and
>>>> It still may be imported somewhere else right?
>>>>> 2) The function's instruction count falls above a threshold specified at
>>>>> compile time, so it will never be imported.
>>>> It won’t be imported, but unless it is a “leaf” it may import and inline
>>>> other functions.
>>>>> 3) The compile-time threshold is zero, so there is no possibility of
>>>>> functions being imported (What's the utility of this? Consider a program
>>>>> transformation that requires whole-program information, such as CFI. During
>>>>> development, the import threshold may be set to zero in order to minimize
>>>>> the incremental link time while still providing the same CFI enforcement
>>>>> that would be used in production builds of the application.)
>>>>> then the function's body will not be affected by link-time decisions,
>>>>> and we might as well produce its object code at compile time.
>>>> Reading this last sentence, it seems exactly the “non-LTO” case?
>>> Yes, basically the point of this proposal is to be able to split the
>>> linkage unit into LTO and non-LTO parts.
>>>>> This will also allow the object code to be shared between linkage units
>>>>> (this should hopefully help solve a major scalability problem for Chromium,
>>>>> as that project contains a large number of test binaries based on common
>>>>> code.)
>>>>> This can be done with a change to the intermediate object file format.
>>>>> We can represent object files as native code containing statically compiled
>>>>> functions and global data in the .text, .data, .rodata (etc.) sections,
>>>>> with an .llvmbc section (or, I suppose, "__LLVM, __bitcode" when targeting
>>>>> Mach-O) containing bitcode for functions to be compiled at link time.
>>>>> In order to make this work, we need to make sure that references from
>>>>> link-time compiled functions to statically compiled functions work
>>>>> correctly in the case where the statically compiled function has internal
>>>>> linkage. We can do this by promoting every global value with internal
>>>>> linkage, using a hash of the external names (as I mentioned in ).
>> Mehdi - I know you were keen to reduce the amount of promotion. Is that
>> still an issue for you assuming linker GC (dead stripping)?
> Yes: we do better optimization on internal functions in general. Our
> benchmarks showed that it can really make some difference, and many cases
> where ThinLTO didn’t perform as well as FullLTO were because of this
> (binary size has never been my concern here).
>> With this proposal we will need to stick with the current promote
>> everything scheme.
> I don’t think so: you would need to do it only for “internal functions that
> are leaves and aren’t likely to be imported/inlined”.
> That said, any function whose binary we emit at compile time instead of at
> link time will inhibit optimizations for LTO/ThinLTO. The
> gain in compile time has to be really important to make it worth it.
> (Of course the CFI use-case is a totally different tradeoff.)
> Peter: have you thought about debug info by the way?
Yes, I suspect we'll have to duplicate the debug info between the non-LTO
and the LTO part like I was doing with parallel LTO codegen. In practice,
that means we'll end up with two DWARF compile units per TU. That's
probably better than it being a factor of the number of threads, and since
only one of the two compile units will be codegen'd at any one time, we
hopefully shouldn't see the sort of memory consumption we were seeing with
parallel LTO codegen.