[llvm-dev] RFC [ThinLTO]: Promoting more aggressively in order to reduce incremental link time and allow sharing between linkage units

Thu Apr 7 21:59:57 PDT 2016

On Thu, Apr 7, 2016 at 3:25 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Apr 7, 2016, at 1:21 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>
>
>
> On Thu, Apr 7, 2016 at 12:52 PM, Mehdi Amini <mehdi.amini at apple.com>
> wrote:
>
>>
>> On Apr 7, 2016, at 12:39 PM, Peter Collingbourne <peter at pcc.me.uk> wrote:
>>
>>
>>
>> On Thu, Apr 7, 2016 at 12:29 PM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Apr 7, 2016, at 11:59 AM, Xinliang David Li <davidxl at google.com>
>>> wrote:
>>>
>>>
>>>
>>> On Thu, Apr 7, 2016 at 11:26 AM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>>
>>>>
>>>> On Apr 7, 2016, at 10:58 AM, Xinliang David Li <davidxl at google.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On Wed, Apr 6, 2016 at 9:53 PM, Mehdi Amini <mehdi.amini at apple.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On Apr 6, 2016, at 9:40 PM, Teresa Johnson <tejohnson at google.com>
>>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Apr 6, 2016 at 5:13 PM, Peter Collingbourne <peter at pcc.me.uk>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 6, 2016 at 4:53 PM, Mehdi Amini <mehdi.amini at apple.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Apr 6, 2016, at 4:41 PM, Peter Collingbourne <peter at pcc.me.uk>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like to propose changes to how we do promotion of global values
>>>>>>> in ThinLTO. The goal here is to make it possible to pre-compile parts of
>>>>>>> the translation unit to native code at compile time. For example, if we
>>>>>>> know that:
>>>>>>>
>>>>>>> 1) A function is a leaf function, so it will never import any other
>>>>>>> functions, and
>>>>>>>
>>>>>>>
>>>>>>> It still may be imported somewhere else, right?
>>>>>>>
>>>>>>> 2) The function's instruction count falls above a threshold
>>>>>>> specified at compile time, so it will never be imported.
>>>>>>>
>>>>>>>
>>>>>>> It won’t be imported, but unless it is a “leaf” it may import and
>>>>>>> inline itself.
>>>>>>>
>>>>>>
>>>>>>> or
>>>>>>> 3) The compile-time threshold is zero, so there is no possibility of
>>>>>>> functions being imported (What's the utility of this? Consider a program
>>>>>>> transformation that requires whole-program information, such as CFI. During
>>>>>>> development, the import threshold may be set to zero in order to minimize
>>>>>>> the incremental link time while still providing the same CFI enforcement
>>>>>>> that would be used in production builds of the application.)
>>>>>>>
>>>>>>> then the function's body will not be affected by link-time
>>>>>>> decisions, and we might as well produce its object code at compile time.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Reading this last sentence, it seems exactly the “non-LTO” case?
>>>>>>>
>>>>>>
>>>>>> Yes, basically the point of this proposal is to be able to split the
>>>>>> linkage unit into LTO and non-LTO parts.
>>>>>>
>>>>>>
>>>>>>> This will also allow the object code to be shared between linkage
>>>>>>> units (this should hopefully help solve a major scalability problem for
>>>>>>> Chromium, as that project contains a large number of test binaries based on
>>>>>>> common libraries).
>>>>>>>
>>>>>>> This can be done with a change to the intermediate object file
>>>>>>> format. We can represent object files as native code containing statically
>>>>>>> compiled functions and global data in the .text, .data, .rodata (etc.)
>>>>>>> sections, with an .llvmbc section (or, I suppose, "__LLVM,__bitcode" when
>>>>>>> targeting Mach-O) containing bitcode for functions to be compiled at link
>>>>>>> time.
>>>>>>>
>>>>>>> In order to make this work, we need to make sure that references
>>>>>>> from link-time compiled functions to statically compiled functions work
>>>>>>> correctly in the case where the statically compiled function has internal
>>>>>>> linkage. We can do this by promoting every global value with internal
>>>>>>> linkage, using a hash of the external names (as I mentioned in [1]).
>>>>>>>
>>>>>>>
>>>>> Mehdi - I know you were keen to reduce the amount of promotion. Is
>>>>> that still an issue for you assuming linker GC (dead stripping)?
>>>>>
>>>>>
>>>>> Yes: we do better optimization on internal functions in general.
>>>>>
>>>>
>>>> The inliner is one of the affected optimizations -- however, this sounds
>>>> like a matter of tuning to teach the inliner about promoted static functions.
>>>>
>>>>
>>>> The inliner computes a tradeoff between pseudo runtime cost and binary
>>>> size. The existing bonus for static functions applies when there is a
>>>> single call site, because then the binary size increase is nonexistent (the
>>>> static function is dropped after inlining). We promote a function because we
>>>> think we are likely to introduce a reference to it somewhere else, so
>>>> “lying” to the inliner is not necessarily a good idea.
>>>>
>>>
>>> It is not lying to the inliner. If a static (before promotion) function
>>> is a candidate to be inlined in the original defining module, it is
>>> probably more likely to be inlined in other importing modules where more
>>> context is available. In other words, the inliner can apply the same bonus
>>> to 'promoted' static functions, as if references in other modules will also
>>> disappear. Of course, we cannot assume it has a single call site.
>>>
>>> Comdat functions can be handled similarly.
>>>
>>>
>>>
>>>> That said, we (actually Bruno did) prototyped it already, with somewhat
>>>> good results :)
>>>> I’m not convinced yet that it should be independent of whether the function
>>>> is promoted or not, though.
>>>>
>>>
>>> Generally true (see the comdat case).
>>>
>>>
>>>>
>>>> Assuming we solve the inliner issue, there remain the “optimizations
>>>> other than the inliner”. We can probably solve most, but I suspect it won’t
>>>> be “trivial” either.
>>>>
>>>
>>>
>>> Any such optimizations in mind?
>>>
>>>
>>> I don’t have the details, but in short:
>>>
>>> For promoted functions: IPSCCP, dead arg elimination
>>> For promoted global variables: anything that is impacted somehow by
>>> aliasing
>>>
>>
>> When are you imagining that promotion would happen? If it happens just
>> before codegen (or bitcode emission), it wouldn't inhibit these
>> optimizations, right?
>>
>>
>> For ThinLTO it has to happen before the link-time optimizations, because
>> of cross-module importing.
>>
>
> Are you referring to the fact that these optimizations would be inhibited
> versus regular LTO, since we cannot internalize? Yes, that does seem like
> an issue.
>
>
> Yes, this is an issue I'm fighting with currently with ThinLTO.
> (And I haven't reached the tuning stage yet because I can't nail down the
> infrastructure these days...)
>

Sorry to chime in late here; I was away from my email most of the day.

I think the early promotion being proposed by Peter introduces fewer
optimization issues than the missing internalization of globals in ThinLTO.
For example, I would anticipate that the inline bonus for static functions
with a single call site would likely provide any intended benefit during the
inlining performed in the -O2 -c compile step, which runs before bitcode/text
emission, presumably the point at which the early promotion would occur. I am
less familiar with the other places mentioned by Mehdi, so I am not sure
about those, but presumably some could/should be done on static functions
during the -O2 compile step (e.g. dead argument elimination?).
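
To make the promotion step concrete, here is a rough sketch of renaming
internal symbols with a suffix derived from a hash of the module's external
names, as Peter describes in the RFC. This is only a toy model in Python,
not the actual LLVM code: the function names, the ".promoted." suffix format,
and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def module_hash(external_names):
    """Hash the sorted external symbol names of a module (hypothetical scheme)."""
    digest = hashlib.sha256()
    for name in sorted(external_names):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()[:16]


def promote_internals(symbols):
    """Promote every internal symbol to external linkage under a unique name.

    symbols maps a symbol name to its linkage, "internal" or "external".
    Returns a dict mapping each old name to a (new_name, new_linkage) pair.
    """
    externals = [name for name, linkage in symbols.items() if linkage == "external"]
    suffix = module_hash(externals)
    promoted = {}
    for name, linkage in symbols.items():
        if linkage == "internal":
            # The ".promoted." separator is an illustrative assumption.
            promoted[name] = (f"{name}.promoted.{suffix}", "external")
        else:
            promoted[name] = (name, linkage)
    return promoted


mod_a = {"helper": "internal", "main": "external"}
mod_b = {"helper": "internal", "test_main": "external"}
pa = promote_internals(mod_a)
pb = promote_internals(mod_b)
```

Two modules that each define a static "helper" end up with distinct promoted
names as long as their external names differ, and the renaming is
deterministic, so it can be done at compile time without any link-time
information.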

For internalization, when I implemented the ThinLTO prototype I played
with applying the single-call-site static function bonus to functions noted
in the summary as having a single call (along with linker GC). It sounds like
Mehdi/Bruno are also looking at that. I have also not yet had a chance to do
optimization tuning on the upstream implementation, but I hope to start that
very soon.
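
As a rough illustration of the summary-driven bonus, the idea could be
modeled along these lines. This is a simplified Python sketch, not the
prototype or the upstream inliner: the FunctionSummary fields, the use of
instruction count as the cost, and the threshold and bonus values are all
placeholders assumed for the example.

```python
from dataclasses import dataclass


@dataclass
class FunctionSummary:
    """Hypothetical per-function summary record."""
    name: str
    inst_count: int            # stand-in for the inliner's cost estimate
    num_call_sites: int        # whole-program count recorded in the summary
    originally_internal: bool  # True if the function was static before promotion


def inline_threshold(summary, base_threshold=225, single_call_bonus=15000):
    """Return the cost threshold under which the function would be inlined.

    A promoted static function with exactly one call site in the whole
    program gets the same bonus a true static function would get, because
    its out-of-line body can be dropped after inlining (given linker GC).
    """
    if summary.originally_internal and summary.num_call_sites == 1:
        return base_threshold + single_call_bonus
    return base_threshold


def should_inline(summary, **kwargs):
    return summary.inst_count <= inline_threshold(summary, **kwargs)


big_single = FunctionSummary("f", 1000, 1, True)
big_multi = FunctionSummary("g", 1000, 2, True)
small_external = FunctionSummary("h", 10, 5, False)
```

With these placeholder numbers, the large function with a single recorded
call site is inlined, while the same-sized function with two call sites is
not; small functions are inlined regardless of the bonus.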

Teresa

>
> --
> Mehdi
>
>

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |  408-460-2413