[PATCH] D17212: [ThinLTO] Support for call graph in per-module and combined summary.

Mon Feb 29 08:59:04 PST 2016

On Mon, Feb 29, 2016 at 8:45 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
>
> Sent from my iPhone
>
> On Feb 29, 2016, at 8:18 AM, Teresa Johnson <tejohnson at google.com> wrote:
>
>
>
> On Fri, Feb 26, 2016 at 1:54 PM, Teresa Johnson <tejohnson at google.com>
> wrote:
>
>>
>>
>> On Fri, Feb 26, 2016 at 11:18 AM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>> In previous discussions with Teresa, we mentioned recording static
>>> callsite info in the summary -- this can drive more intelligent inline
>>> decisions to enable more linker GC.
>>>
>>
>> Yes, see WIP patch D17656.
>>
>>
>>> Regarding global variable analysis -- I think there is a way for thinLTO
>>> to work nicely in whole program mode.
>>>  The variable reference graph need edge that records access types (read,
>>> write, address taken). The plugin step will merge the information (with
>>> real linker feedback) ..
>>>
>>
>> Based on various potential use cases mentioned here and in D17656 and
>> discussions with Mehdi, I am going to try expanding this to be a
>> full-fledged reference graph. It may not include the access types
>> initially. Need to think about how best to represent the variable and
>> function reference information in the bitcode and the index...
>>
>
> I collected some stats about the number of GV references in functions
> across SPEC cpu2006, both as callees (expanded the current logic to
> identify invokes as well), as well as other GV references. When each
> function is written into bitcode, where I am scanning for current function
> summary info, I iterate over all instruction operands(), and look for any
> that are GlobalValues. Then I check if the function has a callsite and
> depending on whether the GV is the callee or not, increment the appropriate
> counter. I think that should get all of the GV references?
>
> Across all benchmarks and overall, most functions contain at least one
> call (74% overall). The situation for non-callee GV refs is a bit more
> mixed from benchmark to benchmark, but overall across all benchmarks most
> (77%) have none.
>
> Given that, here is some thinking about the best bitcode representation:
>
> For the current patch, each call is at least a tuple (callee value id,
> static #callsites) or a triple including the profile count with PGO. I
> suspect we don't care as much about the static # references or the profile
> count of those references for non-callsite GV references. And as David
> suggested, we may want to include other stats eventually for these normal
> references (e.g. read, write, address taken). So I don't think we want to
> use the same tuple or triple as we are using for the calls.
>
> Additionally, each bitcode record may only contain a single array, at the
> end. So we will need to encode this single array with info in the record on
> how to decode it. E.g. in the current patch I am using different abbrev ids
> for the no call, calls without pgo and calls with pgo cases. But as Mehdi
> noted, as we add additional cases this could result in a combinatoric
> explosion of abbrev ids. And in any case, we would need to note in the
> record how many entries are for callee references vs non-callee GV
> references. Since most functions have at least one call and most have no
> other GV references, we could optimize this a bit by including a VBR to
> indicate the number of non-callee references (in most cases this will be a
> single VBR chunk containing the value 0), and deducing the number of callee
> references from the size of the record minus the number of non-callee
> references. (In fact, I could get rid of the NOCALLS abbrev id variants in
> the current patch using this approach to deduce whether there are any calls
> encoded in the record.)  However, the disadvantage of this approach is that
> it requires an additional VBR per summary record.
>
> Another approach would be to include the references in a separate
> subsequent summary section record. Since most summaries contain calls, we
> may want to continue to include the calls in the main summary record, or
> could also encode those in a separate record type for consistency. The
> issue then becomes how to correlate those records with the main
> corresponding summary record. For per-module summaries, the function
> summary contains the value id, but including that in the reference records
> could require a bigger size overhead. Additionally, the combined summaries
> don't include value ids, they are associated with the corresponding VST
> entry by including the bitcode offset of the summary record in the combined
> VST. However, I think we can handle this by simply requiring that the
> reference records immediately follow the corresponding summary record.
> There is already precedence in the bitcode format for the order of the
> record being important for correlation - specifically the order of the
> MODULE_CODE_GLOBALVAR and MODULE_CODE_FUNCTION records implicitly encodes
> their value ids (for correlation with VST entries).
>
> Given the above, I would suggest that we have 4 abbrev ids for the
> function summary section. Using the PERMODULE case as an example, this
> would be:
>
> FS_PERMODULE_ENTRY: [valueid, linkage, instcount]
> FS_PERMODULE_CALLS: [n x (calleevalueid, callsitecount)]
> FS_PERMODULE_CALLS_PROFILE: [n x (calleevalueid, callsitecount,
> profilecount)]
> FS_PERMODULE_REFS: [n x refvalueid]
>
> where only one of FS_PERMODULE_CALLS/FS_PERMODULE_CALLS_PROFILE would be
> specified, and the ordering for each function would have to be ENTRY then
> the CALLS and REFS records, with no intervening summary records from other
> functions.
>
>
> Isn't it conceptually having a block per function?
> This is something it mentioned when we talked about the combinatorial
> issue with codes.
> What's the drawback of using blocks?
>

Just to clarify, I think you are suggesting something like the following
organization:

   <FUNCTION_SUMMARY_BLOCK ...
      <SUMMARY_BLOCK ...>
         <FS_PERMODULE_ENTRY: [valueid, linkage, instcount]>
         <FS_PERMODULE_CALLS: [n x (calleevalueid, callsitecount)]>
         <FS_PERMODULE_REFS: [n x refvalueid]>
      </SUMMARY_BLOCK ...>
      ...
   </FUNCTION_SUMMARY_BLOCK>

With one block per function?

Does that have any advantage over the scheme where we don't have the
additional level of block and the associated records must be adjacent?

Teresa

> Mehdi
>
> Also, the REFS case could eventually be extended to include other
> information for each refvalueid.
>
> Like I said earlier, though, since most functions have calls, we could
> reduce the required records by including the call array with the other
> summary info (e.g. combine the CALLS with the ENTRY as in the current
> patch). That would save a bit of space (one less abbrev id to define per
> bitcode file, and 3 less bits per function by avoiding specifying the extra
> abbrev id).
>
> Thoughts?
>
> Thanks,
> Teresa
>
>
>> Teresa
>>
>>
>>>
>>> David
>>>
>>> On Fri, Feb 26, 2016 at 11:06 AM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>>
>>>>
>>>> On Feb 26, 2016, at 10:58 AM, Xinliang David Li <davidxl at google.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Feb 26, 2016 at 10:53 AM, Mehdi Amini <mehdi.amini at apple.com>
>>>> wrote:
>>>>
>>>>>
>>>>> > On Feb 26, 2016, at 10:38 AM, Teresa Johnson <tejohnson at google.com>
>>>>> wrote:
>>>>> >
>>>>> > tejohnson added a comment.
>>>>> >
>>>>> > In http://reviews.llvm.org/D17212#362864, @joker.eph wrote:
>>>>> >
>>>>> >> How could we integrate accesses to global variable as part of this?
>>>>> >> It turns out that to be able to benefit from the linker information
>>>>> on what symbol is exported during the import, this is a must.
>>>>> >
>>>>> >
>>>>> > Well without it you still can see which function symbols will be
>>>>> exported, just not the variables, so you are running with less info and I
>>>>> guess need to assume that all static variables will be exposed and promote
>>>>> them.
>>>>>
>>>>> The scheme I am currently setting up is:
>>>>>
>>>>> 1) The linker gives us the list of symbols that need to be "preserved"
>>>>> (can't internalize)
>>>>> 2) Link the combined index
>>>>> 3) Compute the import list for every module *by just looking at the
>>>>> profile*
>>>>> 4) Do the promotion
>>>>>
>>>>> There is absolutely no assumption for the promotion (last step): you
>>>>> exactly know what will be imported by *every module*, and you can promote
>>>>> the optimal minimal amount of symbols.
>>>>>
>>>>> All of that is good and should work with your call-graph patch "as is".
>>>>>
>>>>> I'm looking to go two steps further during stage 3:
>>>>>
>>>>> 1) I want to drive the importing heuristic cost to actually take into
>>>>> account the need for promotion.
>>>>> I'll start some test on the extreme case by *forbiding* any promotion,
>>>>> i.e. if a function references an internal function or global, then it can't
>>>>> be imported in any other module. On the long term it may be interesting to
>>>>> include this in the importing threshold.
>>>>> This can be implemented with a flag or an int in the summary
>>>>> "numberOfInternalGlobalsReferenced", but will not be enough for step 2
>>>>> (below).
>>>>>
>>>>>
>>>> Interesting. What is the motivation for this type of tuning? Does is
>>>> try to preserve more variable analysis (e.g, aliasing) in backend
>>>> compilation?
>>>>
>>>>
>>>> Yes! LLVM optimization bails terribly with global variable, but likes a
>>>> lot "internal" variables. On some other (platform dependent) aspect,
>>>> accessing non-internal globals requires going through the GOT which is not
>>>> the case with internal. Internal globals can be totally eliminated or
>>>> turned into allocas, which in turn enables more other optimization.
>>>> In the same way, in many aspects, the optimizer benefits from internal
>>>> functions.
>>>>
>>>> Some things can be recovered with more effort (for instance in the
>>>> linker plugin for ThinLTO we're running :   internalize + global opt +
>>>> global DCE *before* doing any promotion). But we will fight assumptions in
>>>> the optimizer for a long time.
>>>>
>>>> All of this is driven by our current performance tuning of course.
>>>> We are seeing Full-LTO generating a binary almost two times smaller on
>>>> test-suite/trunk/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp while
>>>> ThinLTO is on the same level as the regular O3.
>>>> (this is a single source benchmark, I wonder if the inliner heuristic
>>>> wouldn't need some tuning for O3 in general).
>>>>
>>>> If you have any other idea or input :)
>>>>
>>>> --
>>>> Mehdi
>>>>
>>>>
>>>>
>>>> 2) I want to benefit from the linker information from stage 1 to
>>>>> internalize symbols.
>>>>> It means that the information about the fact that a function is
>>>>> referencing an internal global *can't be in the summary* because the
>>>>> front-end does not know that the global will be internalized.
>>>>> This can be implemented by not having a "call graph" but a "reference
>>>>> graph" (not sure on the terminology): i.e. edges would be there for any
>>>>> uses of a symbol to another one and not only calls.
>>>>>
>>>>
>>>> Reference graph is what GCC uses :)
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> > To refine that behavior for variables, yes, we'd need additional
>>>>> info in the summary.
>>>>> >
>>>>> > (For davidxl or anyone else who didn't see the IRC conversation,
>>>>> Mehdi is looking at doing pure summary-based importing decisions in the
>>>>> linker step, then giving this info to the ThinLTO backends to avoid
>>>>> promotion of local values that aren't exported. For a distributed build if
>>>>> we wanted to do it this way the importing decisions would all be made in
>>>>> the plugin step, then would need to be serialized out for the distributed
>>>>> backends to check.)
>>>>> >
>>>>> > Two possibilities depending on the fidelity of the info you think
>>>>> you need:
>>>>> >
>>>>> > 1. One possibility is to just put a flag in the function summary if
>>>>> it accesses *any* local variables, and adjust the importing threshold
>>>>> accordingly. Then in the ThinLTO backend for the exporting module you need
>>>>> to check which of your own functions are on the import list, and which
>>>>> local variables they access, and promote accordingly.
>>>>> >
>>>>> > 2. If it will be really beneficial to note exactly which local
>>>>> variables are accessed by which function, we'll need to broaden the edges
>>>>> list to include accesses to variables (I assume you only care about local
>>>>> variables here). E.g. the per-module summary edge list for a function would
>>>>> need to include value ids for any local variables referenced by that
>>>>> function (not sure that the other parts of the triple, the static and
>>>>> profile counts, are needed for that). Then in the combined VST we need to
>>>>> include entries for GUIDs of things that don't have a function summary, but
>>>>> are referenced by these edges. When accessing a function summary edge list
>>>>> for a candidate function to import, you could then see the GUID of any
>>>>> local variables accessed. You wouldn't know them by name, but if for
>>>>> example you wanted a heuristic like "if >=N hot import candidate functions
>>>>> from module M access a local variable with GUID G, go ahead and import
>>>>> those and let G be promoted by the backend (which like in approach #1 needs
>>>>> to check which local variables are accessed by any functions on an import
>>>>> list)".
>>>>> >
>>>>> > Obviously 1) is easier and cheaper space-wise. What are your
>>>>> thoughts?
>>>>>
>>>>>
>>>>> So 1) is cheaper, but 2) a lot more powerful as explained above :)
>>>>>
>>>>
>>>> yes -- we may find other good uses of it.
>>>>
>>>> David
>>>>
>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Mehdi
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>> 408-460-2413
>>
>
>
>
> --
> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
> 408-460-2413
>
>

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |  408-460-2413
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160229/439a5756/attachment.html>