[PATCH] D17212: [ThinLTO] Support for call graph in per-module and combined summary.

Teresa Johnson via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 1 08:27:54 PST 2016


On Tue, Mar 1, 2016 at 12:11 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:

>
> On Feb 29, 2016, at 12:20 PM, Teresa Johnson <tejohnson at google.com> wrote:
>
>
>
> On Mon, Feb 29, 2016 at 11:32 AM, Mehdi Amini <mehdi.amini at apple.com>
> wrote:
>
>>
>> On Feb 29, 2016, at 8:59 AM, Teresa Johnson <tejohnson at google.com> wrote:
>>
>>
>>
>> On Mon, Feb 29, 2016 at 8:45 AM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>>
>>>
>>> On Feb 29, 2016, at 8:18 AM, Teresa Johnson <tejohnson at google.com>
>>> wrote:
>>>
>>>
>>>
>>> On Fri, Feb 26, 2016 at 1:54 PM, Teresa Johnson <tejohnson at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Feb 26, 2016 at 11:18 AM, Xinliang David Li <davidxl at google.com
>>>> > wrote:
>>>>
>>>>> In previous discussions with Teresa, we mentioned recording static
>>>>> callsite info in the summary -- this can drive more intelligent inline
>>>>> decisions to enable more linker GC.
>>>>>
>>>>
>>>> Yes, see WIP patch D17656.
>>>>
>>>>
>>>>> Regarding global variable analysis -- I think there is a way for
>>>>> ThinLTO to work nicely in whole-program mode.
>>>>> The variable reference graph needs edges that record access types
>>>>> (read, write, address taken). The plugin step will merge the information
>>>>> (with real linker feedback) ...
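>>>>>
>>>>> For concreteness, a hypothetical shape for such an edge might be
>>>>> (the names here are purely illustrative, not an existing LLVM type):
>>>>>
>>>>>   // Sketch of a reference-graph edge carrying access kinds.
>>>>>   enum RefAccessKind : unsigned {
>>>>>     RAK_Read      = 1u << 0, // referenced value is loaded
>>>>>     RAK_Write     = 1u << 1, // referenced value is stored to
>>>>>     RAK_AddrTaken = 1u << 2, // address escapes via non-load/store use
>>>>>   };
>>>>>   struct RefEdge {
>>>>>     uint64_t RefGUID;     // GUID of the referenced global
>>>>>     unsigned AccessKinds; // bitwise OR of RefAccessKind flags
>>>>>   };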
>>>>>
>>>>
>>>> Based on various potential use cases mentioned here and in D17656 and
>>>> discussions with Mehdi, I am going to try expanding this to be a
>>>> full-fledged reference graph. It may not include the access types
>>>> initially. Need to think about how best to represent the variable and
>>>> function reference information in the bitcode and the index...
>>>>
>>>
>>> I collected some stats about the number of GV references in functions
>>> across SPEC cpu2006, both as callees (expanding the current logic to
>>> identify invokes as well) and as other GV references. At the point where
>>> each function is written into bitcode, where I am already scanning for the
>>> current function's summary info, I iterate over all instruction operands()
>>> and look for any that are GlobalValues. Then I check whether the
>>> instruction is a callsite and, depending on whether the GV is the callee
>>> or not, increment the appropriate counter. I think that should get all of
>>> the GV references?
>>>
>>> Across all benchmarks, most functions contain at least one call (74%
>>> overall). The situation for non-callee GV refs is more mixed from
>>> benchmark to benchmark, but overall most functions (77%) have none.
>>>
>>> Given that, here is some thinking about the best bitcode representation:
>>>
>>> For the current patch, each call is at least a tuple (callee value id,
>>> static #callsites) or a triple including the profile count with PGO. I
>>> suspect we don't care as much about the static # references or the profile
>>> count of those references for non-callsite GV references. And as David
>>> suggested, we may want to include other stats eventually for these normal
>>> references (e.g. read, write, address taken). So I don't think we want to
>>> use the same tuple or triple as we are using for the calls.
>>>
>>> Additionally, each bitcode record may only contain a single array, at
>>> the end. So we will need to encode this single array with info in the
>>> record on how to decode it. E.g. in the current patch I am using different
>>> abbrev ids for the no-calls, calls-without-PGO, and calls-with-PGO cases. But
>>> as Mehdi noted, as we add additional cases this could result in a
>>> combinatoric explosion of abbrev ids. And in any case, we would need to
>>> note in the record how many entries are for callee references vs non-callee
>>> GV references. Since most functions have at least one call and most have no
>>> other GV references, we could optimize this a bit by including a VBR to
>>> indicate the number of non-callee references (in most cases this will be a
>>> single VBR chunk containing the value 0), and deducing the number of callee
>>> references from the size of the record minus the number of non-callee
>>> references. (In fact, I could get rid of the NOCALLS abbrev id variants in
>>> the current patch using this approach to deduce whether there are any calls
>>> encoded in the record.)  However, the disadvantage of this approach is that
>>> it requires an additional VBR per summary record.
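>>>
>>> As a writer-side sketch of that leading-VBR scheme (the record layout
>>> and all names here are illustrative, not a final encoding):
>>>
>>>   // Record = [valueid, linkage, instcount, numrefs,
>>>   //           numrefs x refvalueid,
>>>   //           n x (calleevalueid, callsitecount)]
>>>   // n is deduced from the remaining array size, so no extra abbrev
>>>   // id variants are needed.
>>>   SmallVector<uint64_t, 64> Record;
>>>   Record.push_back(ValueID);
>>>   Record.push_back(EncodedLinkage);
>>>   Record.push_back(InstCount);
>>>   Record.push_back(Refs.size()); // the extra VBR; usually 0
>>>   for (uint64_t RefValueID : Refs)
>>>     Record.push_back(RefValueID);
>>>   for (const auto &Call : Calls) {
>>>     Record.push_back(Call.CalleeValueID);
>>>     Record.push_back(Call.NumCallsites);
>>>   }
>>>   Stream.EmitRecord(FS_PERMODULE_ENTRY_CODE, Record, EntryAbbrevID);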
>>>
>>> Another approach would be to include the references in a separate
>>> subsequent summary section record. Since most summaries contain calls, we
>>> may want to continue to include the calls in the main summary record, or
>>> could also encode those in a separate record type for consistency. The
>>> issue then becomes how to correlate those records with the main
>>> corresponding summary record. For per-module summaries, the function
>>> summary contains the value id, but duplicating it in the reference records
>>> would add size overhead. Additionally, the combined summaries
>>> don't include value ids, they are associated with the corresponding VST
>>> entry by including the bitcode offset of the summary record in the combined
>>> VST. However, I think we can handle this by simply requiring that the
>>> reference records immediately follow the corresponding summary record.
>>> There is already precedent in the bitcode format for record order being
>>> significant for correlation - specifically, the order of the
>>> MODULE_CODE_GLOBALVAR and MODULE_CODE_FUNCTION records implicitly encodes
>>> their value ids (for correlation with VST entries).
>>>
>>> Given the above, I would suggest that we have 4 abbrev ids for the
>>> function summary section. Using the PERMODULE case as an example, this
>>> would be:
>>>
>>> FS_PERMODULE_ENTRY: [valueid, linkage, instcount]
>>> FS_PERMODULE_CALLS: [n x (calleevalueid, callsitecount)]
>>> FS_PERMODULE_CALLS_PROFILE: [n x (calleevalueid, callsitecount,
>>> profilecount)]
>>> FS_PERMODULE_REFS: [n x refvalueid]
>>>
>>> where only one of FS_PERMODULE_CALLS/FS_PERMODULE_CALLS_PROFILE would be
>>> specified, and the ordering for each function would have to be ENTRY then
>>> the CALLS and REFS records, with no intervening summary records from other
>>> functions.
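>>>
>>> On the reader side, correlating by adjacency would look roughly like
>>> this (record codes as above; createSummary/addCalls/addRefs are
>>> made-up helpers):
>>>
>>>   FunctionSummary *Last = nullptr;
>>>   SmallVector<uint64_t, 64> Record;
>>>   while (true) {
>>>     BitstreamEntry Entry = Stream.advanceSkippingSubblocks();
>>>     if (Entry.Kind != BitstreamEntry::Record)
>>>       break;
>>>     Record.clear();
>>>     switch (Stream.readRecord(Entry.ID, Record)) {
>>>     case FS_PERMODULE_ENTRY:
>>>       Last = createSummary(Record); // valueid, linkage, instcount
>>>       break;
>>>     case FS_PERMODULE_CALLS: // must directly follow its ENTRY
>>>       addCalls(Last, Record);
>>>       break;
>>>     case FS_PERMODULE_REFS:  // must directly follow its ENTRY
>>>       addRefs(Last, Record);
>>>       break;
>>>     }
>>>   }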
>>>
>>>
>>> Isn't that conceptually a block per function?
>>> This is something I mentioned when we talked about the combinatorial
>>> issue with codes.
>>> What's the drawback of using blocks?
>>>
>>
>> Just to clarify, I think you are suggesting something like the following
>> organization:
>>
>>    <FUNCTION_SUMMARY_BLOCK ...
>>       <SUMMARY_BLOCK ...>
>>          <FS_PERMODULE_ENTRY: [valueid, linkage, instcount]>
>>          <FS_PERMODULE_CALLS: [n x (calleevalueid, callsitecount)]>
>>          <FS_PERMODULE_REFS: [n x refvalueid]>
>>       </SUMMARY_BLOCK ...>
>>       ...
>>    </FUNCTION_SUMMARY_BLOCK>
>>
>> With one block per function?
>>
>>
>> Yes.
>>
>> Does that have any advantage over the scheme where we don't have the
>> additional level of block and the associated records must be adjacent?
>>
>>
>> It seems more "structured" and representing more accurately the data
>> we're storing.
>> But I don't know all the trade-offs associated with it (for example, is it
>> possible for a record in a SUMMARY_BLOCK to use an abbrev id from the
>> FUNCTION_SUMMARY_BLOCK?).
>>
>
> No, according to
> http://llvm.org/docs/BitCodeFormat.html#define-abbrev-encoding: a
> DEFINE_ABBREV "exists inside this immediate block — it is not visible in
> subblocks or enclosing blocks". However, there is a special case where if
> the DEFINE_ABBREV is in the module-level BLOCKINFO block it is visible to
> the whole file. I've been putting the DEFINE_ABBREVs for the summary blocks
> inside those blocks since they were only needed locally, but they would
> have to be moved to the BLOCKINFO, since we wouldn't want to have to repeat
> them in every subblock. This has the disadvantage of bloating the abbrev ID
> space - right now the typical abbrev ID fits into a 3-bit space, and I
> don't know how close we are to that limit. If we add more abbrevs to the
> BLOCKINFO, for example, then DEFINE_ABBREVs in nested blocks would have to
> use higher numbers.
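>
> For reference, moving an abbrev definition into BLOCKINFO would look
> roughly like this with the BitstreamWriter API (the block and record
> names are placeholders):
>
>    Stream.EnterBlockInfoBlock(2);
>    BitCodeAbbrev *Abbv = new BitCodeAbbrev();
>    Abbv->Add(BitCodeAbbrevOp(FS_PERMODULE_ENTRY));
>    Abbv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::VBR, 8));   // valueid
>    Abbv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::Fixed, 4)); // linkage
>    Abbv->Add(BitCodeAbbrevOp(BitCodeAbbrevOp::VBR, 8));   // instcount
>    // Now visible in every function summary block in the file:
>    unsigned EntryAbbrevID =
>        Stream.EmitBlockInfoAbbrev(FUNCTION_SUMMARY_BLOCK_ID, Abbv);
>    Stream.ExitBlock();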
>
> I'm not sure we really need the subblocking shown above. Right now the set
> of records per function is well-defined. For example, with the above
> scheme, each function has a single ENTRY record, followed immediately by at
> most one each of the CALLS and REFS records.
>
> I collected some more stats across SPEC cpu2006 comparing the relative
> overheads of the different schemes I described above. This assumes that
> each call contributes 2 entries to the record's array size since it is a
> tuple (assuming non-PGO), and each non-call ref contributes 1 entry to the
> associated record's array size. I compared:
> 1) keeping both calls and refs in a single record, using a VBR to indicate
> the number of non-call refs (I looked at both 4-bit and 6-bit VBRs since
> this count is small or 0 in most cases), which means the array size is the
> total number of calls*2+refs for that function;
> 2) keeping the calls in the main entry record (so the array size is the
> number of calls*2), plus, when the ref count is non-zero, a second record
> requiring a 3-bit abbrev id and whose array size is the number of non-call
> refs;
> 3) using separate records for both calls and non-call refs, so each record
> (if the associated count is non-zero) contains the 3-bit abbrev id and the
> array size is the number of calls*2 or the number of non-call refs,
> respectively.
>
> I computed the actual VBR6 bit size required for the array in each record,
> for each case.
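>
> (Each N-bit VBR chunk carries N-1 payload bits plus a continuation bit,
> so the per-value cost can be computed with something like:)
>
>    // Bits needed to emit Val as a VBR-N in the bitstream.
>    static unsigned vbrBits(uint64_t Val, unsigned N) {
>      unsigned Chunks = 1;
>      while (Val >>= (N - 1))
>        ++Chunks;
>      return Chunks * N;
>    }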
>
> The results show that 2) is clearly smallest across cpu2006. 3) is a bit
> bigger but might be worth it if splitting out calls into a separate record
> seems more consistent and clear.
>
>
> I don't have a strong opinion here.
> How do you plan to take into account references from globals to functions
> (like a vtable, for instance)?
>

This is not in the current patch I am working on, since the only edges are
in function summaries. The easiest thing to do in this case is to include a
bit in the function summary indicating that the function has uses that
aren't within functions (i.e. a use that is a non-BlockAddress Constant,
not on an Instruction), so that such functions are treated conservatively
with respect to whether they have uses. This probably belongs in a
follow-up patch, e.g. maybe in D17656, which adds the total count of static
callsites to each function summary in the combined index.
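
A sketch of computing such a bit (this is just the check described above,
not code from the patch):

   // True if F has a use that is a non-BlockAddress Constant, i.e. a
   // use that is not on an Instruction (e.g. a vtable slot).
   static bool hasNonInstructionUse(const Function &F) {
     for (const User *U : F.users())
       if (isa<Constant>(U) && !isa<BlockAddress>(U))
         return true;
     return false;
   }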

>
>
> I'm not sure offhand how much overhead is added by keeping each function's
> summary in a separate subblock.
>
>
> Don't waste time on that, I'm fine with the other solutions.
>
> Thanks!
>
> --
> Mehdi
>
>
>
>
> Teresa
>
>
>> I don't have a strong opinion on this, just wondering.
>>
>> --
>> Mehdi
>>
>>
>>
>>> Also, the REFS case could eventually be extended to include other
>>> information for each refvalueid.
>>>
>>> Like I said earlier, though, since most functions have calls, we could
>>> reduce the required records by including the call array with the other
>>> summary info (e.g. combine the CALLS with the ENTRY as in the current
>>> patch). That would save a bit of space (one fewer abbrev id to define per
>>> bitcode file, and 3 fewer bits per function by avoiding specifying the
>>> extra abbrev id).
>>>
>>> Thoughts?
>>>
>>> Thanks,
>>> Teresa
>>>
>>>
>>>> Teresa
>>>>
>>>>
>>>>>
>>>>> David
>>>>>
>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Mehdi Amini <mehdi.amini at apple.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Feb 26, 2016, at 10:58 AM, Xinliang David Li <davidxl at google.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 10:53 AM, Mehdi Amini <mehdi.amini at apple.com>
>>>>>>  wrote:
>>>>>>
>>>>>>>
>>>>>>> > On Feb 26, 2016, at 10:38 AM, Teresa Johnson <tejohnson at google.com>
>>>>>>> wrote:
>>>>>>> >
>>>>>>> > tejohnson added a comment.
>>>>>>> >
>>>>>>> > In http://reviews.llvm.org/D17212#362864, @joker.eph wrote:
>>>>>>> >
>>>>>>> >> How could we integrate accesses to global variables as part of
>>>>>>> this?
>>>>>>> >> It turns out that this is a must in order to benefit, during
>>>>>>> importing, from the linker's information about which symbols are
>>>>>>> exported.
>>>>>>> >
>>>>>>> >
>>>>>>> > Well, without it you can still see which function symbols will be
>>>>>>> exported, just not the variables, so you are running with less info and I
>>>>>>> guess you need to assume that all static variables will be exposed and
>>>>>>> promote them.
>>>>>>>
>>>>>>> The scheme I am currently setting up is:
>>>>>>>
>>>>>>> 1) The linker gives us the list of symbols that need to be
>>>>>>> "preserved" (can't internalize)
>>>>>>> 2) Link the combined index
>>>>>>> 3) Compute the import list for every module *by just looking at the
>>>>>>> profile*
>>>>>>> 4) Do the promotion
>>>>>>>
>>>>>>> There is absolutely no assumption for the promotion (last step): you
>>>>>>> know exactly what will be imported by *every module*, and you can
>>>>>>> promote the minimal amount of symbols.
>>>>>>>
>>>>>>> All of that is good and should work with your call-graph patch "as
>>>>>>> is".
>>>>>>>
>>>>>>> I'm looking to go two steps further during stage 3:
>>>>>>>
>>>>>>> 1) I want to drive the importing heuristic cost to actually take
>>>>>>> into account the need for promotion.
>>>>>>> I'll start with some tests on the extreme case by *forbidding* any
>>>>>>> promotion, i.e. if a function references an internal function or global,
>>>>>>> then it can't be imported into any other module. In the long term it may
>>>>>>> be interesting to include this in the importing threshold.
>>>>>>> This can be implemented with a flag or an int in the summary, e.g.
>>>>>>> "numberOfInternalGlobalsReferenced", but that will not be enough for
>>>>>>> step 2 (below).
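>>>>>>>
>>>>>>> Sketched as an importer-side check (the field is the one named
>>>>>>> above; the surrounding API is hypothetical):
>>>>>>>
>>>>>>>   // Extreme "no promotion" policy: refuse to import a function
>>>>>>>   // whose summary records references to internal globals, since
>>>>>>>   // importing it would force promoting those globals.
>>>>>>>   static bool eligibleForImport(const FunctionSummary &S) {
>>>>>>>     return S.numberOfInternalGlobalsReferenced == 0;
>>>>>>>   }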
>>>>>>>
>>>>>>>
>>>>>> Interesting. What is the motivation for this type of tuning? Does it
>>>>>> try to preserve more variable analysis (e.g., aliasing) in the backend
>>>>>> compilation?
>>>>>>
>>>>>>
>>>>>> Yes! LLVM optimizations bail terribly on global variables, but like
>>>>>> "internal" variables a lot. On another (platform-dependent) note,
>>>>>> accessing non-internal globals requires going through the GOT, which is
>>>>>> not the case for internal ones. Internal globals can be totally
>>>>>> eliminated or turned into allocas, which in turn enables further
>>>>>> optimizations. In the same way, in many respects, the optimizer benefits
>>>>>> from internal functions.
>>>>>>
>>>>>> Some things can be recovered with more effort (for instance, in the
>>>>>> linker plugin for ThinLTO we're running internalize + global opt +
>>>>>> global DCE *before* doing any promotion). But we will be fighting
>>>>>> assumptions in the optimizer for a long time.
>>>>>>
>>>>>> All of this is driven by our current performance tuning of course.
>>>>>> We are seeing full LTO generate a binary almost two times smaller on
>>>>>> test-suite/trunk/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp, while
>>>>>> ThinLTO is on the same level as regular O3.
>>>>>> (This is a single-source benchmark; I wonder whether the inliner
>>>>>> heuristic needs some tuning for O3 in general.)
>>>>>>
>>>>>> If you have any other idea or input :)
>>>>>>
>>>>>> --
>>>>>> Mehdi
>>>>>>
>>>>>>
>>>>>>
>>>>>> 2) I want to benefit from the linker information from stage 1 to
>>>>>>> internalize symbols.
>>>>>>> That means the information that a function references an internal
>>>>>>> global *can't be in the summary*, because the front-end does not know
>>>>>>> that the global will be internalized.
>>>>>>> This can be implemented by having not a "call graph" but a
>>>>>>> "reference graph" (not sure about the terminology): i.e. edges would
>>>>>>> exist for any use of one symbol by another, not only for calls.
>>>>>>>
>>>>>>
>>>>>> Reference graph is what GCC uses :)
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> > To refine that behavior for variables, yes, we'd need additional
>>>>>>> info in the summary.
>>>>>>> >
>>>>>>> > (For davidxl or anyone else who didn't see the IRC conversation,
>>>>>>> Mehdi is looking at doing pure summary-based importing decisions in the
>>>>>>> linker step, then giving this info to the ThinLTO backends to avoid
>>>>>>> promotion of local values that aren't exported. For a distributed build,
>>>>>>> if we wanted to do it this way, the importing decisions would all be made
>>>>>>> in the plugin step, then would need to be serialized out for the distributed
>>>>>>> the plugin step, then would need to be serialized out for the distributed
>>>>>>> backends to check.)
>>>>>>> >
>>>>>>> > Two possibilities depending on the fidelity of the info you think
>>>>>>> you need:
>>>>>>> >
>>>>>>> > 1. One possibility is to just put a flag in the function summary
>>>>>>> if it accesses *any* local variables, and adjust the importing threshold
>>>>>>> accordingly. Then in the ThinLTO backend for the exporting module you need
>>>>>>> to check which of your own functions are on the import list, and which
>>>>>>> local variables they access, and promote accordingly.
>>>>>>> >
>>>>>>> > 2. If it will be really beneficial to note exactly which local
>>>>>>> variables are accessed by which function, we'll need to broaden the edges
>>>>>>> list to include accesses to variables (I assume you only care about local
>>>>>>> variables here). E.g. the per-module summary edge list for a function would
>>>>>>> need to include value ids for any local variables referenced by that
>>>>>>> function (not sure that the other parts of the triple, the static and
>>>>>>> profile counts, are needed for that). Then in the combined VST we need to
>>>>>>> include entries for GUIDs of things that don't have a function summary, but
>>>>>>> are referenced by these edges. When accessing a function summary edge list
>>>>>>> for a candidate function to import, you could then see the GUID of any
>>>>>>> local variables accessed. You wouldn't know them by name, but if for
>>>>>>> example you wanted a heuristic like "if >=N hot import candidate functions
>>>>>>> from module M access a local variable with GUID G, go ahead and import
>>>>>>> those and let G be promoted by the backend (which like in approach #1 needs
>>>>>>> to check which local variables are accessed by any functions on an import
>>>>>>> list)".
>>>>>>> >
>>>>>>> > Obviously 1) is easier and cheaper space-wise. What are your
>>>>>>> thoughts?
>>>>>>>
>>>>>>>
>>>>>>> So 1) is cheaper, but 2) is a lot more powerful, as explained above :)
>>>>>>>
>>>>>>
>>>>>> yes -- we may find other good uses of it.
>>>>>>
>>>>>> David
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Mehdi
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>
>
>
>


-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |  408-460-2413

