[PATCH] D17212: [ThinLTO] Support for call graph in per-module and combined summary.
Teresa Johnson via llvm-commits
llvm-commits at lists.llvm.org
Tue Mar 1 09:01:45 PST 2016
On Tue, Mar 1, 2016 at 8:42 AM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>
> On Mar 1, 2016, at 8:27 AM, Teresa Johnson <tejohnson at google.com> wrote:
>
>
>
> On Tue, Mar 1, 2016 at 12:11 AM, Mehdi Amini <mehdi.amini at apple.com>
> wrote:
>
>>
>> On Feb 29, 2016, at 12:20 PM, Teresa Johnson <tejohnson at google.com>
>> wrote:
>>
>>
>>
>> On Mon, Feb 29, 2016 at 11:32 AM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Feb 29, 2016, at 8:59 AM, Teresa Johnson <tejohnson at google.com>
>>> wrote:
>>>
>>>
>>>
>>> On Mon, Feb 29, 2016 at 8:45 AM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Feb 29, 2016, at 8:18 AM, Teresa Johnson <tejohnson at google.com>
>>>> wrote:
>>>>
>>>>
>>>>
>>>> On Fri, Feb 26, 2016 at 1:54 PM, Teresa Johnson <tejohnson at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Feb 26, 2016 at 11:18 AM, Xinliang David Li <
>>>>> davidxl at google.com> wrote:
>>>>>
>>>>>> In previous discussions with Teresa, we mentioned recording static
>>>>>> callsite info in the summary -- this can drive more intelligent inline
>>>>>> decisions to enable more linker GC.
>>>>>>
>>>>>
>>>>> Yes, see WIP patch D17656.
>>>>>
>>>>>
>>>>>> Regarding global variable analysis -- I think there is a way for
>>>>>> thinLTO to work nicely in whole program mode.
>>>>>> The variable reference graph need edge that records access types
>>>>>> (read, write, address taken). The plugin step will merge the information
>>>>>> (with real linker feedback) ..
>>>>>>
>>>>>
>>>>> Based on various potential use cases mentioned here and in D17656 and
>>>>> discussions with Mehdi, I am going to try expanding this to be a
>>>>> full-fledged reference graph. It may not include the access types
>>>>> initially. Need to think about how best to represent the variable and
>>>>> function reference information in the bitcode and the index...
>>>>>
>>>>
>>>> I collected some stats about the number of GV references in functions
>>>> across SPEC cpu2006, both as callees (expanded the current logic to
>>>> identify invokes as well), as well as other GV references. When each
>>>> function is written into bitcode, where I am scanning for current function
>>>> summary info, I iterate over all instruction operands(), and look for any
>>>> that are GlobalValues. Then I check if the function has a callsite and
>>>> depending on whether the GV is the callee or not, increment the appropriate
>>>> counter. I think that should get all of the GV references?
>>>>
>>>> Across all benchmarks and overall, most functions contain at least one
>>>> call (74% overall). The situation for non-callee GV refs is a bit more
>>>> mixed from benchmark to benchmark, but overall across all benchmarks most
>>>> (77%) have none.
>>>>
>>>> Given that, here is some thinking about the best bitcode representation:
>>>>
>>>> For the current patch, each call is at least a tuple (callee value id,
>>>> static #callsites) or a triple including the profile count with PGO. I
>>>> suspect we don't care as much about the static # references or the profile
>>>> count of those references for non-callsite GV references. And as David
>>>> suggested, we may want to include other stats eventually for these normal
>>>> references (e.g. read, write, address taken). So I don't think we want to
>>>> use the same tuple or triple as we are using for the calls.
>>>>
>>>> Additionally, each bitcode record may only contain a single array, at
>>>> the end. So we will need to encode this single array with info in the
>>>> record on how to decode it. E.g. in the current patch I am using different
>>>> abbrev ids for the no call, calls without pgo and calls with pgo cases. But
>>>> as Mehdi noted, as we add additional cases this could result in a
>>>> combinatoric explosion of abbrev ids. And in any case, we would need to
>>>> note in the record how many entries are for callee references vs non-callee
>>>> GV references. Since most functions have at least one call and most have no
>>>> other GV references, we could optimize this a bit by including a VBR to
>>>> indicate the number of non-callee references (in most cases this will be a
>>>> single VBR chunk containing the value 0), and deducing the number of callee
>>>> references from the size of the record minus the number of non-callee
>>>> references. (In fact, I could get rid of the NOCALLS abbrev id variants in
>>>> the current patch using this approach to deduce whether there are any calls
>>>> encoded in the record.) However, the disadvantage of this approach is that
>>>> it requires an additional VBR per summary record.
>>>>
>>>> Another approach would be to include the references in a separate
>>>> subsequent summary section record. Since most summaries contain calls, we
>>>> may want to continue to include the calls in the main summary record, or
>>>> could also encode those in a separate record type for consistency. The
>>>> issue then becomes how to correlate those records with the main
>>>> corresponding summary record. For per-module summaries, the function
>>>> summary contains the value id, but including that in the reference records
>>>> could require a bigger size overhead. Additionally, the combined summaries
>>>> don't include value ids, they are associated with the corresponding VST
>>>> entry by including the bitcode offset of the summary record in the combined
>>>> VST. However, I think we can handle this by simply requiring that the
>>>> reference records immediately follow the corresponding summary record.
>>>> There is already precedence in the bitcode format for the order of the
>>>> record being important for correlation - specifically the order of the
>>>> MODULE_CODE_GLOBALVAR and MODULE_CODE_FUNCTION records implicitly encodes
>>>> their value ids (for correlation with VST entries).
>>>>
>>>> Given the above, I would suggest that we have 4 abbrev ids for the
>>>> function summary section. Using the PERMODULE case as an example, this
>>>> would be:
>>>>
>>>> FS_PERMODULE_ENTRY: [valueid, linkage, instcount]
>>>> FS_PERMODULE_CALLS: [n x (calleevalueid, callsitecount)]
>>>> FS_PERMODULE_CALLS_PROFILE: [n x (calleevalueid, callsitecount,
>>>> profilecount)]
>>>> FS_PERMODULE_REFS: [n x refvalueid]
>>>>
>>>> where only one of FS_PERMODULE_CALLS/FS_PERMODULE_CALLS_PROFILE would
>>>> be specified, and the ordering for each function would have to be ENTRY
>>>> then the CALLS and REFS records, with no intervening summary records from
>>>> other functions.
>>>>
>>>>
>>>> Isn't it conceptually having a block per function?
>>>> This is something it mentioned when we talked about the combinatorial
>>>> issue with codes.
>>>> What's the drawback of using blocks?
>>>>
>>>
>>> Just to clarify, I think you are suggesting something like the following
>>> organization:
>>>
>>> <FUNCTION_SUMMARY_BLOCK ...
>>> <SUMMARY_BLOCK ...>
>>> <FS_PERMODULE_ENTRY: [valueid, linkage, instcount]>
>>> <FS_PERMODULE_CALLS: [n x (calleevalueid, callsitecount)]>
>>> <FS_PERMODULE_REFS: [n x refvalueid]>
>>> </SUMMARY_BLOCK ...>
>>> ...
>>> </FUNCTION_SUMMARY_BLOCK>
>>>
>>> With one block per function?
>>>
>>>
>>> Yes.
>>>
>>> Does that have any advantage over the scheme where we don't have the
>>> additional level of block and the associated records must be adjacent?
>>>
>>>
>>> It seems more "structured" and representing more accurately the data
>>> we're storing.
>>> But I don't know all the trade-off associated with it? (for example is
>>> it possible for the record in a SUMMARY_BLOCK to use abbrev id from the
>>> FUNCTION_SUMMARY_BLOCK?)
>>>
>>
>> No, according to
>> http://llvm.org/docs/BitCodeFormat.html#define-abbrev-encoding: a
>> DEFINE_ABBREV "exists inside this immediate block — it is not visible in
>> subblocks or enclosing blocks". However, there is a special case where if
>> the DEFINE_ABBREV is in the module-level BLOCKINFO block it is visible to
>> the whole file. I've been putting the DEFINE_ABBREVS for the summary blocks
>> inside those blocks since they were only needed locally, but that would
>> have to be moved to the BLOCKINFO since we wouldn't want to have to repeat
>> them in every subblock. This has the disadvantage of bloating out the
>> abbrev ID space - right now the typical abbrev ID fits into a 3-bit space
>> and I don't know how close we are to that. If we add more to the BLOCKINFO
>> for example, then DEFINE_ABBREVs in nested blocks use higher numbers.
>>
>> I'm not sure we really need the subblocking shown above. Right now the
>> records per function is well-defined. For example, with the above scheme,
>> each function has a single ENTRY record, followed immediately by up to 1
>> each CALL and REF records.
>>
>> I collected some more stats across SPEC cpu2006 comparing the relative
>> overheads of the different schemes I described above. This assumes that
>> each call contributes 2 entries to each record's array size since it is a
>> tuple (assume non-PGO), and each non-call ref contributes 1 entry to the
>> associated record's array size. I compared
>> 1) keeping both calls and refs in a single record using a VBR to indicate
>> the number of non-call refs (I looked at both 4-bit and 6-bit VBRs since
>> this count is small or 0 in most cases), which means the array size is the
>> total number of calls*2+refs for that function
>> 2) keeping the calls in the main entry (so the array size is the number
>> of calls*2), plus when the ref count is non-zero, a second record requiring
>> a 3-bit abbrev id and where the array size is the number of non-call refs
>> 3) using separate records for both calls and non-call refs. So each
>> record (if the associated count is non-zero) contains the 3-bit abbrev id
>> and the array size is the number of calls*2 or the number of non-call refs,
>> respectively.
>>
>> I computed the actual VBR6 bit size required for each record for the
>> array size in each case.
>>
>> The results show that 2) is clearly smallest across cpu2006. 3) is a bit
>> bigger but might be worth it if splitting out calls into a separate record
>> seems more consistent and clear.
>>
>>
>> I don't have a strong opinion here.
>> How do you plan to take into account references from Globals to function?
>> (like a VTable for instance?)
>>
>
> This is not in the current patch I am working on since the only edges are
> in function summaries. The easiest thing to do in this case is to include a
> bit in the function summary indicating that there are uses that aren't
> within functions (function has a use that is a non-BlockAddress Constant,
> i.e. not on an Instruction), to ensure that such functions are treated
> conservatively with respect to whether they have uses or not. This probably
> belongs in a follow-up patch, e.g. maybe in D17656 which adds the total
> count of static callsites to each function summary in the combined index.
>
>
> What you are describing won't record the reference edges right? IIUC this
> will say if yes or no a function is referenced from somewhere.
>
Right, we would need to add a record to record edges from module-level
global references to such values to get more information, since there is no
associated function summary.
But after sending this I realized we can't just record this flag directly
when compiling into bitcode since the reference could be in another module.
So to capture such module-level global references, we will need to at least
record in the per-module bitcode which value ids are referenced at the
module-level. Either as a list of value ids per module, which would be used
to set the above bit in the combined index function summary, or as a full
reference list of edges from global value ids to the referenced value id.
Do you think it will be much more useful to know which global value ids are
referencing the function at the module level, or just record that there are
such references to the function? I suppose if we want to also record
module-level references to non-function GVs we can't use a function summary
bit anyway.
To get the full info we'd either need to generalize the function summary
section to be a global value section, and include a new record for the
module-level reference list, or create a new summary section for these
references.
Teresa
> --
> Mehdi
>
>
>
>
>
>>
>> I'm not sure offhand how much overhead is added by keeping each
>> function's summary in a separate sunblock.
>>
>>
>> Don't waste time on that, I'm fine with the other solutions.
>>
>> Thanks!
>>
>> --
>> Mehdi
>>
>>
>>
>>
>> Teresa
>>
>>
>>> I don't have a strong opinion on this, just wondering.
>>>
>>> --
>>> Mehdi
>>>
>>>
>>>
>>>> Also, the REFS case could eventually be extended to include other
>>>> information for each refvalueid.
>>>>
>>>> Like I said earlier, though, since most functions have calls, we could
>>>> reduce the required records by including the call array with the other
>>>> summary info (e.g. combine the CALLS with the ENTRY as in the current
>>>> patch). That would save a bit of space (one less abbrev id to define per
>>>> bitcode file, and 3 less bits per function by avoiding specifying the extra
>>>> abbrev id).
>>>>
>>>> Thoughts?
>>>>
>>>> Thanks,
>>>> Teresa
>>>>
>>>>
>>>>> Teresa
>>>>>
>>>>>
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Fri, Feb 26, 2016 at 11:06 AM, Mehdi Amini <mehdi.amini at apple.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Feb 26, 2016, at 10:58 AM, Xinliang David Li <davidxl at google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 26, 2016 at 10:53 AM, Mehdi Amini <mehdi.amini at apple.com
>>>>>>> > wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> > On Feb 26, 2016, at 10:38 AM, Teresa Johnson <
>>>>>>>> tejohnson at google.com> wrote:
>>>>>>>> >
>>>>>>>> > tejohnson added a comment.
>>>>>>>> >
>>>>>>>> > In http://reviews.llvm.org/D17212#362864, @joker.eph wrote:
>>>>>>>> >
>>>>>>>> >> How could we integrate accesses to global variable as part of
>>>>>>>> this?
>>>>>>>> >> It turns out that to be able to benefit from the linker
>>>>>>>> information on what symbol is exported during the import, this is a must.
>>>>>>>> >
>>>>>>>> >
>>>>>>>> > Well without it you still can see which function symbols will be
>>>>>>>> exported, just not the variables, so you are running with less info and I
>>>>>>>> guess need to assume that all static variables will be exposed and promote
>>>>>>>> them.
>>>>>>>>
>>>>>>>> The scheme I am currently setting up is:
>>>>>>>>
>>>>>>>> 1) The linker gives us the list of symbols that need to be
>>>>>>>> "preserved" (can't internalize)
>>>>>>>> 2) Link the combined index
>>>>>>>> 3) Compute the import list for every module *by just looking at the
>>>>>>>> profile*
>>>>>>>> 4) Do the promotion
>>>>>>>>
>>>>>>>> There is absolutely no assumption for the promotion (last step):
>>>>>>>> you exactly know what will be imported by *every module*, and you can
>>>>>>>> promote the optimal minimal amount of symbols.
>>>>>>>>
>>>>>>>> All of that is good and should work with your call-graph patch "as
>>>>>>>> is".
>>>>>>>>
>>>>>>>> I'm looking to go two steps further during stage 3:
>>>>>>>>
>>>>>>>> 1) I want to drive the importing heuristic cost to actually take
>>>>>>>> into account the need for promotion.
>>>>>>>> I'll start some test on the extreme case by *forbiding* any
>>>>>>>> promotion, i.e. if a function references an internal function or global,
>>>>>>>> then it can't be imported in any other module. On the long term it may be
>>>>>>>> interesting to include this in the importing threshold.
>>>>>>>> This can be implemented with a flag or an int in the summary
>>>>>>>> "numberOfInternalGlobalsReferenced", but will not be enough for step 2
>>>>>>>> (below).
>>>>>>>>
>>>>>>>>
>>>>>>> Interesting. What is the motivation for this type of tuning? Does is
>>>>>>> try to preserve more variable analysis (e.g, aliasing) in backend
>>>>>>> compilation?
>>>>>>>
>>>>>>>
>>>>>>> Yes! LLVM optimization bails terribly with global variable, but
>>>>>>> likes a lot "internal" variables. On some other (platform dependent)
>>>>>>> aspect, accessing non-internal globals requires going through the GOT which
>>>>>>> is not the case with internal. Internal globals can be totally eliminated
>>>>>>> or turned into allocas, which in turn enables more other optimization.
>>>>>>> In the same way, in many aspects, the optimizer benefits from
>>>>>>> internal functions.
>>>>>>>
>>>>>>> Some things can be recovered with more effort (for instance in the
>>>>>>> linker plugin for ThinLTO we're running : internalize + global opt +
>>>>>>> global DCE *before* doing any promotion). But we will fight assumptions in
>>>>>>> the optimizer for a long time.
>>>>>>>
>>>>>>> All of this is driven by our current performance tuning of course.
>>>>>>> We are seeing Full-LTO generating a binary almost two times smaller
>>>>>>> on test-suite/trunk/SingleSource/Benchmarks/Adobe-C++/loop_unroll.cpp while
>>>>>>> ThinLTO is on the same level as the regular O3.
>>>>>>> (this is a single source benchmark, I wonder if the inliner
>>>>>>> heuristic wouldn't need some tuning for O3 in general).
>>>>>>>
>>>>>>> If you have any other idea or input :)
>>>>>>>
>>>>>>> --
>>>>>>> Mehdi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 2) I want to benefit from the linker information from stage 1 to
>>>>>>>> internalize symbols.
>>>>>>>> It means that the information about the fact that a function is
>>>>>>>> referencing an internal global *can't be in the summary* because the
>>>>>>>> front-end does not know that the global will be internalized.
>>>>>>>> This can be implemented by not having a "call graph" but a
>>>>>>>> "reference graph" (not sure on the terminology): i.e. edges would be there
>>>>>>>> for any uses of a symbol to another one and not only calls.
>>>>>>>>
>>>>>>>
>>>>>>> Reference graph is what GCC uses :)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> > To refine that behavior for variables, yes, we'd need additional
>>>>>>>> info in the summary.
>>>>>>>> >
>>>>>>>> > (For davidxl or anyone else who didn't see the IRC conversation,
>>>>>>>> Mehdi is looking at doing pure summary-based importing decisions in the
>>>>>>>> linker step, then giving this info to the ThinLTO backends to avoid
>>>>>>>> promotion of local values that aren't exported. For a distributed build if
>>>>>>>> we wanted to do it this way the importing decisions would all be made in
>>>>>>>> the plugin step, then would need to be serialized out for the distributed
>>>>>>>> backends to check.)
>>>>>>>> >
>>>>>>>> > Two possibilities depending on the fidelity of the info you think
>>>>>>>> you need:
>>>>>>>> >
>>>>>>>> > 1. One possibility is to just put a flag in the function summary
>>>>>>>> if it accesses *any* local variables, and adjust the importing threshold
>>>>>>>> accordingly. Then in the ThinLTO backend for the exporting module you need
>>>>>>>> to check which of your own functions are on the import list, and which
>>>>>>>> local variables they access, and promote accordingly.
>>>>>>>> >
>>>>>>>> > 2. If it will be really beneficial to note exactly which local
>>>>>>>> variables are accessed by which function, we'll need to broaden the edges
>>>>>>>> list to include accesses to variables (I assume you only care about local
>>>>>>>> variables here). E.g. the per-module summary edge list for a function would
>>>>>>>> need to include value ids for any local variables referenced by that
>>>>>>>> function (not sure that the other parts of the triple, the static and
>>>>>>>> profile counts, are needed for that). Then in the combined VST we need to
>>>>>>>> include entries for GUIDs of things that don't have a function summary, but
>>>>>>>> are referenced by these edges. When accessing a function summary edge list
>>>>>>>> for a candidate function to import, you could then see the GUID of any
>>>>>>>> local variables accessed. You wouldn't know them by name, but if for
>>>>>>>> example you wanted a heuristic like "if >=N hot import candidate functions
>>>>>>>> from module M access a local variable with GUID G, go ahead and import
>>>>>>>> those and let G be promoted by the backend (which like in approach #1 needs
>>>>>>>> to check which local variables are accessed by any functions on an import
>>>>>>>> list)".
>>>>>>>> >
>>>>>>>> > Obviously 1) is easier and cheaper space-wise. What are your
>>>>>>>> thoughts?
>>>>>>>>
>>>>>>>>
>>>>>>>> So 1) is cheaper, but 2) a lot more powerful as explained above :)
>>>>>>>>
>>>>>>>
>>>>>>> yes -- we may find other good uses of it.
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Mehdi
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Teresa Johnson | Software Engineer | tejohnson at google.com |
>>>>> 408-460-2413
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Teresa Johnson | Software Engineer | tejohnson at google.com |
>>>> 408-460-2413
>>>>
>>>>
>>>
>>>
>>> --
>>> Teresa Johnson | Software Engineer | tejohnson at google.com |
>>> 408-460-2413
>>>
>>>
>>>
>>
>>
>> --
>> Teresa Johnson | Software Engineer | tejohnson at google.com |
>> 408-460-2413
>>
>>
>>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com |
> 408-460-2413
>
>
>
--
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20160301/d54a4d7c/attachment.html>
More information about the llvm-commits
mailing list