[PATCH] D11722: [ThinLTO] Bitcode reading/writing support for ThinLTO function summary/index
Teresa Johnson via llvm-commits
llvm-commits at lists.llvm.org
Wed Aug 19 11:51:40 PDT 2015
On Wed, Aug 19, 2015 at 10:12 AM, Duncan P. N. Exon Smith
<dexonsmith at apple.com> wrote:
>
>> On 2015-Aug-19, at 09:07, Teresa Johnson <tejohnson at google.com> wrote:
>>
>> On Tue, Aug 18, 2015 at 7:58 PM, Duncan P. N. Exon Smith
>> <dexonsmith at apple.com> wrote:
>>>
>>>> On 2015-Aug-13, at 10:22, Teresa Johnson <tejohnson at google.com> wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Aug 13, 2015 at 9:42 AM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>>>
>>>>> On 2015-Aug-12, at 21:23, Teresa Johnson <tejohnson at google.com> wrote:
>>>>>
>>>>> tejohnson updated the summary for this revision.
>>>>> tejohnson updated this revision to Diff 32027.
>>>>> tejohnson added a comment.
>>>>>
>>>>> Removed native object wrapper support from this patch.
>>>>>
>>>>>
>>>>> http://reviews.llvm.org/D11722
>>>>>
>>>>> Files:
>>>>> include/llvm/Bitcode/LLVMBitCodes.h
>>>>> include/llvm/Bitcode/ReaderWriter.h
>>>>> lib/Bitcode/Reader/BitcodeReader.cpp
>>>>> lib/Bitcode/Writer/BitcodeWriter.cpp
>>>>> tools/llvm-bcanalyzer/llvm-bcanalyzer.cpp
>>>>>
>>>>> <D11722.32027.patch>
>>>>
>>>> (Sorry I'm so late getting to this patch.)
>>>>
>>>> Hi Duncan,
>>>> Thanks for the comments.
>>>>
>>>>
>>>> I'm concerned about the direction. IIUC, the purpose of this is faster
>>>> lazy-loading. Why do this in a side-channel? Why not optimize the
>>>> existing lazy-loader? I imagine the LTO model implemented by the gold
>>>> plugin would benefit from this, for example. If we need an extra index
>>>> to make lazy-loading "fast enough", it seems like the index should be
>>>> available (as an opt-in for time/space tradeoff) for other consumers as
>>>> well. Alternatively, maybe there's something we can change about how
>>>> the existing lazy-loader works (or the layout of bitcode) to be more
>>>> amenable to the partial loading ThinLTO needs.
>>>>
>>>> Only one part of the index (the function's bitcode offset) is used to do the fast loading of the function from a given module. Much of the other information in the ThinLTO sections is used to identify/locate the module to import from (via the module strtab index, which is populated when creating the combined (global) index), and to make importing decisions about functions in another module (like the function's size, or hotness when there is profile data, which we don't otherwise have since we haven't yet parsed that other module).
>>>>
>>>> The other issue with using the existing lazy loading support is that we may import from a number of other modules in some interleaved fashion, so we may open a module, import a function from it, and close it multiple times. My understanding is that the lazy loading is more for the case of loading a few different functions in sequence.
>>>>
>>>> So for example, if we have:
>>>>
>>>> A.cc:
>>>>
>>>> void a() {
>>>>   b1();
>>>>   ...
>>>>   c1();
>>>>   ...
>>>>   b2();
>>>>   ...
>>>>   c2();
>>>> }
>>>>
>>>> B.cc:
>>>>
>>>> void b1() { d1(); }
>>>> void b2() { ... }
>>>>
>>>> C.cc:
>>>>
>>>> void c1() { d2(); }
>>>> void c2() { ... }
>>>>
>>>> D.cc:
>>>>
>>>> void d1() { ... }
>>>> void d2() { ... }
>>>>
>>>>
>>>> When compiling A.cc through the parallel backend's importing pass, we may have the following sequence of events:
>>>>
>>>> Consider importing b1 -> Decide to import b1 (exposes call to d1)
>>>> Consider importing d1 -> Decide to import d1
>>>> Consider importing c1 -> Decide to import c1 (exposes call to d2)
>>>> Consider importing d2 -> Decide to import d2
>>>> Consider importing b2 -> Decide *not* to import b2
>>>> Consider importing c2 -> Decide to import c2
>>>>
>>>> For each of the inter-module calls, considered in some priority-based order, we would look up the callee in the combined index, which provides information such as the callee's size, hotness, etc. A decision on whether to import is made based on that information (without doing any parsing of the callee's module). If we decide to import, then the combined index entry for the callee provides the callee's module path and the bitcode index, which is used to open the module and import just that function; the callee's module is then closed. Some of the lazy loading support could potentially be used in place of the bitcode index, although I am not convinced it is suited for that purpose. In any case, that is just a small part of what is contained in the function summary/index, as described earlier.
>>>>
>>>> On the flip side, the bitcode index could be used by the lazy loader, in place of building the index on the fly. I'm not sure how much benefit there is to lazy loading, which eventually needs to read the entire module once a function is materialized.
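To make the interleaved flow above concrete: the importing loop I have
in mind would look roughly like the following. Every type and helper
here (SummaryEntry, lookupInCombinedIndex, etc.) is an illustrative
stand-in, not code from the patch.
--
#include <cstdint>
#include <queue>
#include <string>

// Hypothetical combined-index entry for one function.
struct SummaryEntry {
  std::string ModulePath; // from the module path string table
  uint64_t BitcodeOffset; // where the function's body block lives
  unsigned InstCount;     // size proxy for the import heuristic
  bool Hot;               // e.g. derived from profile data
};

// Hypothetical inter-module callsite, ordered by importing priority.
struct CallSite {
  std::string Callee;
  unsigned Priority;
  bool operator<(const CallSite &O) const { return Priority < O.Priority; }
};

const SummaryEntry *lookupInCombinedIndex(const std::string &Name);
void importSingleFunction(const std::string &ModulePath, uint64_t Offset,
                          const std::string &Name);
void enqueueNewCalls(const std::string &Callee,
                     std::priority_queue<CallSite> &Work);

// Consider inter-module calls in priority order; no other module is
// parsed until we actually decide to import from it.
void runImports(std::priority_queue<CallSite> &Work) {
  const unsigned ImportThreshold = 100; // arbitrary size cutoff
  while (!Work.empty()) {
    CallSite CS = Work.top();
    Work.pop();
    const SummaryEntry *S = lookupInCombinedIndex(CS.Callee);
    if (!S || (S->InstCount > ImportThreshold && !S->Hot))
      continue; // decide *not* to import (like b2 above)
    // Open the callee's module, materialize just this one function
    // at its recorded bitcode offset, then close the module again.
    importSingleFunction(S->ModulePath, S->BitcodeOffset, CS.Callee);
    // Importing may expose new inter-module calls (like b1 -> d1).
    enqueueNewCalls(CS.Callee, Work);
  }
}
--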
>>>
>>> It seems generally useful to be able to cherry-pick functions out of a
>>> large bitcode file without scanning the whole thing. It's really cool
>>> that you can tear down and build up the BitcodeReader, swapping different
>>> modules in and out.
>>>
>>> I don't like this being tied to ThinLTO, since I think it's useful for
>>> other consumers as well. To start, the old/normal lazy loader should
>>> use the index when it's available. I'm sure other people will think of
>>> other crazy uses.
>>
>> I've been thinking this morning about the best way to leverage the
>> ThinLTO index in the existing lazy loader and have some ideas. But I
>> think that just applies to using the bitcode offset of the function
>> body block.
>
> That might be it for now (besides metadata, addressed below).
>
>> Do you think that it is interesting to put the rest of the summary
>> information I am going to use for importing decisions like the
>> instruction count, profile hotness, other function characteristics
>> (queried from other modules, via the combined index) in the bitcode in
>> a non-ThinLTO-specific way? Perhaps they could still be optional and
>> currently only produced with -fthinlto, but the block/record names not
>> contain "ThinLTO".
>
> That could work. I imagine other LTO use cases (e.g., linker-driven
> partitioning) would also want the profile information.
Here's what I am currently thinking:
In order to make the function's bitcode index usable for normal lazy
function loading, move it out of the function summary block and into
the value symbol table (VST). The current VST code entry records
contain (valueid, value name string); this would be expanded to
(valueid, bitcode offset, value name string). The VST is positioned in
the bitcode right before the function blocks, so the
DeferredFunctionInfo map can be populated directly from these bitcode
offsets. Since the VST reading locates the GVs for each entry, I think
the existing FunctionsWithBodies used for lazy reading could go, along
with rememberAndSkipFunctionBody and findFunctionInStream.
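In rough writer/reader terms, the change would look something like the
sketch below. The record and variable names (VST_CODE_FNENTRY,
FunctionIndex) are placeholders, not a final encoding.
--
// Writer side (sketch): emit (valueid, offset, name) per function
// with a body, where today we emit (valueid, name).
SmallVector<uint64_t, 64> Vals;
Vals.push_back(VE.getValueID(&F));  // valueid
Vals.push_back(FunctionIndex[&F]);  // bitcode offset of F's body block
for (char C : F.getName())          // value name string
  Vals.push_back((unsigned char)C);
Stream.EmitRecord(bitc::VST_CODE_FNENTRY, Vals);

// Reader side (sketch): seed the lazy-reading map directly from the
// record, replacing the current scan over the function blocks.
if (auto *F = dyn_cast<Function>(ValueList[Record[0]]))
  DeferredFunctionInfo[F] = Record[1];
--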
For ThinLTO, instead of needing the importer to pass in both the
function to be imported and its bitcode offset, we only need the
function: ThinLTO also reads the VST for the module being imported
from, and gets the bitcode offset that way. It then has the same
behavior as lazy function reading, but materializes only the single
function of interest.
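The importer side then reduces to something like this (a sketch:
exact signatures elided, error handling omitted, and Buffer/CalleeName
stand in for the importer's inputs):
--
// Open the source module lazily; with the VST change above, the
// reader has every function's body offset up front.
auto SrcM = getLazyBitcodeModule(std::move(Buffer), Context);
// Materialize only the function being imported; nothing else in the
// module needs to be read.
Function *F = (*SrcM)->getFunction(CalleeName);
F->materialize(); // reads just F's body
// ...map/link F into the destination module, then close SrcM...
--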
We still need the function summary data for importing decisions
(currently in the THINLTO_FUNCTION_SUMMARY_BLOCK), but the block can
be renamed to the more generic FUNCTION_SUMMARY_BLOCK. It no longer
contains the function's bitcode offset (just the index into the module
path string table and the summary data).
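As a sketch, a summary record would then carry roughly the following
(field names illustrative):
--
// FUNCTION_SUMMARY_BLOCK entry (sketch); the bitcode offset is gone,
// since it now lives in the VST:
//   [module path strtab index,  // combined index only
//    instruction count,         // size proxy for import decisions
//    hotness]                   // when profile data is available
--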
We also still need the module path string table (though it only shows
up in the combined index, not in the per-module index). It still seems
specific to ThinLTO, but we could be consistent and rename it to
MODULE_STRTAB_BLOCK (i.e. strip off the leading THINLTO_).
Finally, we still need a way to associate each VST entry with the
corresponding function summary block offset. This was in the
THINLTO_SYMTAB_BLOCK. If we are making all of this non-ThinLTO
specific, I think it makes the most sense to merge this info into the
VST, but as separate records that are only enabled currently under
-fthinlto. E.g. the new optional VST records would be something like
VST_SUMMARY_ENTRY and map from the corresponding VST_CODE_ENTRY's
valueid to the summary block's offset for that function. Alternatively
this could be merged with the VST_CODE_ENTRY, but I didn't want to
bloat out those records by default.
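In record terms, something like (names and layout illustrative):
--
// Existing record, unchanged by default:
//   VST_CODE_ENTRY:    [valueid, namechar x N]
// New optional record, keyed by the same valueid and currently
// emitted only under -fthinlto:
//   VST_SUMMARY_ENTRY: [valueid, offset of this function's record in
//                       FUNCTION_SUMMARY_BLOCK]
--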
And the outer THINLTO_BLOCK encapsulating all the old ThinLTO blocks goes away.
>
>>>
>>> One possible use: could llvm-ar use this index (when available) instead
>>> of walking through all the functions?
>>
>> I couldn't find where llvm-ar walks through the functions. What ar
>> operation is doing that?
>
> Oops, sorry. I meant llvm-nm, which uses IRObjectFile.
Got it. I looked at llvm-nm and it builds a lazy bitcode reader
object. I checked and it doesn't appear to materialize any functions
when walking the symbols and printing their types. So I don't know
that the bitcode offset / lazy function reading changes discussed
above help much, unless you are thinking of putting some of the symbol
type info directly into the summary data so that no GVs need to be
created. Offhand I'm not sure how much that helps (the GVs are already
created by the time we parse the VST, e.g. when reading the
MODULE_CODE_FUNCTION records with the decls).
>
>>>> Not included in this patch is support for lazy-loading of metadata,
>>>> something you suggested you had support for in your prototype (and I
>>>> assume ThinLTO relies on it). I think it's particularly important that
>>>> whatever you do there isn't ThinLTO-specific.
>>>>
>>>>
>>>> Right, that is needed when we do the importing. This initial patch is just to write/read the function summary/index. I had briefly outlined the way I was hoping to stage the patches earlier today (http://lists.llvm.org/pipermail/llvm-dev/2015-August/089222.html). I was planning to do the importing as a separate set of patches once I have the infrastructure for generating/reading the function indexes, as that is needed first to drive the importing. Let me know if the staging of work I describe there makes sense.
>>>
>>> (What's difficult for me with the staging is how many threads are going
>>> on in parallel. This stuff is all related, and some of the patches
>>> don't make sense outside the context of the others, and if we need
>>> changes in one, that'll influence the design of the others. I don't
>>> really see why the review isn't all in one place. Maybe I just need to
>>> stop worrying and learn to love my inbox, though... and I guess if I'd
>>> responded more promptly to the RFCs you wouldn't have unleashed all the
>>> threads.)
>>
>> Yea, I definitely see your point. I thought breaking it up into
>> multiple smaller patches would make it easier to review. But it
>> certainly would be easier for me as well to have a single larger patch
>> as I keep having to update multiple patches when I rename things or
>> make other data structure changes in one patch.
>>
>> Do you think I should kill these individual patches off and combine
>> into one larger patch when I make the next round of changes?
>
> When I'm in a similar position, I tend to send one review thread, with
> the individual commits attached (0001-abc.patch, 0002-def.patch, etc.)
> as well as the combined patch (all.patch). Because this approach
> violates Phab's view of the world (Phab seems to be pretty popular),
> it's somewhat uncommon.
Perhaps I should not use Phab in this type of situation going forward.
David also suggested I create some kind of website to host links to
all the relevant RFCs, other docs, and patches, and keep that up to
date. I'll take a look at doing that.
>
> Ideally, stuff this big gets a high-level sign off before the patches
> are even written, and then you can post patches (usually for
> post-commit review if there's nothing too contentious) as you write the
> code. It doesn't always work that way in practice, but that kind of
> incremental development is the goal codified in the developer policy:
> http://llvm.org/docs/DeveloperPolicy.html#incremental-development
Yes, getting to the point where I can do smaller post-commit changes
would be great.
>
>>>> To facilitate the process of reading/materializing a single function each time a module is imported from, as described above, I will be parsing the module-level metadata once as a post-pass. Otherwise, supporting the interleaved importing from multiple modules is probably very difficult.
>>>
>>> I seem to remember that you only pull in the necessary metadata? I'm
>>> wondering how you do this, particularly for debug info, which has all
>>> sorts of circular references. In particular, !llvm.dbg.cu seems like it
>>> could pull in all the subprograms (and their variables) even if you've
>>> just cherry-picked a couple of functions. How do you get around that?
>>
>> So there are a lot of circular references, and therefore a fair
>> amount is still pulled in (I'm not sure how much has changed here
>> since I did my prototype work very early this year, but this was an
>> issue for types if I remember correctly). But llvm.dbg.cu is handled
>> (specifically, the list of subprograms hanging off of the
>> DICompileUnit). When doing the post-importing pass to link the
>> needed metadata from a module we imported from, I have a list of all
>> functions that were imported, and I also recursively find any other
>> DISubProgram functions reached from metadata in the imported
>> functions. This is passed to the ValueMapper, which then maps in
>> DISubProgram MD nodes and their descendants only for the referenced
>> functions.
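Conceptually, the gathering step is something like the sketch below
(much simplified, not the actual prototype code; findSubprogramFor is
a hypothetical helper, and real code would also walk scope and
inlined-at chains up to the enclosing DISubprogram):
--
// Collect the DISubprograms for the imported functions, plus any
// others reachable from debug locations inside their bodies, so the
// ValueMapper maps only those and skips the rest of the CU's list.
SmallPtrSet<const MDNode *, 8> NeededSubprograms;
for (Function *F : ImportedFunctions) {
  if (MDNode *SP = findSubprogramFor(F)) // hypothetical lookup
    NeededSubprograms.insert(SP);
  for (BasicBlock &BB : *F)
    for (Instruction &I : BB)
      if (DebugLoc DL = I.getDebugLoc())
        NeededSubprograms.insert(DL.getScope());
}
--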
>
> Interesting, this should work. Do you filter the types as well somehow?
Nothing special for types. Any that aren't referenced by a referenced
DISubProgram wouldn't get pulled in. But I believe I saw a lot of
circular refs between types, so I was probably pulling in a fair
amount.
>
>>> (I've been planning some debug info schema changes to make this sort of
>>> thing easier, but were you able to do this without them? How?
>>> (Frankly, it'd be helpful to look at the complete prototype (warts and
>>> all) if it's available somewhere.))
>>
>> It isn't available anywhere, and I'm very reluctant to do so as it was
>> very much a prototype and my first LLVM work. It has a lot of cruft
>> like tons of debugging/tracing code, stuff that isn't needed anymore
>> (some of it commented out but left in with big notes about why I went
>> with a different approach), and I didn't put much thought into nice
>> data structures.
>
> I get it; no one likes to show their nasty prototypes. IMO, seeing the
> development process (including abandoned ideas) might help interested
> parties better understand what you're proposing (and frankly it would be
> valuable to hack on it and collect our own numbers). Up to you, of course.
>
>> Perhaps I should just tackle the implementation of the importing
>> mechanics next. Unfortunately, getting anything that is testable in a
>> good way requires implementing the linkage changes/static promotion
>> support and the importing pass and other things. But perhaps I can
>> come up with a simple way to test/demonstrate the importing mechanics
>> without all of that.
>
> I suggest extending `llvm-link` to do the importing. I could imagine a
> RUN line like:
> --
> llvm-link a.ll -import=f:b.ll -import=g:c.ll -import=h:b.ll -S < %s | FileCheck %s
> --
> That would link a.ll into the destination module, and then cherry-pick
> @f from b.ll, @g from c.ll, and @h from b.ll again.
>
> Maybe that wouldn't test quite the right thing (and I don't have a
> specific idea for testing static promotion), but my point is that tools
> like `llvm-link` and `llvm-lto` exist to support testing, so feel free
> to add options and/or add new tools as necessary.
Great idea, I will look at using that for testing, especially while
the whole pipeline isn't there yet.
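For what it's worth, the option side of that could be as small as the
following hypothetical addition to llvm-link:
--
// Hypothetical llvm-link option implementing the suggested
// -import=function:filename syntax.
static cl::list<std::string>
    Imports("import",
            cl::desc("Function to import, as <function>:<filename>"),
            cl::value_desc("function:filename"));
--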
Thanks.
>
>>>> This is again different from the existing lazy loading, which parses/materializes all necessary functions and metadata out of each module in a single pass.
>>>
>>> Right, but during "normal" lazy loading, function bodies can be deleted
>>> before they're loaded, so that they're never loaded at all. It would be
>>> awesome to delay loading metadata until we know what is actually
>>> needed/used, so it's not all leaked onto the LLVMContext.
>>
>> Ok, it seems like part of this (the DISubProgram handling) could be
>> used for lazy loading as well then. But it wouldn't need to be done as
>> a post pass, which makes a few other things easier than in the ThinLTO
>> case (fixing up temp MD).
>
>
> Awesome.
--
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413