[PATCH] D11722: [ThinLTO] Bitcode reading/writing support for ThinLTO function summary/index

Teresa Johnson via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 24 07:02:06 PDT 2015


On Sun, Aug 23, 2015 at 7:42 PM, Teresa Johnson <tejohnson at google.com> wrote:
> On Wed, Aug 19, 2015 at 11:51 AM, Teresa Johnson <tejohnson at google.com> wrote:
>> On Wed, Aug 19, 2015 at 10:12 AM, Duncan P. N. Exon Smith
>> <dexonsmith at apple.com> wrote:
>>>
>>>> On 2015-Aug-19, at 09:07, Teresa Johnson <tejohnson at google.com> wrote:
>>>>
>>>> On Tue, Aug 18, 2015 at 7:58 PM, Duncan P. N. Exon Smith
>>>> <dexonsmith at apple.com> wrote:
>>>>>
>>>>>> On 2015-Aug-13, at 10:22, Teresa Johnson <tejohnson at google.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 9:42 AM, Duncan P. N. Exon Smith <dexonsmith at apple.com> wrote:
>>>>>>
>>>>>>> On 2015-Aug-12, at 21:23, Teresa Johnson <tejohnson at google.com> wrote:
>>>>>>>
>>>>>>> tejohnson updated the summary for this revision.
>>>>>>> tejohnson updated this revision to Diff 32027.
>>>>>>> tejohnson added a comment.
>>>>>>>
>>>>>>> Removed native object wrapper support from this patch.
>>>>>>>
>>>>>>>
>>>>>>> http://reviews.llvm.org/D11722
>>>>>>>
>>>>>>> Files:
>>>>>>> include/llvm/Bitcode/LLVMBitCodes.h
>>>>>>> include/llvm/Bitcode/ReaderWriter.h
>>>>>>> lib/Bitcode/Reader/BitcodeReader.cpp
>>>>>>> lib/Bitcode/Writer/BitcodeWriter.cpp
>>>>>>> tools/llvm-bcanalyzer/llvm-bcanalyzer.cpp
>>>>>>>
>>>>>>> <D11722.32027.patch>
>>>>>>
>>>>>> (Sorry I'm so late getting to this patch.)
>>>>>>
>>>>>> Hi Duncan,
>>>>>> Thanks for the comments.
>>>>>>
>>>>>>
>>>>>> I'm concerned about the direction.  IIUC, the purpose of this is faster
>>>>>> lazy-loading.  Why do this in a side-channel?  Why not optimize the
>>>>>> existing lazy-loader?  I imagine the LTO model implemented by the gold
>>>>>> plugin would benefit from this, for example.  If we need an extra index
>>>>>> to make lazy-loading "fast enough", it seems like the index should be
>>>>>> available (as an opt-in for time/space tradeoff) for other consumers as
>>>>>> well.  Alternatively, maybe there's something we can change about how
>>>>>> the existing lazy-loader works (or the layout of bitcode) to be more
>>>>>> amenable to the partial loading ThinLTO needs.
>>>>>>
>>>>>> Only one part of the index (the function's bitcode offset) is used to do the fast loading of the function from a given module. Much of the other information in the ThinLTO sections serves to identify/locate the module to import from (via the module strtab index, which is populated when creating the combined (global) index), along with information used in making importing decisions about functions from other modules (such as a function's size, or its hotness when profile data is available), which we don't have since we haven't yet parsed that other module.
>>>>>>
>>>>>> The other issue with using the existing lazy loading support is that we may import from a number of other modules in some interleaved fashion, so we may open a module, import a function, and close the module multiple times. My understanding is that the lazy loading is more for the case of loading a few different functions in sequence.
>>>>>>
>>>>>> So for example, if we have:
>>>>>>
>>>>>> A.cc:
>>>>>>
>>>>>> a() {
>>>>>>   b1();
>>>>>>   ...
>>>>>>   c1();
>>>>>>   ...
>>>>>>   b2();
>>>>>>   ...
>>>>>>   c2();
>>>>>> }
>>>>>>
>>>>>> B.cc:
>>>>>>
>>>>>> b1() { d1(); }
>>>>>> b2() { ... }
>>>>>>
>>>>>> C.cc:
>>>>>>
>>>>>> c1() { d2(); }
>>>>>> c2() { ... }
>>>>>>
>>>>>> D.cc:
>>>>>>
>>>>>> d1() { ... }
>>>>>> d2() { ... }
>>>>>>
>>>>>>
>>>>>> When compiling A.cc through the parallel backend's importing pass, we may have the following sequence of events:
>>>>>>
>>>>>> Consider importing b1  -> Decide to import b1  (exposes call to d1)
>>>>>> Consider importing d1  -> Decide to import d1
>>>>>> Consider importing c1  -> Decide to import c1  (exposes call to d2)
>>>>>> Consider importing d2  -> Decide to import d2
>>>>>> Consider importing b2  -> Decide *not* to import b2
>>>>>> Consider importing c2  -> Decide to import c2
>>>>>>
>>>>>> For each of the inter-module calls considered in some priority-based order, we would look up the callee in the combined index, which provides some information such as the callee's size, hotness, etc. A decision on whether to import is made based on that information (without doing any parsing of the callee's module). If we decide to import, then the combined index entry for the callee provides the callee module path and the bitcode index, which is used to open the module and import just that function, then the callee's module is closed. Some of the lazy loading support could potentially be used in place of the bitcode index, although I am not convinced it is suited for that purpose. In any case, that is just a small part of what is contained in the function summary/index as described earlier.
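To make that decision flow concrete, here is a minimal Python sketch of summary-driven importing. The index layout, field names, and thresholds are all invented for illustration; they are not the actual LLVM ThinLTO data structures:

```python
# Hypothetical combined index: the per-callee summary data is gathered when
# the combined index is built, so no callee module is parsed to decide.
IMPORT_INST_LIMIT = 100  # assumed instruction-count threshold for cold callees

combined_index = {
    "b1": {"module": "B.o", "offset": 1024, "inst_count": 8,   "hot": True},
    "b2": {"module": "B.o", "offset": 2048, "inst_count": 500, "hot": False},
}

def should_import(callee):
    """Decide from summary data alone; allow a larger budget for hot callees."""
    s = combined_index.get(callee)
    if s is None:
        return False  # no summary entry -> cannot import
    limit = IMPORT_INST_LIMIT * 4 if s["hot"] else IMPORT_INST_LIMIT
    return s["inst_count"] <= limit

def import_location(callee):
    """Once we decide to import, the summary says where to read the body from."""
    s = combined_index[callee]
    return s["module"], s["offset"]
```

The point of the sketch is that both the decision and the subsequent open-module/seek-to-offset step are driven entirely by the combined index, matching the interleaved import sequence described above.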
>>>>>>
>>>>>> On the flip side, the bitcode index could be used by the lazy loader, in place of building the index on the fly. I'm not sure how much benefit there is to lazy loading, which eventually needs to read the entire module once a function is materialized.
>>>>>
>>>>> It seems generally useful to be able to cherry-pick functions out of a
>>>>> large bitcode file without scanning the whole thing.  It's really cool
>>>>> that you can teardown and buildup the BitcodeReader, swapping different
>>>>> modules in and out.
>>>>>
>>>>> I don't like this being tied to ThinLTO, since I think it's useful for
>>>>> other consumers as well.  To start, the old/normal lazy loader should
>>>>> use the index when it's available.  I'm sure other people will think of
>>>>> other crazy uses.
>>>>
>>>> I've been thinking this morning about the best way to leverage the
>>>> ThinLTO index in the existing lazy loader and have some ideas. But I
>>>> think that just applies to using the bitcode offset of the function
>>>> body block.
>>>
>>> That might be it for now (besides metadata, addressed below).
>>>
>>>> Do you think that it is interesting to put the rest of the summary
>>>> information I am going to use for importing decisions like the
>>>> instruction count, profile hotness, other function characteristics
>>>> (queried from other modules, via the combined index) in the bitcode in
>>>> a non-ThinLTO-specific way? Perhaps they could still be optional and
>>>> currently only produced with -fthinlto, but the block/record names not
>>>> contain "ThinLTO".
>>>
>>> That could work.  I imagine other LTO use cases (e.g., linker-driven
>>> partitioning) would also want the profile information.
>>
>> Here's what I am currently thinking:
>>
>> In order to make the function's bitcode index usable for normal lazy
>> function loading, move it out of the function summary block and into
>> the value symbol table (VST). The current VST code entry records
>> contain the valueid and value name string. This would be expanded to
>> be valueid, bitcode offset, value name string. The VST is positioned
>> in the bitcode right before the function blocks, and the
>> DeferredFunctionInfo map can be populated directly from these bitcode
>> offsets. Since the VST reading locates the GVs for each entry, I think
>> the existing FunctionsWithBodies used for lazy reading could go, along
>> with rememberAndSkipFunctionBody and findFunctionInStream.
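A small sketch of how the expanded VST entries could feed lazy loading directly (the record tuples here are hypothetical, not the real bitcode record layout):

```python
# Each expanded VST record is modeled as (valueid, bitcode offset, name)
# instead of the current (valueid, name). A lazy reader can populate a
# DeferredFunctionInfo-style map straight from the table, with no forward
# scan for function bodies.
def read_vst(records):
    names, deferred = {}, {}
    for valueid, offset, name in records:
        names[valueid] = name
        deferred[name] = offset  # body offset, ready for on-demand reads
    return names, deferred
```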
>
> I was working on emitting records in this format and ran into a
> phase-ordering issue. I have a solution but I am not completely happy
> with it.
>
> The issue is (obvious to me in hindsight) that when we emit the VST we
> have not yet emitted the function blocks and therefore don't yet know
> the function bitcode offsets. In my earlier design where this was
> emitted in the ThinLTO specific sections after the function blocks we
> of course already had the function block offsets. For ThinLTO it is
> fine to have the function bitcode offsets later in the module since
> they aren't used by the same module (only used out of the combined
> index). But in order to use these offsets for lazy function reading in
> the same module, they need to be available before the function blocks,
> such as in the VST.
>
> I looked at how the function block offsets might be backpatched once
> they are known, such as the way the block sizes are backpatched into
> the ENTER_SUBBLOCK abbrev records, but there are a couple of issues:
>
> 1) Bitcode offset representation:
>
> Backpatching in general precludes using VBR to represent the bitcode
> offset, since we don't know how many chunks are needed to encode the
> yet-unknown bitcode offset. That means we need to use a fixed width
> field. It looks like we use a uint64_t to hold the bitcode offset
> while encoding/decoding, so presumably we need a 64-bit fixed width
> field for each function in the VST. This is a lot less efficient than
> a VBR for most function offsets.
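The size argument can be made concrete with a small sketch of the VBR chunk math (LLVM-style VBR, where each chunk reserves one continuation bit; the helper is illustrative, not LLVM's actual encoder):

```python
def vbr_bits(value, chunk_width=6):
    """Bits consumed by a VBR-encoded value: each chunk carries
    chunk_width - 1 payload bits plus one continuation bit."""
    payload = chunk_width - 1
    chunks = 1
    value >>= payload
    while value:
        chunks += 1
        value >>= payload
    return chunks * chunk_width
```

For a typical function offset under 2^20, VBR6 needs only 24 bits versus a 64-bit fixed-width field, which is the inefficiency described above; the problem is that the chunk count depends on the not-yet-known value, so a backpatched field cannot use VBR.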
>
> 2) Backpatch mechanics:
>
> Even using a fixed width field, it is tricky to backpatch. Unlike the
> block size backpatched into the ENTER_SUBBLOCK, we may not be
> backpatching at a byte boundary, and the VST bit offset to backpatch
> into depends on the width of the abbrev id and any fields such as the
> code that are written before we write the bitcode offset in the VST
> abbrev record. Not to say it can't be computed and that I can't add a
> facility to backpatch at a non-byte boundary, but it seems like it
> could be hacky.
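The non-byte-boundary backpatch is mechanically doable, as this minimal sketch of patching a fixed-width field at an arbitrary bit offset shows (LSB-first bit layout assumed; illustrative only, not the BitstreamWriter API):

```python
def backpatch_bits(buf, bit_offset, value, width):
    """Overwrite `width` bits of bytearray `buf` starting at `bit_offset`
    with `value`, LSB-first. Clears as well as sets, so a zero placeholder
    or stale bits are both handled."""
    for i in range(width):
        byte, bit = divmod(bit_offset + i, 8)
        mask = 1 << bit
        if (value >> i) & 1:
            buf[byte] |= mask
        else:
            buf[byte] &= ~mask
```

Even so, the caller still has to compute the field's exact bit position from the abbrev ID width and any preceding record fields, which is the fragile part noted above.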
>
>
> Here's what I am doing right now, that works, but I am not sure this
> is the best approach. I initially write the module level VST into the
> output stream with the 64-bit bitcode offset fields set to 0. After
> writing all the functions (which save their bitcode offsets), I then
> rewrite the entire VST with the bitcode offsets filled in. Even this
> is not simple, as writing to the bitstream typically involves appending
> to the output buffer. So I create a second (temporary) stream, do some
> basic setup to get the abbrev ids set up (not written, just known to
> the new bitstream writer), etc, and write the new VST into that. Then
> I do a word by word copy into the original stream, overwriting the old
> VST, using BackpatchWord.
>
> This does the trick, and was the approach that seemed the cleanest.
> But I am happy to entertain other ideas!

Another possibility, which allows use of VBR to encode the function
block bitcode offset:

Write the functions to a temporary bitstream writer before writing the
VST, recording a map from function to bitcode offset (in the (ThinLTO)
Index data structure). The VST then uses these saved bitcode offsets
while writing the VST records. Then the temp function stream is copied
to the main stream where the function blocks are normally written
(similar to the temp stream/copying I describe above for the VST with
> the other approach I am currently using). This means that the lazy
> reader needs to change a bit, since the offsets are no longer offsets
> into the whole bitcode file, but rather offsets into the function
> blocks. But the lazy reader knows where the function blocks start
(since that's where it stops parsing greedily) and could save that
initial function block offset to add to the offsets in the index for
each function.
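A rough Python model of this temp-stream scheme (the stream layout and record encoding are invented purely for illustration, not the real bitcode format):

```python
def write_module(functions):
    """Write bodies to a temp stream first, so offsets relative to the start
    of the function blocks are known before the VST is emitted."""
    temp, offsets = bytearray(), {}
    for name, body in functions:
        offsets[name] = len(temp)          # relative offset, VBR-friendly
        temp += body
    vst = bytearray()
    for name, off in offsets.items():      # VST records carry saved offsets
        vst += f"{name}={off};".encode()
    header = b"HDR|"
    stream = header + vst + b"|" + bytes(temp)  # temp copied after the VST
    func_blocks_start = len(header) + len(vst) + 1
    return stream, offsets, func_blocks_start

def function_bytes(stream, func_blocks_start, offsets, name, size):
    """Lazy reader: absolute position = function-blocks start + rel offset."""
    pos = func_blocks_start + offsets[name]
    return stream[pos:pos + size]
```

The reader-side change is just the one addition in `function_bytes`: it already knows where the function blocks begin (where greedy parsing stops), so relative offsets cost it nothing.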

This seems like a better approach from a bitcode size perspective
since it allows use of VBR for the function offsets. Let me know what
you think and if there is yet another better way to handle this.

Thanks,
Teresa

>
> Thanks,
> Teresa
>
>
>>
>> For ThinLTO, instead of needing the importer to pass in both the
>> function to be imported and its bitcode offset, we only need the
>> function as ThinLTO also needs to read the VST for the module being
>> imported from, and would get the bitcode offset that way. It then has
>> the same behavior as lazy function reading, but only materializing the
>> single function of interest.
>>
>> We still need the function summary data for importing decisions (was
>> THINLTO_FUNCTION_SUMMARY_BLOCK), but it can be renamed to the more
>> generic FUNCTION_SUMMARY_BLOCK. And it no longer contains the
>> function's bitcode offset (just the index into the module path string
>> table and the summary data).
>>
>> We also still need the module path string table (only shows up in the
>> combined index though, not in the per-module index). Still seems
>> specific to ThinLTO, but we could be consistent and rename to
>> MODULE_STRTAB_BLOCK (i.e. strip off the leading THINLTO_).
>>
>> Finally, we still need a way to associate each VST entry with the
>> corresponding function summary block offset. This was in the
>> THINLTO_SYMTAB_BLOCK. If we are making all of this non-ThinLTO
>> specific, I think it makes the most sense to merge this info into the
>> VST, but as separate records that are only enabled currently under
>> -fthinlto. E.g. the new optional VST records would be something like
>> VST_SUMMARY_ENTRY and map from the corresponding VST_CODE_ENTRY's
>> valueid to the summary block's offset for that function. Alternatively
>> this could be merged with the VST_CODE_ENTRY, but I didn't want to
>> bloat out those records by default.
>>
>> And the outer THINLTO_BLOCK encapsulating all the old ThinLTO blocks goes away.
>>
>
>
>
> --
> Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413



-- 
Teresa Johnson | Software Engineer | tejohnson at google.com | 408-460-2413

