[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

Wed Oct 28 10:38:46 PDT 2020

On Wed, Oct 28, 2020 at 6:01 AM Alexey Lapshin <avl.lapshin at gmail.com>
wrote:

>
> On 28.10.2020 01:49, David Blaikie wrote:
>
>
>
> On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <avl.lapshin at gmail.com>
> wrote:
>
>>
>> On 27.10.2020 20:32, David Blaikie wrote:
>>
>>
>>
>> On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin <avl.lapshin at gmail.com>
>> wrote:
>>
>>>
>>> On 26.10.2020 22:38, David Blaikie wrote:
>>>
>>>
>>>
>>> On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin <avl.lapshin at gmail.com>
>>> wrote:
>>>
>>>>
>>>> On 23.10.2020 19:43, David Blaikie wrote:
>>>>
>>>>
>>>>>>
>>>>>>
>>>>> Ah, yeah - that seems like a missed opportunity - duplicating the
>>>>> whole type DIE. LTO does this by making monolithic types - merging all the
>>>>> members from different definitions of the same type into one, but that's
>>>>> maybe too expensive for dsymutil (might still be interesting to know how
>>>>> much more expensive, etc). But I think the other way to go would be to
>>>>> produce a declaration of the type, with the relevant members - and let the
>>>>> DWARF consumer identify this declaration as matching up with the earlier
>>>>> definition. That's the sort of DWARF you get from the non-MachO default
>>>>> -fno-standalone-debug anyway, so it's already pretty well tested/supported
>>>>> (support in lldb's a bit younger/more work-in-progress, admittedly). I
>>>>> wonder how much dsym size there is that could be reduced by such an
>>>>> implementation.
>>>>>
>>>>> I see. Yes, that could be done and I think it would result in
>>>>> noticeable size reduction(I do not know exact numbers at the moment).
>>>>>
>>>>> I work on multi-thread DWARFLinker now and it`s first version will do
>>>>> exactly the same type processing like current dsymutil.
>>>>>
>>>> Yeah, best to keep the behavior the same through that
>>>>
>>>>> Above scheme could be implemented as a next step and it would result
>>>>> in better size reduction(better than current state).
>>>>>
>>>>> But I think the better scheme could be done also and it would result
>>>>> in even bigger size reduction and in faster execution. This scheme is
>>>>> something similar to what you`ve described above: "LTO does - making
>>>>> monolithic types - merging all the members from different definitions of
>>>>> the same type into one".
>>>>>
>>>> I believe the reason that's probably not been done is that it can't be
>>>> streamed - it'd lead to buffering more of the output
>>>>
>>>> yes. The fact that DWARF should be streamed into AsmPrinter complicates
>>>> parallel dwarf generation. In my prototype, I generate
>>>> several resulting files(each for one source compilation unit) and then
>>>> sequentially glue them into the final resulting file.
>>>>
>>> How does that help? Do you use relocations in those intermediate object
>>> files so the DWARF in them can refer across files?
>>>
>>> It does not help with referring across the file. It helps to parallel
>>> the generation of CU bodies.
>>> It is not possible to write two CUs in parallel into AsmPrinter. To make
>>> possible parallel generation I stream them into different AsmPrinters(this
>>> comment is for "I believe the reason that's probably not been done is that
>>> it can't be streamed". which initially was about referring across the file,
>>> but it seems I added another direction).
>>>
>> Oh, I see - thanks for explaining, essentially buffering on-disk.
>>
>>>
>>>> (if two of these expandable types were in one CU - the start of the
>>>> second type couldn't be known until the end because it might keep getting
>>>> pushed later due to expansion of the first type) and/or having to revisit
>>>> all the type references (the offset to the second type wouldn't be known
>>>> until the end - so writing the offsets to refer to the type would have to
>>>> be deferred until then).
>>>>
>>>> That is the second problem: offsets are not known until the end of file.
>>>> dsymutil already has that situation for inter-CU references, so it has
>>>> extra pass to
>>>> fixup offsets.
>>>>
>>> Oh, it does? I figured it was one-pass, and that it only ever refers
>>> back to types in previous CUs? So it doesn't have to go back and do a
>>> second pass. But I guess if sees a declaration of T1 in CU1, then later on
>>> sees a definition of T1 in CU2, does it somehow go back to CU1 and remove
>>> the declaration/make references refer to the definition in CU2? I figured
>>> it'd just leave the declaration and references to it as-is, then add the
>>> definition and use that from CU2 onwards?
>>>
>>> For the processing of the types, it do not go back.
>>> This "I figured it was one-pass, and that it only ever refers back to
>>> types in previous CUs"
>>> and this "I figured it'd just leave the declaration and references to it
>>> as-is, then add the definition and use that from CU2 onwards" are correct.
>>>
>> Great - thanks for explaining/confirming!
>>
>>>
>>> With multi-thread implementation such situation would arise more often
>>>> for type references and so more offsets should be fixed during
>>>> additional pass.
>>>>
>>>> DWARFLinker could create additional artificial compile unit and put all
>>>>> merged types there. Later patch all type references to point into this
>>>>> additional compilation unit.  No any bits would be duplicated in that case.
>>>>> The performance improvement could be achieved due to less amount of the
>>>>> copied DWARF and due to the fact that type references could be updated when
>>>>> DWARF is cloned(no need in additional pass for that).
>>>>>
>>>> "later patch all type references to point into this additional
>>>> compilation unit" - that's the additional pass that people are probably
>>>> talking/concerned about. Rewalking all the DWARF. The current dsymutil
>>>> approach, as far as I know, is single pass - it knows the final, absolute
>>>> offset to the type from the moment it emits that type/needs to refer to it.
>>>>
>>>> Right. Current dsymutil approach is single pass. And from that point of
>>>> view, solution
>>>> which you`ve described(to produce a declaration of the type, with the
>>>> relevant members)
>>>> allows to keep that single pass implementation.
>>>>
>>>> But there is a restriction for current dsymutil approach: To process
>>>> inter-CU references
>>>> it needs to load all DWARF into the memory(While it analyzes which part
>>>> of DWARF is live,
>>>> it needs to have all CUs loaded into the memory).
>>>>
>>> All DWARF for a single file (which for dsymutil is mostly a single CU,
>>> except with LTO I guess?), not all DWARF for all inputs in memory at once,
>>> yeah?
>>>
>>> right. In dsymutil case - all DWARF for a single file(not all DWARF for
>>> all inputs in memory at once).
>>> But in llvm-dwarfutil case single file contains DWARF for all original
>>> input object files and it all becomes
>>> loaded into memory.
>>>
>> Yeha, would be great to try to go CU-by-CU.
>>
>>> That leads to huge memory usage.
>>>> It is less important when source is a set of object files(like in
>>>> dsymutil case) and this
>>>> become a real problem for llvm-dwarfutil utility when source is a
>>>> single file(With current
>>>> implementation it needs 30G of memory for compiling clang binary).
>>>>
>>> Yeah, that's where I think you'd need a fixup pass one way or another -
>>> because cross-CU references can mean that when you figure out a new layout
>>> for CU5 (because it has a duplicate type definition of something in CU1)
>>> then you might have to touch CU4 that had an absolute/cross-CU forward
>>> reference to CU5. Once you've got such a fixup pass (if dsymutil already
>>> has one? Which, like I said, I'm confused why it would have one/that
>>> doesn't match my very vague understanding) then I think you could make
>>> dsymutil work on a per-CU basis streaming things out, then fixing up a few
>>> offsets.
>>>
>>> When dsymutil deduplicates types it changes local CU reference into
>>> inter-CU reference(so that CU2(next) could reference type definition from
>>> CU1(prev)). To do this change it does not need to do any fixups currently.
>>>
>>> When dsymutil meets already existed(located in the input object file)
>>> inter-CU reference pointing into the CU which has not been processed
>>> yet(and then its offset is unknown) it marks it as "forward reference" and
>>> patches later during additional pass "fixup forward references" at a time
>>> when offsets are known.
>>>
>> OK, so limited 2 pass system. (does it do that second pass once at the
>> end of the whole dsymutil run, or at the end of each input file? (so if an
>> input file has two CUs and the first CU references a type in the second CU
>> - it could write the first CU with a "forward reference", then write the
>> second CU, then fixup the forward reference - and then go on to the next
>> file and its CUs - this could improve performance by touching recently used
>> memory/disk pages only, rather than going all the way back to the start
>> later on when those pages have become cold)
>>
>> yes, It does it in the end of each input file.
>>
>>
>>
>>> If CUs would be processed in parallel their offsets would not be known
>>> at the moment when local type reference would be changed into inter-CU
>>> reference. So we would need to do the same fix-up processing for all
>>> references to the types like we already do for other inter-CU references.
>>>
>> Yeah - though the existence of this second "fixup forward references"
>> system - yeah, could just use it much more generally as you say. Not an
>> extra pass, just the existing second pass but having way more fixups to
>> fixup in that pass.
>>
>> If we would be able to change the algorithm in such way :
>>
>> 1. analyse all CUs.
>> 2. clone all CUs.
>>
>> Then we could create a merged type table(artificial CU containing types)
>> during step1.
>> If that type table would be written first, then all following CUs could
>> use known offsets
>> to the types and we would not need additional fix-up processing for type
>> references.
>> It would still be necessary to fix-up other inter-CU references. But it
>> would not be necessary
>> to fix-up type references (which constitute the vast majority).
>>
>
> To me, that sounds more expensive than the fixup forward references pass.
>
> If we would speak about direct comparison then yes loading DWARF one more
> time looks more expensive than fixup forward references pass. But if we
> would speak about the general picture then it could probably be beneficial:
>
> 1. merging types would lead to a smaller size of resulting DWARF. This
> would speed up the process.
>    f.e. If we would switch "odr types deduplication" off in current
> implementation then it would increase execution time two times. That is
> because more DWARF should be cloned and written in the result.
> Implementation of "merging types" would probably have a similar effect
>    - It would speed-up the overall process. So from one side additional
> step for loading DWARF would
>    decrease performance but a smaller amount of resulting data would
> increase performance.
>
> 2. When types would be put in the first CU then we would have a simple
> strategy for our liveness analysis algorithm: just always keep the first CU
> in memory. This allows us to speed up our liveness analysis step.
>
> Anyway, all the above is just an idea for future work. Currently, I am
> going to implement multithread processing for CUs loaded into memory and
> having the same type of processing as it currently is(Which assumes that
> "fixup forward references pass" started to do more work by fixing types
> references).
>
>
>
>
>>
>>
>>> Without loading all CU into the memory it would require two passes
>>>> solution. First to analyze
>>>> which part of DWARF relates to live code and then second pass to
>>>> generate the result.
>>>>
>>> Not sure it'd require any more second pass than a "fixup" pass, which it
>>> sounds like you're saying it already has?
>>>
>>> It looks like it would need an additional pass to process inter-CU
>>> references(existed in incoming file) if we do not want to load all CUs into
>>> memory.
>>>
>> Usually inter-CU references aren't used, except in LTO - and in LTO all
>> the DWARF deduplication and function discarding is already done by the IR
>> linker anyway. (ThinLTO is a bit different, but really we'd be better off
>> teaching it the extra tricks anyway (some can't be fixed in ThinLTO - like
>> emitting a "Home" definition of an inline function, only to find out other
>> ThinLTO backend/shards managed to optimize away all uses of the function...
>> so some cleanup may be useful there)). It might be possible to do a more
>> dynamic/rolling cache - keep only the CUs with unresolved cross-CU
>> references alive and only keep them alive until their cross-CU references
>> are found/marked alive. This should make things no worse than the
>> traditional dsymutil case - since cross-CU references are only
>> effective/generally used within a single object file (it's possible to
>> create relocations for them into other files - but I know LLVM doesn't
>> currently do this and I don't think GCC does it) with multiple CUs anyway -
>> so at most you'd keep all the CUs from a single original input file alive
>> together.
>>
>> But, since it is a DWARF documented case the tool should be ready for
>> such case(when inter-CU
>> references are heavily used).
>>
>
> Sure - but by implementing a CU liveness window like that (keeping CUs
> live only so long as they need to be rather than an all-or-nothing
> approach) only especially quirky inputs would hit the worst case while the
> more normal inputs could perform better.
>
> It is not clear what should be put in such CU liveness window. If CU100
> references CU1 - how could we know that we need to put CU1 into CU liveness
> window before we processed CU100?
>
Fair point, not just forward references to worry about but backward
references too. I wonder how much savings there is in the liveness analysis
compared to "keep one copy of everything, no matter whether it's live or
not", then it can be a pure forward progress situation. (with the quirk
that you might emit a declaration for an entity once, then a definition for
it later - alternatively if a declaration is seen it could be skipped under
the assumption that a definition will follow (& use a forward ref fixup) -
and if none is found, splat some stub declarations into a trailing CU at
the end)

>
>
>
>> Moreover, llvm-dwarfutil would be the tool producing
>> exactly such situation. The resulting file(produced by llvm-dwarfutil)
>> would contain a lot of
>> inter-CU references. Probably, there is no practical reasons to apply
>> llvm-dwarfutil to the same
>> file twice but it would be a good test for the tool.
>>
>
> It'd be a good stress test, but not necessarily something that would need
> to perform the best because it wouldn't be a common use case.
>
> I agree that we should not slow down the DWARFLinker in common cases only
> because we need to support the worst cases.
> But we also need to implement a solution which works in some acceptable
> manner for the worst case.
>
I think that depends on "acceptable" - correct, yes. Practical to run in
reasonable time/memory? Not necessarily, in my opinion.

> The current solution - loading everything in memory - makes it hard to use
> in a non-dsymutil scenario(llvm-dwarfutil).
>
I agree it's worth exploring the non-dsymutil scenario, as you are - I'm
just saying we don't necessarily need to support high usability (fast/low
memory usage/etc) llvm-dwarfutil on an already dwarfutil'd binary (but as
you've pointed out, the "window" is unknowable because of backward
references, so this whole subthread is perhaps irrelevant).

> There could be several things which could be used to decide whether we
> need to go on a light or heavy path:
>
> 1. If the input contains only a single CU we do not need to unload it from
> memory. Thus - we would not need to do an extra DWARF loading pass.
> 2. If abbreviations from the whole input file do not contain inter-CU
> references then while doing liveness analysis,  we do not need to wait
> until other CUs are processed.
>
(2) Yeah, that /may/ be a good idea, cheap to test, etc. Though I'd still
wonder if a more general implementation strategy could be found that would
make it easier to get a sliding scale of efficiency depending on how much
inter-CU references where were, not a "if there are none it's good, if
there are any it's bad or otherwise very different to implement".

>
> Then that scheme would be used for worst cases:
>
> 1. for (CU : CU1...CU100) {
>      load CU.
>      analyse CU.
>      unload CU.
>    }
> 2. for (CU : CU1...CU100) {
>      load CU.
>      clone CU.
>      unload CU.
>    }
> 3. fixup forward references.
>
> and that scheme for light cases:
>
> 1. for (CU : CU1...CU100) {
>      load CU.
>      analyse CU.
>      clone CU.
>      unload CU.
>    }
> 2. fixup forward references.
>
>
>
>> Generally, I think we should not assume that inter-CU references would be
>> used in a limited way.
>>
>> Anyway, if this scheme:
>>
>> 1. analyse all CUs.
>> 2. clone all CUs.
>>
>> would work slow then we would need to continue with one-pass solution and
>> not support complex closely coupled inputs.
>>
>
> yeah, certainly seeing the data/experiments will be interesting, if you
> end up implementing some different strategies, etc.
>
> I guess one possibility for parallel generation could be something more
> like Microsoft's approach with a central debug info server that compilers
> communicate with - not that exact model, I mean, but if you've got parallel
> threads generating reduced DWARF into separate object files - they could
> communicate with a single thread responsible for type emission - the type
> emitter would be given types from the separate threads and compute their
> size, queue them up to be streamed out to the type CU (& keep the source CU
> alive until that work was done) - such a central type emitter could quickly
> determine the size of the type to be emitted and compute future type
> offsets (eg: if 5 types were in the queue, it could've figured out the
> offset of those types already) to answer type offset queries quickly and
> unblock the parallel threads to continue emitting their CUs containing type
> references.
>
> yes. Thank you. Would think about it.
>
> Alexey.
>
>
> - Dave
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20201028/c400d7e3/attachment.html>