[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Sun Nov 1 17:11:15 PST 2020


I think if we're in the realm of DWARF extensions a whole bunch of other
considerations come into it (& indeed, your suggested proposal may be a
good one - but I think it's a very wide problem space once we're
considering DWARF extensions). Mostly I was making
arguments/suggestions/thoughts on the basis of being compatible with all
existing DWARF producers.

On Sun, Nov 1, 2020 at 2:05 PM Alexey Lapshin <avl.lapshin at gmail.com> wrote:

>
> On 28.10.2020 20:38, David Blaikie wrote:
>
>
>
> On Wed, Oct 28, 2020 at 6:01 AM Alexey Lapshin <avl.lapshin at gmail.com>
> wrote:
>
>>
>> On 28.10.2020 01:49, David Blaikie wrote:
>>
>>
>>
>> On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <avl.lapshin at gmail.com>
>> wrote:
>>
>>>
>>> On 27.10.2020 20:32, David Blaikie wrote:
>>>
>>>
>>>
>>> On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin <avl.lapshin at gmail.com>
>>> wrote:
>>>
>>>>
>>>> On 26.10.2020 22:38, David Blaikie wrote:
>>>>
>>>>
>>>>
>>>> On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin <avl.lapshin at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On 23.10.2020 19:43, David Blaikie wrote:
>>>>>
>>>>>
>>>>>>>
>>>>>>>
>>>>>> Ah, yeah - that seems like a missed opportunity - duplicating the
>>>>>> whole type DIE. LTO does this by making monolithic types - merging all the
>>>>>> members from different definitions of the same type into one, but that's
>>>>>> maybe too expensive for dsymutil (might still be interesting to know how
>>>>>> much more expensive, etc). But I think the other way to go would be to
>>>>>> produce a declaration of the type, with the relevant members - and let the
>>>>>> DWARF consumer identify this declaration as matching up with the earlier
>>>>>> definition. That's the sort of DWARF you get from the non-MachO default
>>>>>> -fno-standalone-debug anyway, so it's already pretty well tested/supported
>>>>>> (support in lldb's a bit younger/more work-in-progress, admittedly). I
>>>>>> wonder how much dsym size there is that could be reduced by such an
>>>>>> implementation.
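>>>>>> For illustration, the deduplicated output could look roughly like
>>>>>> this (llvm-dwarfdump-style notation, heavily simplified): the first
>>>>>> CU keeps the full definition, and later CUs get a declaration with
>>>>>> only the members actually referenced, which the consumer matches up
>>>>>> with the definition by name:
>>>>>>
>>>>>>   CU1: DW_TAG_structure_type
>>>>>>          DW_AT_name        ("Foo")
>>>>>>          ... full set of member DIEs ...
>>>>>>   CU2: DW_TAG_structure_type
>>>>>>          DW_AT_name        ("Foo")
>>>>>>          DW_AT_declaration (true)
>>>>>>          DW_TAG_member       // only the members referenced from CU2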
>>>>>>
>>>>>> I see. Yes, that could be done, and I think it would result in a
>>>>>> noticeable size reduction (I do not know exact numbers at the moment).
>>>>>>
>>>>>> I am working on the multi-thread DWARFLinker now, and its first
>>>>>> version will do exactly the same type processing as the current dsymutil.
>>>>>>
>>>>> Yeah, best to keep the behavior the same through that
>>>>>
>>>>>> The above scheme could be implemented as a next step, and it would
>>>>>> result in a better size reduction (better than the current state).
>>>>>>
>>>>>> But I think an even better scheme could be done, one which would
>>>>>> result in a bigger size reduction and faster execution. This scheme is
>>>>>> similar to what you described above: "LTO does - making monolithic
>>>>>> types - merging all the members from different definitions of the same
>>>>>> type into one".
>>>>>>
>>>>> I believe the reason that's probably not been done is that it can't be
>>>>> streamed - it'd lead to buffering more of the output
>>>>>
>>>>> Yes. The fact that DWARF should be streamed into AsmPrinter
>>>>> complicates parallel DWARF generation. In my prototype, I generate
>>>>> several resulting files (one per source compilation unit) and then
>>>>> sequentially glue them into the final resulting file.
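>>>>>
>>>>> A minimal sketch of that idea (all names here are hypothetical, not
>>>>> the actual DWARFLinker API): each worker clones one CU through its
>>>>> own stream, and the results are concatenated in CU order afterwards:
>>>>>
>>>>>   #include <string>
>>>>>   #include <thread>
>>>>>   #include <vector>
>>>>>
>>>>>   struct CompileUnit {}; // stand-in for a parsed input CU
>>>>>
>>>>>   // Hypothetical helper: clones one CU through its own
>>>>>   // AsmPrinter-backed stream; here it just returns placeholder bytes.
>>>>>   std::string cloneCU(const CompileUnit &) { return "<CU bytes>"; }
>>>>>
>>>>>   std::string linkAllCUs(const std::vector<CompileUnit> &CUs) {
>>>>>     std::vector<std::string> Chunks(CUs.size());
>>>>>     std::vector<std::thread> Workers;
>>>>>     for (size_t I = 0; I < CUs.size(); ++I)
>>>>>       Workers.emplace_back(
>>>>>           [&Chunks, &CUs, I] { Chunks[I] = cloneCU(CUs[I]); });
>>>>>     for (std::thread &W : Workers)
>>>>>       W.join();
>>>>>     std::string Result; // "glue" the per-CU outputs sequentially
>>>>>     for (const std::string &C : Chunks)
>>>>>       Result += C;
>>>>>     return Result;
>>>>>   }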
>>>>>
>>>> How does that help? Do you use relocations in those intermediate object
>>>> files so the DWARF in them can refer across files?
>>>>
>>>> It does not help with referring across files. It helps to parallelize
>>>> the generation of CU bodies.
>>>> It is not possible to write two CUs into one AsmPrinter in parallel. To
>>>> make parallel generation possible, I stream them into different
>>>> AsmPrinters (this comment was for "I believe the reason that's probably
>>>> not been done is that it can't be streamed", which initially was about
>>>> referring across files, but it seems I added another direction).
>>>>
>>> Oh, I see - thanks for explaining, essentially buffering on-disk.
>>>
>>>>
>>>>> (if two of these expandable types were in one CU - the start of the
>>>>> second type couldn't be known until the end because it might keep getting
>>>>> pushed later due to expansion of the first type) and/or having to revisit
>>>>> all the type references (the offset to the second type wouldn't be known
>>>>> until the end - so writing the offsets to refer to the type would have to
>>>>> be deferred until then).
>>>>>
>>>>> That is the second problem: offsets are not known until the end of the
>>>>> file.
>>>>> dsymutil already has that situation for inter-CU references, so it has
>>>>> an extra pass to fix up offsets.
>>>>>
>>>> Oh, it does? I figured it was one-pass, and that it only ever refers
>>>> back to types in previous CUs? So it doesn't have to go back and do a
>>>> second pass. But I guess if it sees a declaration of T1 in CU1, then
>>>> later on sees a definition of T1 in CU2, does it somehow go back to CU1
>>>> and remove the declaration/make references refer to the definition in
>>>> CU2? I figured it'd just leave the declaration and references to it
>>>> as-is, then add the definition and use that from CU2 onwards?
>>>>
>>>> For the processing of the types, it does not go back.
>>>> This "I figured it was one-pass, and that it only ever refers back to
>>>> types in previous CUs"
>>>> and this "I figured it'd just leave the declaration and references to
>>>> it as-is, then add the definition and use that from CU2 onwards" are
>>>> correct.
>>>>
>>> Great - thanks for explaining/confirming!
>>>
>>>>
>>>>> With the multi-thread implementation, such a situation would arise
>>>>> more often for type references, and so more offsets would need to be
>>>>> fixed during the additional pass.
>>>>>
>>>>>> DWARFLinker could create an additional artificial compile unit and
>>>>>> put all merged types there, later patching all type references to
>>>>>> point into this additional compilation unit. No bits would be
>>>>>> duplicated in that case. A performance improvement could be achieved
>>>>>> due to the smaller amount of copied DWARF and due to the fact that
>>>>>> type references could be updated while the DWARF is cloned (no
>>>>>> additional pass needed for that).
>>>>>>
>>>>> "later patch all type references to point into this additional
>>>>> compilation unit" - that's the additional pass that people are probably
>>>>> talking/concerned about. Rewalking all the DWARF. The current dsymutil
>>>>> approach, as far as I know, is single pass - it knows the final, absolute
>>>>> offset to the type from the moment it emits that type/needs to refer to it.
>>>>>
>>>>> Right. The current dsymutil approach is single-pass. And from that
>>>>> point of view, the solution you described (producing a declaration of
>>>>> the type with the relevant members)
>>>>> allows keeping that single-pass implementation.
>>>>>
>>>>> But there is a restriction in the current dsymutil approach: to
>>>>> process inter-CU references
>>>>> it needs to load all DWARF into memory (while it analyzes which part
>>>>> of the DWARF is live,
>>>>> it needs to have all CUs loaded into memory).
>>>>>
>>>> All DWARF for a single file (which for dsymutil is mostly a single CU,
>>>> except with LTO I guess?), not all DWARF for all inputs in memory at once,
>>>> yeah?
>>>>
>>>> Right. In the dsymutil case - all DWARF for a single file (not all
>>>> DWARF for all inputs in memory at once).
>>>> But in the llvm-dwarfutil case a single file contains the DWARF for all
>>>> the original input object files, and it all gets
>>>> loaded into memory.
>>>>
>>> Yeah, it would be great to try to go CU-by-CU.
>>>
>>>>> That leads to huge memory usage.
>>>>> It is less important when the source is a set of object files (as in
>>>>> the dsymutil case), but it
>>>>> becomes a real problem for the llvm-dwarfutil utility when the source
>>>>> is a single file (with the current
>>>>> implementation it needs 30G of memory for the clang binary).
>>>>>
>>>> Yeah, that's where I think you'd need a fixup pass one way or another -
>>>> because cross-CU references can mean that when you figure out a new layout
>>>> for CU5 (because it has a duplicate type definition of something in CU1)
>>>> then you might have to touch CU4 that had an absolute/cross-CU forward
>>>> reference to CU5. Once you've got such a fixup pass (if dsymutil already
>>>> has one? Which, like I said, I'm confused why it would have one/that
>>>> doesn't match my very vague understanding) then I think you could make
>>>> dsymutil work on a per-CU basis streaming things out, then fixing up a few
>>>> offsets.
>>>>
>>>> When dsymutil deduplicates types, it changes a local CU reference into
>>>> an inter-CU reference (so that CU2 (next) can reference the type
>>>> definition from CU1 (prev)). To make this change it does not currently
>>>> need to do any fixups.
>>>>
>>>> When dsymutil meets an already existing inter-CU reference (located in
>>>> the input object file) pointing into a CU which has not been processed
>>>> yet (so its offset is unknown), it marks it as a "forward reference"
>>>> and patches it later, during the additional "fixup forward references"
>>>> pass, at a time when the offsets are known.
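>>>>
>>>> To make that mechanism concrete, a small sketch of what such
>>>> bookkeeping could look like (names are illustrative, not the actual
>>>> DWARFLinker API):
>>>>
>>>>   #include <cstdint>
>>>>   #include <vector>
>>>>
>>>>   struct ForwardRef {
>>>>     uint64_t PatchOffset; // where the reference bytes live in output
>>>>     unsigned TargetCU;    // CU whose final offset is not yet known
>>>>     uint64_t OffsetInCU;  // DIE offset relative to target CU start
>>>>   };
>>>>
>>>>   // Runs once all CU offsets are known; rewrites each recorded
>>>>   // reference as a 4-byte section-relative (DW_FORM_ref_addr style)
>>>>   // value, patched in place, little-endian.
>>>>   void fixupForwardReferences(std::vector<uint8_t> &Out,
>>>>                               const std::vector<ForwardRef> &Refs,
>>>>                               const std::vector<uint64_t> &CUStarts) {
>>>>     for (const ForwardRef &R : Refs) {
>>>>       uint64_t Final = CUStarts[R.TargetCU] + R.OffsetInCU;
>>>>       for (int B = 0; B < 4; ++B)
>>>>         Out[R.PatchOffset + B] = uint8_t(Final >> (8 * B));
>>>>     }
>>>>   }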
>>>>
>>> OK, so a limited 2-pass system. (Does it do that second pass once at
>>> the end of the whole dsymutil run, or at the end of each input file? So
>>> if an input file has two CUs and the first CU references a type in the
>>> second CU - it could write the first CU with a "forward reference", then
>>> write the second CU, then fix up the forward reference - and then go on
>>> to the next file and its CUs. This could improve performance by touching
>>> recently used memory/disk pages only, rather than going all the way back
>>> to the start later on when those pages have become cold.)
>>>
>>> Yes, it does it at the end of each input file.
>>>
>>>
>>>
>>>> If CUs were processed in parallel, their offsets would not be known at
>>>> the moment when a local type reference is changed into an inter-CU
>>>> reference. So we would need to do the same fix-up processing for all
>>>> references to types as we already do for other inter-CU references.
>>>>
>>> Yeah - though given the existence of this second "fixup forward
>>> references" system - yeah, it could just be used much more generally as
>>> you say. Not an extra pass, just the existing second pass but with way
>>> more fixups to fix up in that pass.
>>>
>>> If we were able to change the algorithm in this way:
>>>
>>> 1. analyse all CUs.
>>> 2. clone all CUs.
>>>
>>> Then we could create a merged type table (an artificial CU containing
>>> the types) during step 1.
>>> If that type table were written first, then all following CUs could use
>>> known offsets
>>> to the types, and we would not need additional fix-up processing for
>>> type references.
>>> It would still be necessary to fix up other inter-CU references, but not
>>> type references (which constitute the vast majority).
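>>>
>>> A compilable toy model of step 1 (hypothetical names): merge each
>>> ODR-unique type once and assign it a final offset inside the artificial
>>> types CU, so that step 2 can emit type references directly:
>>>
>>>   #include <cstdint>
>>>   #include <map>
>>>   #include <string>
>>>   #include <vector>
>>>
>>>   struct TypeDesc { std::string Name; uint64_t DIESize; };
>>>   struct CompileUnit { std::vector<TypeDesc> Types; };
>>>
>>>   std::map<std::string, uint64_t>
>>>   buildMergedTypeTable(const std::vector<CompileUnit> &CUs) {
>>>     std::map<std::string, uint64_t> Offsets;
>>>     uint64_t Next = 0; // running offset inside the artificial types CU
>>>     for (const CompileUnit &CU : CUs)
>>>       for (const TypeDesc &T : CU.Types)
>>>         if (Offsets.emplace(T.Name, Next).second)
>>>           Next += T.DIESize; // first definition wins; duplicates merged
>>>     return Offsets;
>>>   }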
>>>
>>
>> To me, that sounds more expensive than the fixup forward references pass.
>>
>> In a direct comparison, yes, loading the DWARF one more time looks more
>> expensive than the fixup forward references pass. But looking at the
>> general picture, it could still be beneficial:
>>
>> 1. Merging types would lead to a smaller resulting DWARF, which would
>> speed up the process. E.g. if we switched "ODR type deduplication" off in
>> the current implementation, execution time would double, because more
>> DWARF would have to be cloned and written to the result. Implementing
>> "merging types" would probably have a similar effect - it would speed up
>> the overall process. So on the one hand the additional DWARF-loading step
>> would decrease performance, but the smaller amount of resulting data
>> would increase performance.
>>
>> 2. With the types put in the first CU, we would have a simple strategy
>> for our liveness analysis algorithm: just always keep the first CU in
>> memory. That would speed up the liveness analysis step.
>>
>> Anyway, all the above is just an idea for future work. Currently, I am
>> going to implement multi-threaded processing for CUs loaded into memory,
>> with the same type processing as there is now (which assumes that the
>> "fixup forward references" pass starts doing more work, by also fixing
>> type references).
>>
>>
>>
>>
>>>
>>>
>>>>> Without loading all CUs into memory, it would require a two-pass
>>>>> solution: a first pass to analyze
>>>>> which part of the DWARF relates to live code, and then a second pass
>>>>> to generate the result.
>>>>>
>>>> Not sure it'd require any more second pass than a "fixup" pass, which
>>>> it sounds like you're saying it already has?
>>>>
>>>> It looks like it would need an additional pass to process inter-CU
>>>> references (existing in the incoming file) if we do not want to load
>>>> all CUs into memory.
>>>>
>>> Usually inter-CU references aren't used, except in LTO - and in LTO all
>>> the DWARF deduplication and function discarding is already done by the IR
>>> linker anyway. (ThinLTO is a bit different, but really we'd be better off
>>> teaching it the extra tricks anyway (some can't be fixed in ThinLTO - like
>>> emitting a "Home" definition of an inline function, only to find out other
>>> ThinLTO backend/shards managed to optimize away all uses of the function...
>>> so some cleanup may be useful there)). It might be possible to do a more
>>> dynamic/rolling cache - keep only the CUs with unresolved cross-CU
>>> references alive and only keep them alive until their cross-CU references
>>> are found/marked alive. This should make things no worse than the
>>> traditional dsymutil case - since cross-CU references are only
>>> effective/generally used within a single object file (it's possible to
>>> create relocations for them into other files - but I know LLVM doesn't
>>> currently do this and I don't think GCC does it) with multiple CUs anyway -
>>> so at most you'd keep all the CUs from a single original input file alive
>>> together.
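>>>
>>> A sketch of such a rolling cache, as a hypothetical structure: a CU
>>> stays loaded only while unresolved cross-CU references still point into
>>> it, and is unloaded as soon as the last one is resolved:
>>>
>>>   #include <map>
>>>   #include <memory>
>>>
>>>   struct LoadedCU {}; // stand-in for the parsed DIEs of one CU
>>>
>>>   struct RollingCUCache {
>>>     std::map<unsigned, std::unique_ptr<LoadedCU>> Live; // index -> CU
>>>     std::map<unsigned, unsigned> PendingRefs; // unresolved refs per CU
>>>
>>>     void noteRefResolved(unsigned TargetCU) {
>>>       auto It = PendingRefs.find(TargetCU);
>>>       if (It != PendingRefs.end() && --It->second == 0) {
>>>         PendingRefs.erase(It);
>>>         Live.erase(TargetCU); // last cross-CU ref resolved: unload it
>>>       }
>>>     }
>>>   };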
>>>
>>> But since it is a documented DWARF case, the tool should be ready for
>>> the case when inter-CU references are heavily used.
>>>
>>
>> Sure - but by implementing a CU liveness window like that (keeping CUs
>> live only so long as they need to be rather than an all-or-nothing
>> approach) only especially quirky inputs would hit the worst case while the
>> more normal inputs could perform better.
>>
>> It is not clear what should be put in such a CU liveness window. If
>> CU100 references CU1 - how could we know that we need to put CU1 into the
>> CU liveness window before we have processed CU100?
>>
> Fair point, not just forward references to worry about but backward
> references too. I wonder how much savings there is in the liveness analysis
> compared to "keep one copy of everything, no matter whether it's live or
> not", then it can be a pure forward progress situation. (with the quirk
> that you might emit a declaration for an entity once, then a definition for
> it later - alternatively if a declaration is seen it could be skipped under
> the assumption that a definition will follow (& use a forward ref fixup) -
> and if none is found, splat some stub declarations into a trailing CU at
> the end)
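>
> A sketch of that bookkeeping (illustrative names only): declarations are
> skipped on the assumption a definition will follow, and anything that was
> referenced but never defined gets a stub declaration in one artificial
> trailing CU:
>
>   #include <set>
>   #include <string>
>   #include <vector>
>
>   std::set<std::string> NeededDecls; // referenced, no definition seen
>   std::set<std::string> Defined;     // a definition has been emitted
>
>   void noteDefinition(const std::string &Name) {
>     Defined.insert(Name);
>     NeededDecls.erase(Name); // forward-ref fixups will point at the def
>   }
>
>   // At the very end: whatever was referenced but never defined gets a
>   // stub (DW_AT_declaration) DIE splatted into a trailing CU.
>   std::vector<std::string> stubsForTrailingCU() {
>     return {NeededDecls.begin(), NeededDecls.end()};
>   }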
>
>>
>>
>>
>>> Moreover, llvm-dwarfutil would be a tool producing
>>> exactly such a situation. The resulting file (produced by
>>> llvm-dwarfutil) would contain a lot of
>>> inter-CU references. There are probably no practical reasons to apply
>>> llvm-dwarfutil to the same
>>> file twice, but it would be a good test for the tool.
>>>
>>
>> It'd be a good stress test, but not necessarily something that would need
>> to perform the best because it wouldn't be a common use case.
>>
>> I agree that we should not slow down DWARFLinker in the common cases
>> only because we need to support the worst cases.
>> But we also need to implement a solution which works in some acceptable
>> manner for the worst case.
>>
> I think that depends on "acceptable" - correct, yes. Practical to run in
> reasonable time/memory? Not necessarily, in my opinion.
>
>> The current solution - loading everything into memory - makes it hard to
>> use in a non-dsymutil scenario (llvm-dwarfutil).
>>
> I agree it's worth exploring the non-dsymutil scenario, as you are - I'm
> just saying we don't necessarily need to support high usability (fast/low
> memory usage/etc) llvm-dwarfutil on an already dwarfutil'd binary (but as
> you've pointed out, the "window" is unknowable because of backward
> references, so this whole subthread is perhaps irrelevant).
>
>> There are several things which could be used to decide whether we need
>> to take a light or heavy path:
>>
>> 1. If the input contains only a single CU, we do not need to unload it
>> from memory; thus we would not need an extra DWARF loading pass.
>> 2. If the abbreviations from the whole input file do not contain
>> inter-CU references, then while doing liveness analysis we do not need
>> to wait until the other CUs are processed.
>>
> (2) Yeah, that /may/ be a good idea, cheap to test, etc. Though I'd still
> wonder if a more general implementation strategy could be found that would
> make it easier to get a sliding scale of efficiency depending on how many
> inter-CU references there were, not a "if there are none it's good, if
> there are any it's bad or otherwise very different to implement".
>
> I think there is a scenario which would make it possible to process each
> CU once for non-referenced CUs and handle inter-CU references in a
> scalable way (even for a dwarfutil'd binary):
>
> 1. Implement a global type table and type merging. This allows us to have
> all types in memory.
>    Then, all inter-CU type references would point into that in-memory
> type table.
>    (We do not know which CU should be put into a CU liveness window, and
> we cannot put all CUs into memory, but we can put all types into memory.)
>
> 2. If there are no other inter-CU references, then all CUs are handled in
> one pass.
>
> 3. If there are other inter-CU references, then after all CUs are
> processed by the first pass we would have a list of referenced CUs. We
> could then delete the already-cloned data (for the referenced CUs) and
> start the process again:
>    load CU, mark liveness, clone data. This second pass would be done
> only for the referenced CUs.
>    For not-complex, not closely coupled cases it would work relatively
> fast.
>
> 4. Put the in-memory type table into an artificial CU and update all type
> references (see the sketch below).
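>
> Roughly, in code, the flow could look like this (all helper names
> hypothetical; the bodies are empty stand-ins so the sketch stays
> self-contained):
>
>   #include <set>
>   #include <vector>
>
>   struct CompileUnit {};
>   struct TypeTable {}; // global in-memory merged types (step 1)
>
>   void load(CompileUnit &) {}
>   void markLiveness(CompileUnit &, TypeTable &) {}
>   // Clones one CU; records CUs hit by non-type inter-CU references.
>   void clone(CompileUnit &, TypeTable &, std::set<unsigned> &) {}
>   void unload(CompileUnit &) {}
>   void discardClonedData(unsigned) {}
>   void emitTypeCUAndPatchTypeRefs(TypeTable &) {} // step 4
>
>   void link(std::vector<CompileUnit> &CUs) {
>     TypeTable Types;
>     std::set<unsigned> Referenced; // hit by non-type inter-CU refs
>     for (unsigned I = 0; I < CUs.size(); ++I) { // one pass (step 2)
>       load(CUs[I]); markLiveness(CUs[I], Types);
>       clone(CUs[I], Types, Referenced); unload(CUs[I]);
>     }
>     // Second pass only for referenced CUs (step 3); a real
>     // implementation would drain a worklist here.
>     for (unsigned I : Referenced) {
>       discardClonedData(I);
>       load(CUs[I]); markLiveness(CUs[I], Types);
>       clone(CUs[I], Types, Referenced); unload(CUs[I]);
>     }
>     emitTypeCUAndPatchTypeRefs(Types); // artificial types CU (step 4)
>   }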
>
>
>
>> Then this scheme would be used for the worst cases:
>>
>> 1. for (CU : CU1...CU100) {
>>      load CU.
>>      analyse CU.
>>      unload CU.
>>    }
>> 2. for (CU : CU1...CU100) {
>>      load CU.
>>      clone CU.
>>      unload CU.
>>    }
>> 3. fixup forward references.
>>
>> and this scheme for the light cases:
>>
>> 1. for (CU : CU1...CU100) {
>>      load CU.
>>      analyse CU.
>>      clone CU.
>>      unload CU.
>>    }
>> 2. fixup forward references.
>>
>>
>>
>>> Generally, I think we should not assume that inter-CU references would
>>> be used in a limited way.
>>>
>>> Anyway, if this scheme:
>>>
>>> 1. analyse all CUs.
>>> 2. clone all CUs.
>>>
>>> turned out to work slowly, then we would need to continue with the
>>> one-pass solution and not support complex, closely coupled inputs.
>>>
>>
>> Yeah, certainly seeing the data/experiments will be interesting, if you
>> end up implementing some different strategies, etc.
>>
>> I guess one possibility for parallel generation could be something more
>> like Microsoft's approach with a central debug info server that compilers
>> communicate with - not that exact model, I mean, but if you've got parallel
>> threads generating reduced DWARF into separate object files - they could
>> communicate with a single thread responsible for type emission - the type
>> emitter would be given types from the separate threads and compute their
>> size, queue them up to be streamed out to the type CU (& keep the source CU
>> alive until that work was done) - such a central type emitter could quickly
>> determine the size of the type to be emitted and compute future type
>> offsets (eg: if 5 types were in the queue, it could've figured out the
>> offset of those types already) to answer type offset queries quickly and
>> unblock the parallel threads to continue emitting their CUs containing type
>> references.
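>>
>> A sketch of what such a central type emitter could look like
>> (hypothetical API, not Microsoft's actual design): because a type's
>> emitted size can be computed up front, the emitter can hand back final
>> offsets before the bytes are actually streamed, so workers never block:
>>
>>   #include <cstdint>
>>   #include <map>
>>   #include <mutex>
>>   #include <string>
>>
>>   class TypeEmitter {
>>     std::mutex M;
>>     std::map<std::string, uint64_t> Offsets; // unique type -> offset
>>     uint64_t NextOffset = 0; // end of everything queued so far
>>
>>   public:
>>     // Called from worker threads. Returns the offset the type will
>>     // have in the types CU - known immediately, even while emission is
>>     // still queued - so callers can write their type references.
>>     uint64_t getOrAssignOffset(const std::string &UniqueName,
>>                                uint64_t DIESize) {
>>       std::lock_guard<std::mutex> Lock(M);
>>       auto [It, Inserted] = Offsets.emplace(UniqueName, NextOffset);
>>       if (Inserted)
>>         NextOffset += DIESize; // queue the type for streaming later
>>       return It->second;
>>     }
>>   };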
>>
>> Yes, thank you. I will think about it.
>>
>> Alexey.
>>
>>
>> - Dave
>>
>>