[llvm-dev] [Proposal][Debuginfo] dsymutil-like tool for ELF.

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Mon Nov 2 10:12:31 PST 2020


On Mon, Nov 2, 2020 at 2:26 AM Alexey Lapshin <avl.lapshin at gmail.com> wrote:

>
> On 02.11.2020 04:11, David Blaikie wrote:
>
> I think if we're in the realm of DWARF extensions a whole bunch of other
> considerations come into it (& indeed, your suggested proposal may be a
> good one - but I think it's a very wide problem space once we're
> considering DWARF extensions). Mostly I was making
> arguments/suggestions/thoughts on the basis of being compatible with all
> existing DWARF producers.
>
> The described scenario does not assume DWARF extensions. A global type
> table is not a new DWARF construct - it is an artificial CU keeping all
> types. That solution would be compatible with existing DWARF
> consumers/producers.
>
Sorry, guess I'm not following. Maybe this conversation's getting a bit too
abstract/theoretical/forward looking for me right now - no worries. Happy
to chat more about it, but might be easier to focus on the immediate steps
forward for now & tackle this when it's the thing you're planning to work
on? (if I'm understanding correctly that this isn't a direction you're
thinking to try right now)

>
>
> On Sun, Nov 1, 2020 at 2:05 PM Alexey Lapshin <avl.lapshin at gmail.com>
> wrote:
>
>>
>> On 28.10.2020 20:38, David Blaikie wrote:
>>
>>
>>
>> On Wed, Oct 28, 2020 at 6:01 AM Alexey Lapshin <avl.lapshin at gmail.com>
>> wrote:
>>
>>>
>>> On 28.10.2020 01:49, David Blaikie wrote:
>>>
>>>
>>>
>>> On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <avl.lapshin at gmail.com>
>>> wrote:
>>>
>>>>
>>>> On 27.10.2020 20:32, David Blaikie wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Oct 27, 2020 at 1:23 AM Alexey Lapshin <avl.lapshin at gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> On 26.10.2020 22:38, David Blaikie wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Oct 25, 2020 at 9:31 AM Alexey Lapshin <avl.lapshin at gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On 23.10.2020 19:43, David Blaikie wrote:
>>>>>>
>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Ah, yeah - that seems like a missed opportunity - duplicating the
>>>>>>> whole type DIE. LTO does this by making monolithic types - merging all the
>>>>>>> members from different definitions of the same type into one, but that's
>>>>>>> maybe too expensive for dsymutil (might still be interesting to know how
>>>>>>> much more expensive, etc). But I think the other way to go would be to
>>>>>>> produce a declaration of the type, with the relevant members - and let the
>>>>>>> DWARF consumer identify this declaration as matching up with the earlier
>>>>>>> definition. That's the sort of DWARF you get from the non-MachO default
>>>>>>> -fno-standalone-debug anyway, so it's already pretty well tested/supported
>>>>>>> (support in lldb's a bit younger/more work-in-progress, admittedly). I
>>>>>>> wonder how much dsym size there is that could be reduced by such an
>>>>>>> implementation.
>>>>>>>
>>>>>>> I see. Yes, that could be done, and I think it would result in a
>>>>>>> noticeable size reduction (I do not know the exact numbers at the
>>>>>>> moment).
>>>>>>>
>>>>>>> I am working on the multi-threaded DWARFLinker now, and its first
>>>>>>> version will do exactly the same type processing as the current
>>>>>>> dsymutil.
>>>>>>>
>>>>>> Yeah, best to keep the behavior the same through that
>>>>>>
>>>>>>> The above scheme could be implemented as a next step, and it would
>>>>>>> result in a better size reduction (better than the current state).
>>>>>>>
>>>>>>> But I think an even better scheme could be implemented, and it
>>>>>>> would result in an even bigger size reduction and in faster
>>>>>>> execution. This scheme is similar to what you've described above:
>>>>>>> "LTO does - making monolithic types - merging all the members from
>>>>>>> different definitions of the same type into one".
>>>>>>>
>>>>>> I believe the reason that's probably not been done is that it can't
>>>>>> be streamed - it'd lead to buffering more of the output
>>>>>>
>>>>>> Yes. The fact that DWARF should be streamed into AsmPrinter
>>>>>> complicates parallel DWARF generation. In my prototype, I generate
>>>>>> several resulting files (one per source compilation unit) and then
>>>>>> sequentially glue them into the final resulting file.
>>>>>>
>>>>> How does that help? Do you use relocations in those intermediate
>>>>> object files so the DWARF in them can refer across files?
>>>>>
>>>>> It does not help with referring across files. It helps to parallelize
>>>>> the generation of the CU bodies.
>>>>> It is not possible to write two CUs in parallel into one AsmPrinter,
>>>>> so to make parallel generation possible I stream them into different
>>>>> AsmPrinters. (This comment is for "I believe the reason that's
>>>>> probably not been done is that it can't be streamed", which initially
>>>>> was about referring across files; it seems I added another direction.)
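>>>>>
>>>>> (To illustrate the shape of this - a toy sketch only, not the actual
>>>>> DWARFLinker code; emitCU is a hypothetical stand-in for streaming one
>>>>> CU through its own AsmPrinter:)
>>>>>
>>>>>   #include <future>
>>>>>   #include <string>
>>>>>   #include <vector>
>>>>>
>>>>>   // Hypothetical stand-in: renders one compile unit's bytes.
>>>>>   std::string emitCU(unsigned CUIndex) {
>>>>>     return "<CU" + std::to_string(CUIndex) + " bytes>";
>>>>>   }
>>>>>
>>>>>   // Each CU body is produced in parallel; the results are then
>>>>>   // "glued" together sequentially, preserving CU order.
>>>>>   std::string emitAllCUs(unsigned NumCUs) {
>>>>>     std::vector<std::future<std::string>> Workers;
>>>>>     for (unsigned I = 0; I < NumCUs; ++I)
>>>>>       Workers.push_back(std::async(std::launch::async, emitCU, I));
>>>>>
>>>>>     std::string Result;
>>>>>     for (auto &W : Workers)
>>>>>       Result += W.get();
>>>>>     return Result;
>>>>>   }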
>>>>>
>>>> Oh, I see - thanks for explaining, essentially buffering on-disk.
>>>>
>>>>>
>>>>>> (if two of these expandable types were in one CU - the start of the
>>>>>> second type couldn't be known until the end because it might keep getting
>>>>>> pushed later due to expansion of the first type) and/or having to revisit
>>>>>> all the type references (the offset to the second type wouldn't be known
>>>>>> until the end - so writing the offsets to refer to the type would have to
>>>>>> be deferred until then).
>>>>>>
>>>>>> That is the second problem: offsets are not known until the end of
>>>>>> the file.
>>>>>> dsymutil already has that situation for inter-CU references, so it
>>>>>> has an extra pass to fix up offsets.
>>>>>>
>>>>> Oh, it does? I figured it was one-pass, and that it only ever refers
>>>>> back to types in previous CUs? So it doesn't have to go back and do a
>>>>> second pass. But I guess if it sees a declaration of T1 in CU1, then later on
>>>>> sees a definition of T1 in CU2, does it somehow go back to CU1 and remove
>>>>> the declaration/make references refer to the definition in CU2? I figured
>>>>> it'd just leave the declaration and references to it as-is, then add the
>>>>> definition and use that from CU2 onwards?
>>>>>
>>>>> For the processing of types, it does not go back.
>>>>> This "I figured it was one-pass, and that it only ever refers back to
>>>>> types in previous CUs"
>>>>> and this "I figured it'd just leave the declaration and references to
>>>>> it as-is, then add the definition and use that from CU2 onwards" are
>>>>> both correct.
>>>>>
>>>> Great - thanks for explaining/confirming!
>>>>
>>>>>
>>>>>> With a multi-threaded implementation such situations would arise
>>>>>> more often for type references, and so more offsets would have to be
>>>>>> fixed during the additional pass.
>>>>>>
>>>>>>> DWARFLinker could create an additional artificial compile unit and
>>>>>>> put all the merged types there, and later patch all type references
>>>>>>> to point into this additional compilation unit. No bits would be
>>>>>>> duplicated in that case. The performance improvement could be
>>>>>>> achieved due to the smaller amount of copied DWARF and due to the
>>>>>>> fact that type references could be updated while the DWARF is
>>>>>>> cloned (no additional pass is needed for that).
>>>>>>>
>>>>>> "later patch all type references to point into this additional
>>>>>> compilation unit" - that's the additional pass that people are probably
>>>>>> talking/concerned about. Rewalking all the DWARF. The current dsymutil
>>>>>> approach, as far as I know, is single pass - it knows the final, absolute
>>>>>> offset to the type from the moment it emits that type/needs to refer to it.
>>>>>>
>>>>>> Right. The current dsymutil approach is single-pass. And from that
>>>>>> point of view, the solution which you've described (to produce a
>>>>>> declaration of the type, with the relevant members) allows keeping
>>>>>> that single-pass implementation.
>>>>>>
>>>>>> But there is a restriction in the current dsymutil approach: to
>>>>>> process inter-CU references it needs to load all DWARF into memory
>>>>>> (while it analyzes which parts of the DWARF are live, it needs to
>>>>>> have all CUs loaded into memory).
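>>>>>>
>>>>>> (For illustration - a toy sketch of that liveness marking with a
>>>>>> hypothetical DIE model, not the actual dsymutil code. It shows why a
>>>>>> reference into another CU forces that CU to be resident too:)
>>>>>>
>>>>>>   #include <vector>
>>>>>>
>>>>>>   struct DIE {
>>>>>>     bool Live = false;
>>>>>>     std::vector<DIE *> Refs; // attribute refs, possibly cross-CU
>>>>>>   };
>>>>>>
>>>>>>   // Start from a DIE relating to live code and transitively mark
>>>>>>   // everything it references.
>>>>>>   void markLive(DIE *Root) {
>>>>>>     std::vector<DIE *> Worklist{Root};
>>>>>>     while (!Worklist.empty()) {
>>>>>>       DIE *D = Worklist.back();
>>>>>>       Worklist.pop_back();
>>>>>>       if (D->Live)
>>>>>>         continue;
>>>>>>       D->Live = true;
>>>>>>       for (DIE *Ref : D->Refs) // may lead into another CU
>>>>>>         Worklist.push_back(Ref);
>>>>>>     }
>>>>>>   }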
>>>>>>
>>>>> All DWARF for a single file (which for dsymutil is mostly a single CU,
>>>>> except with LTO I guess?), not all DWARF for all inputs in memory at once,
>>>>> yeah?
>>>>>
>>>>> Right. In the dsymutil case - all DWARF for a single file (not all
>>>>> DWARF for all inputs in memory at once).
>>>>> But in the llvm-dwarfutil case a single file contains the DWARF for
>>>>> all the original input object files, and it all becomes loaded into
>>>>> memory.
>>>>>
>>>> Yeah, it would be great to try to go CU-by-CU.
>>>>
>>>>>> That leads to huge memory usage.
>>>>>> It is less important when the source is a set of object files (as in
>>>>>> the dsymutil case), but this becomes a real problem for the
>>>>>> llvm-dwarfutil utility when the source is a single file (with the
>>>>>> current implementation it needs 30 GB of memory to process the clang
>>>>>> binary).
>>>>>>
>>>>> Yeah, that's where I think you'd need a fixup pass one way or another
>>>>> - because cross-CU references can mean that when you figure out a new
>>>>> layout for CU5 (because it has a duplicate type definition of something in
>>>>> CU1) then you might have to touch CU4 that had an absolute/cross-CU forward
>>>>> reference to CU5. Once you've got such a fixup pass (if dsymutil already
>>>>> has one? Which, like I said, I'm confused why it would have one/that
>>>>> doesn't match my very vague understanding) then I think you could make
>>>>> dsymutil work on a per-CU basis streaming things out, then fixing up a few
>>>>> offsets.
>>>>>
>>>>> When dsymutil deduplicates types it changes a local CU reference into
>>>>> an inter-CU reference (so that CU2 (next) can reference the type
>>>>> definition from CU1 (prev)). To make this change it currently does not
>>>>> need to do any fixups.
>>>>>
>>>>> When dsymutil meets a pre-existing inter-CU reference (located in the
>>>>> input object file) pointing into a CU which has not been processed yet
>>>>> (so its offset is unknown), it marks it as a "forward reference" and
>>>>> patches it later, during the additional "fixup forward references"
>>>>> pass, once the offsets are known.
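>>>>>
>>>>> (A simplified sketch of that record-then-patch mechanism -
>>>>> hypothetical types, not the actual dsymutil code:)
>>>>>
>>>>>   #include <cstdint>
>>>>>   #include <cstring>
>>>>>   #include <unordered_map>
>>>>>   #include <vector>
>>>>>
>>>>>   struct ForwardRef {
>>>>>     uint64_t PatchOffset; // where the reference bytes live in the output
>>>>>     uint64_t TargetDie;   // identity of the DIE being referenced
>>>>>   };
>>>>>
>>>>>   // Runs once every DIE's final offset is known ("fixup forward
>>>>>   // references"): rewrite each recorded reference in place.
>>>>>   void fixupForwardReferences(
>>>>>       std::vector<uint8_t> &DebugInfo, const std::vector<ForwardRef> &Refs,
>>>>>       const std::unordered_map<uint64_t, uint64_t> &FinalDieOffsets) {
>>>>>     for (const ForwardRef &R : Refs) {
>>>>>       uint32_t Off = static_cast<uint32_t>(FinalDieOffsets.at(R.TargetDie));
>>>>>       std::memcpy(&DebugInfo[R.PatchOffset], &Off, sizeof(Off)); // 4-byte ref
>>>>>     }
>>>>>   }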
>>>>>
>>>> OK, so limited 2 pass system. (does it do that second pass once at the
>>>> end of the whole dsymutil run, or at the end of each input file? (so if an
>>>> input file has two CUs and the first CU references a type in the second CU
>>>> - it could write the first CU with a "forward reference", then write the
>>>> second CU, then fixup the forward reference - and then go on to the next
>>>> file and its CUs - this could improve performance by touching recently used
>>>> memory/disk pages only, rather than going all the way back to the start
>>>> later on when those pages have become cold)
>>>>
>>>> Yes, it does it at the end of each input file.
>>>>
>>>>
>>>>
>>>>> If CUs were processed in parallel, their offsets would not be known at
>>>>> the moment a local type reference is changed into an inter-CU
>>>>> reference. So we would need to do the same fix-up processing for all
>>>>> references to types as we already do for other inter-CU references.
>>>>>
>>>> Yeah - though the existence of this second "fixup forward references"
>>>> system - yeah, could just use it much more generally as you say. Not an
>>>> extra pass, just the existing second pass but having way more fixups to
>>>> fixup in that pass.
>>>>
>>>> If we were able to change the algorithm in this way:
>>>>
>>>> 1. analyse all CUs.
>>>> 2. clone all CUs.
>>>>
>>>> then we could create a merged type table (an artificial CU containing
>>>> the types) during step 1.
>>>> If that type table were written first, then all following CUs could use
>>>> the already-known offsets to the types, and we would not need
>>>> additional fix-up processing for type references.
>>>> It would still be necessary to fix up other inter-CU references, but it
>>>> would not be necessary to fix up type references (which constitute the
>>>> vast majority). A rough sketch of this follows below.
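>>>>
>>>> (A heavily simplified model of that - hypothetical names and a pretend
>>>> DIE size - just to show why type offsets are already final before the
>>>> cloning step starts:)
>>>>
>>>>   #include <cstdint>
>>>>   #include <map>
>>>>   #include <string>
>>>>   #include <vector>
>>>>
>>>>   struct CU { std::vector<std::string> TypeNames; };
>>>>
>>>>   // Step 1: analyse every CU and build the merged type table,
>>>>   // assigning each unique type its final offset in the type CU.
>>>>   std::map<std::string, uint64_t> buildTypeTable(const std::vector<CU> &CUs) {
>>>>     std::map<std::string, uint64_t> Table;
>>>>     uint64_t NextOffset = 0;
>>>>     for (const CU &U : CUs)
>>>>       for (const std::string &T : U.TypeNames)
>>>>         if (Table.emplace(T, NextOffset).second)
>>>>           NextOffset += 64; // pretend every type DIE occupies 64 bytes
>>>>     return Table;
>>>>   }
>>>>
>>>>   // Step 2 (cloning) emits the type table first, then each CU writes
>>>>   // its type references straight from Table - no later fix-up needed.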
>>>>
>>>
>>> To me, that sounds more expensive than the fixup forward references pass.
>>>
>>> In a direct comparison, then yes - loading the DWARF one more time
>>> looks more expensive than the fixup forward references pass. But
>>> looking at the general picture, it could probably still be beneficial:
>>>
>>> 1. Merging types would lead to a smaller resulting DWARF, which would
>>> speed up the process.
>>>    E.g. if we switched "odr types deduplication" off in the current
>>> implementation, the execution time would double; that is because more
>>> DWARF has to be cloned and written into the result. An implementation
>>> of "merging types" would probably have a similar effect - it would
>>> speed up the overall process. So on the one hand the additional step
>>> for loading DWARF would decrease performance, but on the other the
>>> smaller amount of resulting data would increase performance.
>>>
>>> 2. With the types put into the first CU, we would have a simple
>>> strategy for our liveness analysis algorithm: just always keep the
>>> first CU in memory. This would allow us to speed up the liveness
>>> analysis step.
>>>
>>> Anyway, all of the above is just an idea for future work. Currently, I
>>> am going to implement multi-threaded processing for CUs loaded into
>>> memory, with the same type processing as there currently is (which
>>> assumes that the "fixup forward references" pass starts doing more work
>>> by also fixing type references).
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>>> Without loading all CUs into memory it would require a two-pass
>>>>>> solution: a first pass to analyze which parts of the DWARF relate to
>>>>>> live code, and then a second pass to generate the result.
>>>>>>
>>>>> Not sure it'd require any more second pass than a "fixup" pass, which
>>>>> it sounds like you're saying it already has?
>>>>>
>>>>> It looks like it would need an additional pass to process inter-CU
>>>>> references (existing in the incoming file) if we do not want to load
>>>>> all CUs into memory.
>>>>>
>>>> Usually inter-CU references aren't used, except in LTO - and in LTO all
>>>> the DWARF deduplication and function discarding is already done by the IR
>>>> linker anyway. (ThinLTO is a bit different, but really we'd be better off
>>>> teaching it the extra tricks anyway (some can't be fixed in ThinLTO - like
>>>> emitting a "Home" definition of an inline function, only to find out other
>>>> ThinLTO backend/shards managed to optimize away all uses of the function...
>>>> so some cleanup may be useful there)). It might be possible to do a more
>>>> dynamic/rolling cache - keep only the CUs with unresolved cross-CU
>>>> references alive and only keep them alive until their cross-CU references
>>>> are found/marked alive. This should make things no worse than the
>>>> traditional dsymutil case - since cross-CU references are only
>>>> effective/generally used within a single object file (it's possible to
>>>> create relocations for them into other files - but I know LLVM doesn't
>>>> currently do this and I don't think GCC does it) with multiple CUs anyway -
>>>> so at most you'd keep all the CUs from a single original input file alive
>>>> together.
>>>>
>>>> But since it is a case documented in DWARF, the tool should be ready
>>>> for it (when inter-CU references are heavily used).
>>>>
>>>
>>> Sure - but by implementing a CU liveness window like that (keeping CUs
>>> live only so long as they need to be rather than an all-or-nothing
>>> approach) only especially quirky inputs would hit the worst case while the
>>> more normal inputs could perform better.
>>>
>>> It is not clear what should be put into such a CU liveness window. If
>>> CU100 references CU1 - how could we know that we need to put CU1 into
>>> the liveness window before we have processed CU100?
>>>
>> Fair point, not just forward references to worry about but backward
>> references too. I wonder how much savings there is in the liveness analysis
>> compared to "keep one copy of everything, no matter whether it's live or
>> not", then it can be a pure forward progress situation. (with the quirk
>> that you might emit a declaration for an entity once, then a definition for
>> it later - alternatively if a declaration is seen it could be skipped under
>> the assumption that a definition will follow (& use a forward ref fixup) -
>> and if none is found, splat some stub declarations into a trailing CU at
>> the end)
>>
>>>
>>>
>>>
>>>> Moreover, llvm-dwarfutil would be a tool producing exactly such a
>>>> situation. The resulting file (produced by llvm-dwarfutil) would
>>>> contain a lot of inter-CU references. There is probably no practical
>>>> reason to apply llvm-dwarfutil to the same file twice, but it would be
>>>> a good test for the tool.
>>>>
>>>
>>> It'd be a good stress test, but not necessarily something that would
>>> need to perform the best because it wouldn't be a common use case.
>>>
>>> I agree that we should not slow down the DWARFLinker in common cases
>>> only because we need to support the worst cases.
>>> But we also need to implement a solution which works in an acceptable
>>> manner for the worst case.
>>>
>> I think that depends on "acceptable" - correct, yes. Practical to run in
>> reasonable time/memory? Not necessarily, in my opinion.
>>
>>> The current solution - loading everything into memory - makes it hard
>>> to use in a non-dsymutil scenario (llvm-dwarfutil).
>>>
>> I agree it's worth exploring the non-dsymutil scenario, as you are - I'm
>> just saying we don't necessarily need to support high usability (fast/low
>> memory usage/etc) llvm-dwarfutil on an already dwarfutil'd binary (but as
>> you've pointed out, the "window" is unknowable because of backward
>> references, so this whole subthread is perhaps irrelevant).
>>
>>> There could be several things which could be used to decide whether we
>>> need to go on a light or heavy path:
>>>
>>> 1. If the input contains only a single CU, we do not need to unload it
>>> from memory. Thus we would not need an extra DWARF loading pass.
>>> 2. If the abbreviations from the whole input file do not contain
>>> inter-CU references, then while doing liveness analysis we do not need
>>> to wait until other CUs are processed.
>>>
>> (2) Yeah, that /may/ be a good idea, cheap to test, etc. Though I'd still
>> wonder if a more general implementation strategy could be found that would
>> make it easier to get a sliding scale of efficiency depending on how many
>> inter-CU references there were, not a "if there are none it's good, if
>> there are any it's bad or otherwise very different to implement".
>>
>> I think there is a scenario which would make it possible to process each
>> unreferenced CU only once and to handle inter-CU references in a
>> scalable way (even for a dwarfutil'd binary); a rough sketch follows the
>> list:
>>
>> 1. Implement a global type table and type merging. This allows us to
>> have all types in memory.
>>    Then all inter-CU type references would point into that in-memory
>> type table.
>>    (We do not know which CUs should be put into a CU liveness window,
>> and we also cannot put all CUs into memory, but we could put all types
>> into memory.)
>>
>> 2. If there are no other inter-CU references, then all CUs are handled
>> in one pass.
>>
>> 3. If there are other inter-CU references, then after all CUs have been
>> processed by the first pass we have a list of referenced CUs. We could
>> then delete the already-cloned data (for the referenced CUs) and start
>> the process again: load the CU, mark liveness, clone the data. This
>> second pass would be done only for the referenced CUs.
>>    For not-complex, not closely coupled cases this would work relatively
>> fast.
>>
>> 4. Put the in-memory type table into an artificial CU. Update all type
>> references.
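>>
>> (A rough sketch of that pass structure, with a trivial CU model and a
>> hypothetical loadAnalyseClone helper - not a real implementation:)
>>
>>   #include <cstddef>
>>   #include <set>
>>   #include <vector>
>>
>>   struct CU { bool Cloned = false; };
>>
>>   // Hypothetical: load one CU, mark liveness, clone it, unload it, and
>>   // record the indices of any other CUs it references.
>>   void loadAnalyseClone(std::vector<CU> &CUs, size_t Idx,
>>                         std::set<size_t> &Referenced) {
>>     CUs[Idx].Cloned = true;
>>     // ... would also insert indices of referenced CUs into Referenced.
>>   }
>>
>>   void link(std::vector<CU> &CUs) {
>>     std::set<size_t> Referenced;
>>     for (size_t I = 0; I < CUs.size(); ++I) // pass 1: every CU, once
>>       loadAnalyseClone(CUs, I, Referenced);
>>
>>     const std::set<size_t> Snapshot = Referenced;
>>     for (size_t Idx : Snapshot) {           // pass 2: referenced CUs only
>>       CUs[Idx].Cloned = false;              // drop the already-cloned data
>>       loadAnalyseClone(CUs, Idx, Referenced);
>>     }
>>     // Finally (step 4): emit the in-memory type table as an artificial
>>     // CU and update all type references.
>>   }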
>>
>>
>>
>>> Then this scheme would be used for the worst cases:
>>>
>>> 1. for (CU : CU1...CU100) {
>>>      load CU.
>>>      analyse CU.
>>>      unload CU.
>>>    }
>>> 2. for (CU : CU1...CU100) {
>>>      load CU.
>>>      clone CU.
>>>      unload CU.
>>>    }
>>> 3. fixup forward references.
>>>
>>> and this scheme for the light cases:
>>>
>>> 1. for (CU : CU1...CU100) {
>>>      load CU.
>>>      analyse CU.
>>>      clone CU.
>>>      unload CU.
>>>    }
>>> 2. fixup forward references.
>>>
>>>
>>>
>>>> Generally, I think we should not assume that inter-CU references will
>>>> be used only in a limited way.
>>>>
>>>> Anyway, if this scheme:
>>>>
>>>> 1. analyse all CUs.
>>>> 2. clone all CUs.
>>>>
>>>> turns out to be slow, then we would need to continue with the one-pass
>>>> solution and not support complex, closely coupled inputs.
>>>>
>>>
>>> yeah, certainly seeing the data/experiments will be interesting, if you
>>> end up implementing some different strategies, etc.
>>>
>>> I guess one possibility for parallel generation could be something more
>>> like Microsoft's approach with a central debug info server that compilers
>>> communicate with - not that exact model, I mean, but if you've got parallel
>>> threads generating reduced DWARF into separate object files - they could
>>> communicate with a single thread responsible for type emission - the type
>>> emitter would be given types from the separate threads and compute their
>>> size, queue them up to be streamed out to the type CU (& keep the source CU
>>> alive until that work was done) - such a central type emitter could quickly
>>> determine the size of the type to be emitted and compute future type
>>> offsets (eg: if 5 types were in the queue, it could've figured out the
>>> offset of those types already) to answer type offset queries quickly and
>>> unblock the parallel threads to continue emitting their CUs containing type
>>> references.
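>>>
>>> (A toy sketch of that central-emitter idea - hypothetical names,
>>> reduced to a size-only model: worker threads hand types to one emitter,
>>> which can assign a final offset the moment a type is queued, long
>>> before it is streamed out:)
>>>
>>>   #include <cstdint>
>>>   #include <map>
>>>   #include <mutex>
>>>   #include <string>
>>>
>>>   class TypeEmitter {
>>>     std::mutex M;
>>>     std::map<std::string, uint64_t> Offsets; // type id -> final offset
>>>     uint64_t NextOffset = 0;                 // end of the queued type CU
>>>
>>>   public:
>>>     // Called by parallel CU-emitting threads. The answer only depends
>>>     // on the sizes of previously queued types, not on whether they
>>>     // have been streamed out yet - so the caller is unblocked at once.
>>>     uint64_t getOrAssignOffset(const std::string &TypeId, uint64_t DieSize) {
>>>       std::lock_guard<std::mutex> Lock(M);
>>>       auto [It, Inserted] = Offsets.emplace(TypeId, NextOffset);
>>>       if (Inserted)
>>>         NextOffset += DieSize; // reserve space in the type CU
>>>       return It->second;
>>>     }
>>>   };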
>>>
>>> Yes, thank you. I will think about it.
>>>
>>> Alexey.
>>>
>>>
>>> - Dave
>>>
>>>

