<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com">avl.lapshin@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

  

    

  

  <div>

    <p><br>

    </p>

    <div>On 27.10.2020 20:32, David Blaikie

      wrote:<br>

    </div>

    <blockquote type="cite">

      

      <div dir="ltr">

        <div dir="ltr"><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Tue, Oct 27, 2020 at 1:23

            AM Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com" target="_blank">avl.lapshin@gmail.com</a>>

            wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p><br>

              </p>

              <div>On 26.10.2020 22:38, David Blaikie wrote:<br>

              </div>

              <blockquote type="cite">

                <div dir="ltr">

                  <div dir="ltr"><br>

                  </div>

                  <br>

                  <div class="gmail_quote">

                    <div dir="ltr" class="gmail_attr">On Sun, Oct 25,

                      2020 at 9:31 AM Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com" target="_blank">avl.lapshin@gmail.com</a>>

                      wrote:<br>

                    </div>

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                      <div>

                        <p><br>

                        </p>

                        <div>On 23.10.2020 19:43, David Blaikie wrote:<br>

                        </div>

                        <blockquote type="cite">

                          <div dir="ltr">

                            <div class="gmail_quote">

                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                                <blockquote type="cite">

                                  <div dir="ltr">

                                    <div class="gmail_quote">

                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                                        <div><br>

                                          <br>

                                        </div>

                                      </blockquote>

                                      <div><br>

                                      </div>

                                      <div>Ah, yeah - that seems like a

                                        missed opportunity - duplicating

                                        the whole type DIE. LTO does

                                        this by making monolithic types

                                        - merging all the members from

                                        different definitions of the

                                        same type into one, but that's

                                        maybe too expensive for dsymutil

                                        (might still be interesting to

                                        know how much more expensive,

                                        etc). But I think the other way

                                        to go would be to produce a

                                        declaration of the type, with

                                        the relevant members - and let

                                        the DWARF consumer identify this

                                        declaration as matching up with

                                        the earlier definition. That's

                                        the sort of DWARF you get from

                                        the non-MachO default

                                        -fno-standalone-debug anyway, so

                                        it's already pretty well

                                        tested/supported (support in

                                        lldb's a bit younger/more

                                        work-in-progress, admittedly). I

                                        wonder how much dsym size there

                                        is that could be reduced by such

                                        an implementation.</div>

                                    </div>

                                  </div>

                                </blockquote>

                                <p>I see. Yes, that could be done and I

                                  think it would result in noticeable

                                  size reduction(I do not know exact

                                  numbers at the moment).</p>

                                <p>I work on multi-thread DWARFLinker

                                  now and it`s first version will do

                                  exactly the same type processing like

                                  current dsymutil.</p>

                              </blockquote>

                              <div>Yeah, best to keep the behavior the

                                same through that</div>

                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                                <div>

                                  <p>Above scheme could be implemented

                                    as a next step and it would result

                                    in better size reduction(better than

                                    current state).</p>

                                  <p>But I think the better scheme could

                                    be done also and it would result in

                                    even bigger size reduction and in

                                    faster execution. This scheme is

                                    something similar to what you`ve

                                    described above: "LTO does - making

                                    monolithic types - merging all the

                                    members from different definitions

                                    of the same type into one".</p>

                                </div>

                              </blockquote>

                              <div>I believe the reason that's probably

                                not been done is that it can't be

                                streamed - it'd lead to buffering more

                                of the output </div>

                            </div>

                          </div>

                        </blockquote>

                        <p>yes. The fact that DWARF should be streamed

                          into AsmPrinter complicates parallel dwarf

                          generation. In my prototype, I generate <br>

                          several resulting files(each for one source

                          compilation unit) and then sequentially glue

                          them into the final resulting file.<br>

                        </p>

                      </div>

                    </blockquote>

                    <div>How does that help? Do you use relocations in

                      those intermediate object files so the DWARF in

                      them can refer across files? <br>

                    </div>

                  </div>

                </div>

              </blockquote>

              <p>It does not help with referring across the file. It

                helps to parallel the generation of CU bodies. <br>

                It is not possible to write two CUs in parallel into

                AsmPrinter. To make possible parallel generation I

                stream them into different AsmPrinters(this comment is

                for "I believe the reason that's probably not been done

                is that it can't be streamed". which initially was about

                referring across the file, but it seems I added another

                direction).<br>

              </p>

            </div>

          </blockquote>

          <div>Oh, I see - thanks for explaining, essentially buffering

            on-disk. <br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p> </p>

              <blockquote type="cite">

                <div dir="ltr">

                  <div class="gmail_quote">

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                      <div>

                        <p> </p>

                        <p><br>

                        </p>

                        <blockquote type="cite">

                          <div dir="ltr">

                            <div class="gmail_quote">

                              <div>(if two of these expandable types

                                were in one CU - the start of the second

                                type couldn't be known until the end

                                because it might keep getting pushed

                                later due to expansion of the first

                                type) and/or having to revisit all the

                                type references (the offset to the

                                second type wouldn't be known until the

                                end - so writing the offsets to refer to

                                the type would have to be deferred until

                                then).<br>

                              </div>

                            </div>

                          </div>

                        </blockquote>

                        <p>That is the second problem: offsets are not

                          known until the end of file.<br>

                          dsymutil already has that situation for

                          inter-CU references, so it has extra pass to<br>

                          fixup offsets. </p>

                      </div>

                    </blockquote>

                    <div>Oh, it does? I figured it was one-pass, and

                      that it only ever refers back to types in previous

                      CUs? So it doesn't have to go back and do a second

                      pass. But I guess if sees a declaration of T1 in

                      CU1, then later on sees a definition of T1 in CU2,

                      does it somehow go back to CU1 and remove the

                      declaration/make references refer to the

                      definition in CU2? I figured it'd just leave the

                      declaration and references to it as-is, then add

                      the definition and use that from CU2 onwards? <br>

                    </div>

                  </div>

                </div>

              </blockquote>

              <p>For the processing of the types, it do not go back. <br>

                This "I figured it was one-pass, and that it only ever

                refers back to types in previous CUs" <br>

                and this "I figured it'd just leave the declaration and

                references to it as-is, then add the definition and use

                that from CU2 onwards" are correct. <br>

              </p>

            </div>

          </blockquote>

          <div>Great - thanks for explaining/confirming! </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p> <br>

              </p>

              <blockquote type="cite">

                <div dir="ltr">

                  <div class="gmail_quote">

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                      <div>

                        <p>With multi-thread implementation such

                          situation would arise more often <br>

                          for type references and so more offsets should

                          be fixed during additional pass.<br>

                        </p>

                        <blockquote type="cite">

                          <div dir="ltr">

                            <div class="gmail_quote">

                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                                <div>

                                  <p>DWARFLinker could create additional

                                    artificial compile unit and put all

                                    merged types there. Later patch all

                                    type references to point into this

                                    additional compilation unit.  No any

                                    bits would be duplicated in that

                                    case. The performance improvement

                                    could be achieved due to less amount

                                    of the copied DWARF and due to the

                                    fact that type references could be

                                    updated when DWARF is cloned(no need

                                    in additional pass for that).<br>

                                  </p>

                                </div>

                              </blockquote>

                              <div>"later patch all type references to

                                point into this additional compilation

                                unit" - that's the additional pass that

                                people are probably talking/concerned

                                about. Rewalking all the DWARF. The

                                current dsymutil approach, as far as I

                                know, is single pass - it knows the

                                final, absolute offset to the type from

                                the moment it emits that type/needs to

                                refer to it. <br>

                              </div>

                            </div>

                          </div>

                        </blockquote>

                        <p>Right. Current dsymutil approach is single

                          pass. And from that point of view, solution <br>

                          which you`ve described(to produce a

                          declaration of the type, with the relevant

                          members) <br>

                          allows to keep that single pass

                          implementation.<br>

                          <br>

                          But there is a restriction for current

                          dsymutil approach: To process inter-CU

                          references <br>

                          it needs to load all DWARF into the

                          memory(While it analyzes which part of DWARF

                          is live, <br>

                          it needs to have all CUs loaded into the

                          memory).</p>

                      </div>

                    </blockquote>

                    <div>All DWARF for a single file (which for dsymutil

                      is mostly a single CU, except with LTO I guess?),

                      not all DWARF for all inputs in memory at once,

                      yeah? <br>

                    </div>

                  </div>

                </div>

              </blockquote>

              <p>right. In dsymutil case - all DWARF for a single

                file(not all DWARF for all inputs in memory at once).<br>

                But in llvm-dwarfutil case single file contains DWARF

                for all original input object files and it all becomes<br>

                loaded into memory.<br>

              </p>

            </div>

          </blockquote>

          <div>Yeha, would be great to try to go CU-by-CU. </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <blockquote type="cite">

                <div dir="ltr">

                  <div class="gmail_quote">

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                      <div>

                        <p>That leads to huge memory usage. <br>

                          It is less important when source is a set of

                          object files(like in dsymutil case) and this <br>

                          become a real problem for llvm-dwarfutil

                          utility when source is a single file(With

                          current <br>

                          implementation it needs 30G of memory for

                          compiling clang binary).<br>

                        </p>

                      </div>

                    </blockquote>

                    <div>Yeah, that's where I think you'd need a fixup

                      pass one way or another - because cross-CU

                      references can mean that when you figure out a new

                      layout for CU5 (because it has a duplicate type

                      definition of something in CU1) then you might

                      have to touch CU4 that had an absolute/cross-CU

                      forward reference to CU5. Once you've got such a

                      fixup pass (if dsymutil already has one? Which,

                      like I said, I'm confused why it would have

                      one/that doesn't match my very vague

                      understanding) then I think you could make

                      dsymutil work on a per-CU basis streaming things

                      out, then fixing up a few offsets.<br>

                    </div>

                  </div>

                </div>

              </blockquote>

              <p>When dsymutil deduplicates types it changes local CU

                reference into inter-CU reference(so that CU2(next)

                could reference type definition from CU1(prev)). To do

                this change it does not need to do any fixups currently.<br>

                <br>

                When dsymutil meets already existed(located in the input

                object file) inter-CU reference pointing into the CU

                which has not been processed yet(and then its offset is

                unknown) it marks it as "forward reference" and patches

                later during additional pass "fixup forward references"

                at a time when offsets are known. <br>

              </p>

            </div>

          </blockquote>

          <div>OK, so limited 2 pass system. (does it do that second

            pass once at the end of the whole dsymutil run, or at the

            end of each input file? (so if an input file has two CUs and

            the first CU references a type in the second CU - it could

            write the first CU with a "forward reference", then write

            the second CU, then fixup the forward reference - and then

            go on to the next file and its CUs - this could improve

            performance by touching recently used memory/disk pages

            only, rather than going all the way back to the start later

            on when those pages have become cold)</div>

        </div>

      </div>

    </blockquote>

    <p>yes, It does it in the end of each input file.</p>

    <p><br>

    </p>

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_quote">

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p> <br>

                If CUs would be processed in parallel their offsets

                would not be known at the moment when local type

                reference would be changed into inter-CU reference. So

                we would need to do the same fix-up processing for all

                references to the types like we already do for other

                inter-CU references.<br>

              </p>

            </div>

          </blockquote>

          <div>Yeah - though the existence of this second "fixup forward

            references" system - yeah, could just use it much more

            generally as you say. Not an extra pass, just the existing

            second pass but having way more fixups to fixup in that

            pass.</div>

        </div>

      </div>

    </blockquote>

    If we would be able to change the algorithm in such way : <br>

    <br>

    1. analyse all CUs.<br>

    2. clone all CUs.<br>

    <br>

    Then we could create a merged type table(artificial CU containing

    types) during step1. <br>

    If that type table would be written first, then all following CUs

    could use known offsets <br>

    to the types and we would not need additional fix-up processing for

    type references. <br>

    It would still be necessary to fix-up other inter-CU references. But

    it would not be necessary <br>

    to fix-up type references (which constitute the vast majority).<br></div></blockquote><div><br></div><div>To me, that sounds more expensive than the fixup forward references pass.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>

    <br>

    <blockquote type="cite">

      <div dir="ltr">

        <div class="gmail_quote">

          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

            <div>

              <p> <br>

              </p>

              <blockquote type="cite">

                <div dir="ltr">

                  <div class="gmail_quote">

                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

                      <div>

                        <p>Without loading all CU into the memory it

                          would require two passes solution. First to

                          analyze <br>

                          which part of DWARF relates to live code and

                          then second pass to generate the result. <br>

                        </p>

                      </div>

                    </blockquote>

                    <div>Not sure it'd require any more second pass than

                      a "fixup" pass, which it sounds like you're saying

                      it already has? <br>

                    </div>

                  </div>

                </div>

              </blockquote>

              <p>It looks like it would need an additional pass to

                process inter-CU references(existed in incoming file) if

                we do not want to load all CUs into memory.<br>

              </p>

            </div>

          </blockquote>

          <div>Usually inter-CU references aren't used, except in LTO -

            and in LTO all the DWARF deduplication and function

            discarding is already done by the IR linker anyway. (ThinLTO

            is a bit different, but really we'd be better off teaching

            it the extra tricks anyway (some can't be fixed in ThinLTO -

            like emitting a "Home" definition of an inline function,

            only to find out other ThinLTO backend/shards managed to

            optimize away all uses of the function... so some cleanup

            may be useful there)). It might be possible to do a more

            dynamic/rolling cache - keep only the CUs with unresolved

            cross-CU references alive and only keep them alive until

            their cross-CU references are found/marked alive. This

            should make things no worse than the traditional dsymutil

            case - since cross-CU references are only

            effective/generally used within a single object file (it's

            possible to create relocations for them into other files -

            but I know LLVM doesn't currently do this and I don't think

            GCC does it) with multiple CUs anyway - so at most you'd

            keep all the CUs from a single original input file alive

            together.<br>

          </div>

        </div>

      </div>

    </blockquote>

    But, since it is a DWARF documented case the tool should be ready

    for such case(when inter-CU <br>

    references are heavily used).</div></blockquote><div><br>Sure - but by implementing a CU liveness window like that (keeping CUs live only so long as they need to be rather than an all-or-nothing approach) only especially quirky inputs would hit the worst case while the more normal inputs could perform better.<br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> Moreover, llvm-dwarfutil would be the

    tool producing <br>

    exactly such situation. The resulting file(produced by

    llvm-dwarfutil) would contain a lot of <br>

    inter-CU references. Probably, there is no practical reasons to

    apply llvm-dwarfutil to the same <br>

    file twice but it would be a good test for the tool.<br></div></blockquote><div><br>It'd be a good stress test, but not necessarily something that would need to perform the best because it wouldn't be a common use case.<br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Generally, I think we should not assume that inter-CU references

    would be used in a limited way.<br>

    <br>

    Anyway, if this scheme:  <br>

    <br>

    1. analyse all CUs.<br>

    2. clone all CUs.<br>

    <p>would work slow then we would need to continue with one-pass

      solution and not support complex closely coupled inputs.<br></p></div></blockquote><div><br></div><div>yeah, certainly seeing the data/experiments will be interesting, if you end up implementing some different strategies, etc.<br><br>I guess one possibility for parallel generation could be something more like Microsoft's approach with a central debug info server that compilers communicate with - not that exact model, I mean, but if you've got parallel threads generating reduced DWARF into separate object files - they could communicate with a single thread responsible for type emission - the type emitter would be given types from the separate threads and compute their size, queue them up to be streamed out to the type CU (& keep the source CU alive until that work was done) - such a central type emitter could quickly determine the size of the type to be emitted and compute future type offsets (eg: if 5 types were in the queue, it could've figured out the offset of those types already) to answer type offset queries quickly and unblock the parallel threads to continue emitting their CUs containing type references.<br><br>- Dave </div></div></div>