<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Oct 27, 2020 at 12:34 PM Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com">avl.lapshin@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p><br>
    </p>
    <div>On 27.10.2020 20:32, David Blaikie
      wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div dir="ltr"><br>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Tue, Oct 27, 2020 at 1:23
            AM Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com" target="_blank">avl.lapshin@gmail.com</a>>
            wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p><br>
              </p>
              <div>On 26.10.2020 22:38, David Blaikie wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="ltr"><br>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Sun, Oct 25,
                      2020 at 9:31 AM Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com" target="_blank">avl.lapshin@gmail.com</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p><br>
                        </p>
                        <div>On 23.10.2020 19:43, David Blaikie wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div class="gmail_quote">
                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                <blockquote type="cite">
                                  <div dir="ltr">
                                    <div class="gmail_quote">
                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                        <div><br>
                                          <br>
                                        </div>
                                      </blockquote>
                                      <div><br>
                                      </div>
                                      <div>Ah, yeah - that seems like a
                                        missed opportunity - duplicating
                                        the whole type DIE. LTO does
                                        this by making monolithic types
                                        - merging all the members from
                                        different definitions of the
                                        same type into one, but that's
                                        maybe too expensive for dsymutil
                                        (might still be interesting to
                                        know how much more expensive,
                                        etc). But I think the other way
                                        to go would be to produce a
                                        declaration of the type, with
                                        the relevant members - and let
                                        the DWARF consumer identify this
                                        declaration as matching up with
                                        the earlier definition. That's
                                        the sort of DWARF you get from
                                        the non-MachO default
                                        -fno-standalone-debug anyway, so
                                        it's already pretty well
                                        tested/supported (support in
                                        lldb's a bit younger/more
                                        work-in-progress, admittedly). I
                                        wonder how much dsym size there
                                        is that could be reduced by such
                                        an implementation.</div>
                                    </div>
                                  </div>
                                </blockquote>
                                <p>I see. Yes, that could be done and I
                                  think it would result in noticeable
                                  size reduction(I do not know exact
                                  numbers at the moment).</p>
                                <p>I work on multi-thread DWARFLinker
                                  now and it`s first version will do
                                  exactly the same type processing like
                                  current dsymutil.</p>
                              </blockquote>
                              <div>Yeah, best to keep the behavior the
                                same through that</div>
                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                <div>
                                  <p>Above scheme could be implemented
                                    as a next step and it would result
                                    in better size reduction(better than
                                    current state).</p>
                                  <p>But I think the better scheme could
                                    be done also and it would result in
                                    even bigger size reduction and in
                                    faster execution. This scheme is
                                    something similar to what you`ve
                                    described above: "LTO does - making
                                    monolithic types - merging all the
                                    members from different definitions
                                    of the same type into one".</p>
                                </div>
                              </blockquote>
                              <div>I believe the reason that's probably
                                not been done is that it can't be
                                streamed - it'd lead to buffering more
                                of the output </div>
                            </div>
                          </div>
                        </blockquote>
                        <p>yes. The fact that DWARF should be streamed
                          into AsmPrinter complicates parallel dwarf
                          generation. In my prototype, I generate <br>
                          several resulting files(each for one source
                          compilation unit) and then sequentially glue
                          them into the final resulting file.<br>
                        </p>
                      </div>
                    </blockquote>
                    <div>How does that help? Do you use relocations in
                      those intermediate object files so the DWARF in
                      them can refer across files? <br>
                    </div>
                  </div>
                </div>
              </blockquote>
              <p>It does not help with referring across the file. It
                helps to parallel the generation of CU bodies. <br>
                It is not possible to write two CUs in parallel into
                AsmPrinter. To make possible parallel generation I
                stream them into different AsmPrinters(this comment is
                for "I believe the reason that's probably not been done
                is that it can't be streamed". which initially was about
                referring across the file, but it seems I added another
                direction).<br>
              </p>
            </div>
          </blockquote>
          <div>Oh, I see - thanks for explaining, essentially buffering
            on-disk. <br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p> </p>
              <blockquote type="cite">
                <div dir="ltr">
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p> </p>
                        <p><br>
                        </p>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div class="gmail_quote">
                              <div>(if two of these expandable types
                                were in one CU - the start of the second
                                type couldn't be known until the end
                                because it might keep getting pushed
                                later due to expansion of the first
                                type) and/or having to revisit all the
                                type references (the offset to the
                                second type wouldn't be known until the
                                end - so writing the offsets to refer to
                                the type would have to be deferred until
                                then).<br>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                        <p>That is the second problem: offsets are not
                          known until the end of file.<br>
                          dsymutil already has that situation for
                          inter-CU references, so it has extra pass to<br>
                          fixup offsets. </p>
                      </div>
                    </blockquote>
                    <div>Oh, it does? I figured it was one-pass, and
                      that it only ever refers back to types in previous
                      CUs? So it doesn't have to go back and do a second
                      pass. But I guess if sees a declaration of T1 in
                      CU1, then later on sees a definition of T1 in CU2,
                      does it somehow go back to CU1 and remove the
                      declaration/make references refer to the
                      definition in CU2? I figured it'd just leave the
                      declaration and references to it as-is, then add
                      the definition and use that from CU2 onwards? <br>
                    </div>
                  </div>
                </div>
              </blockquote>
              <p>For the processing of the types, it do not go back. <br>
                This "I figured it was one-pass, and that it only ever
                refers back to types in previous CUs" <br>
                and this "I figured it'd just leave the declaration and
                references to it as-is, then add the definition and use
                that from CU2 onwards" are correct. <br>
              </p>
            </div>
          </blockquote>
          <div>Great - thanks for explaining/confirming! </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p> <br>
              </p>
              <blockquote type="cite">
                <div dir="ltr">
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>With multi-thread implementation such
                          situation would arise more often <br>
                          for type references and so more offsets should
                          be fixed during additional pass.<br>
                        </p>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div class="gmail_quote">
                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                <div>
                                  <p>DWARFLinker could create additional
                                    artificial compile unit and put all
                                    merged types there. Later patch all
                                    type references to point into this
                                    additional compilation unit.  No any
                                    bits would be duplicated in that
                                    case. The performance improvement
                                    could be achieved due to less amount
                                    of the copied DWARF and due to the
                                    fact that type references could be
                                    updated when DWARF is cloned(no need
                                    in additional pass for that).<br>
                                  </p>
                                </div>
                              </blockquote>
                              <div>"later patch all type references to
                                point into this additional compilation
                                unit" - that's the additional pass that
                                people are probably talking/concerned
                                about. Rewalking all the DWARF. The
                                current dsymutil approach, as far as I
                                know, is single pass - it knows the
                                final, absolute offset to the type from
                                the moment it emits that type/needs to
                                refer to it. <br>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                        <p>Right. Current dsymutil approach is single
                          pass. And from that point of view, solution <br>
                          which you`ve described(to produce a
                          declaration of the type, with the relevant
                          members) <br>
                          allows to keep that single pass
                          implementation.<br>
                          <br>
                          But there is a restriction for current
                          dsymutil approach: To process inter-CU
                          references <br>
                          it needs to load all DWARF into the
                          memory(While it analyzes which part of DWARF
                          is live, <br>
                          it needs to have all CUs loaded into the
                          memory).</p>
                      </div>
                    </blockquote>
                    <div>All DWARF for a single file (which for dsymutil
                      is mostly a single CU, except with LTO I guess?),
                      not all DWARF for all inputs in memory at once,
                      yeah? <br>
                    </div>
                  </div>
                </div>
              </blockquote>
              <p>right. In dsymutil case - all DWARF for a single
                file(not all DWARF for all inputs in memory at once).<br>
                But in llvm-dwarfutil case single file contains DWARF
                for all original input object files and it all becomes<br>
                loaded into memory.<br>
              </p>
            </div>
          </blockquote>
          <div>Yeha, would be great to try to go CU-by-CU. </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>That leads to huge memory usage. <br>
                          It is less important when source is a set of
                          object files(like in dsymutil case) and this <br>
                          become a real problem for llvm-dwarfutil
                          utility when source is a single file(With
                          current <br>
                          implementation it needs 30G of memory for
                          compiling clang binary).<br>
                        </p>
                      </div>
                    </blockquote>
                    <div>Yeah, that's where I think you'd need a fixup
                      pass one way or another - because cross-CU
                      references can mean that when you figure out a new
                      layout for CU5 (because it has a duplicate type
                      definition of something in CU1) then you might
                      have to touch CU4 that had an absolute/cross-CU
                      forward reference to CU5. Once you've got such a
                      fixup pass (if dsymutil already has one? Which,
                      like I said, I'm confused why it would have
                      one/that doesn't match my very vague
                      understanding) then I think you could make
                      dsymutil work on a per-CU basis streaming things
                      out, then fixing up a few offsets.<br>
                    </div>
                  </div>
                </div>
              </blockquote>
              <p>When dsymutil deduplicates types it changes local CU
                reference into inter-CU reference(so that CU2(next)
                could reference type definition from CU1(prev)). To do
                this change it does not need to do any fixups currently.<br>
                <br>
                When dsymutil meets already existed(located in the input
                object file) inter-CU reference pointing into the CU
                which has not been processed yet(and then its offset is
                unknown) it marks it as "forward reference" and patches
                later during additional pass "fixup forward references"
                at a time when offsets are known. <br>
              </p>
            </div>
          </blockquote>
          <div>OK, so limited 2 pass system. (does it do that second
            pass once at the end of the whole dsymutil run, or at the
            end of each input file? (so if an input file has two CUs and
            the first CU references a type in the second CU - it could
            write the first CU with a "forward reference", then write
            the second CU, then fixup the forward reference - and then
            go on to the next file and its CUs - this could improve
            performance by touching recently used memory/disk pages
            only, rather than going all the way back to the start later
            on when those pages have become cold)</div>
        </div>
      </div>
    </blockquote>
    <p>yes, It does it in the end of each input file.</p>
    <p><br>
    </p>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p> <br>
                If CUs would be processed in parallel their offsets
                would not be known at the moment when local type
                reference would be changed into inter-CU reference. So
                we would need to do the same fix-up processing for all
                references to the types like we already do for other
                inter-CU references.<br>
              </p>
            </div>
          </blockquote>
          <div>Yeah - though the existence of this second "fixup forward
            references" system - yeah, could just use it much more
            generally as you say. Not an extra pass, just the existing
            second pass but having way more fixups to fixup in that
            pass.</div>
        </div>
      </div>
    </blockquote>
    If we would be able to change the algorithm in such way : <br>
    <br>
    1. analyse all CUs.<br>
    2. clone all CUs.<br>
    <br>
    Then we could create a merged type table(artificial CU containing
    types) during step1. <br>
    If that type table would be written first, then all following CUs
    could use known offsets <br>
    to the types and we would not need additional fix-up processing for
    type references. <br>
    It would still be necessary to fix-up other inter-CU references. But
    it would not be necessary <br>
    to fix-up type references (which constitute the vast majority).<br></div></blockquote><div><br></div><div>To me, that sounds more expensive than the fixup forward references pass.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>
    <br>
    <blockquote type="cite">
      <div dir="ltr">
        <div class="gmail_quote">
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p> <br>
              </p>
              <blockquote type="cite">
                <div dir="ltr">
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>Without loading all CU into the memory it
                          would require two passes solution. First to
                          analyze <br>
                          which part of DWARF relates to live code and
                          then second pass to generate the result. <br>
                        </p>
                      </div>
                    </blockquote>
                    <div>Not sure it'd require any more second pass than
                      a "fixup" pass, which it sounds like you're saying
                      it already has? <br>
                    </div>
                  </div>
                </div>
              </blockquote>
              <p>It looks like it would need an additional pass to
                process inter-CU references(existed in incoming file) if
                we do not want to load all CUs into memory.<br>
              </p>
            </div>
          </blockquote>
          <div>Usually inter-CU references aren't used, except in LTO -
            and in LTO all the DWARF deduplication and function
            discarding is already done by the IR linker anyway. (ThinLTO
            is a bit different, but really we'd be better off teaching
            it the extra tricks anyway (some can't be fixed in ThinLTO -
            like emitting a "Home" definition of an inline function,
            only to find out other ThinLTO backend/shards managed to
            optimize away all uses of the function... so some cleanup
            may be useful there)). It might be possible to do a more
            dynamic/rolling cache - keep only the CUs with unresolved
            cross-CU references alive and only keep them alive until
            their cross-CU references are found/marked alive. This
            should make things no worse than the traditional dsymutil
            case - since cross-CU references are only
            effective/generally used within a single object file (it's
            possible to create relocations for them into other files -
            but I know LLVM doesn't currently do this and I don't think
            GCC does it) with multiple CUs anyway - so at most you'd
            keep all the CUs from a single original input file alive
            together.<br>
          </div>
        </div>
      </div>
    </blockquote>
    But, since it is a DWARF documented case the tool should be ready
    for such case(when inter-CU <br>
    references are heavily used).</div></blockquote><div><br>Sure - but by implementing a CU liveness window like that (keeping CUs live only so long as they need to be rather than an all-or-nothing approach) only especially quirky inputs would hit the worst case while the more normal inputs could perform better.<br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div> Moreover, llvm-dwarfutil would be the
    tool producing <br>
    exactly such situation. The resulting file(produced by
    llvm-dwarfutil) would contain a lot of <br>
    inter-CU references. Probably, there is no practical reasons to
    apply llvm-dwarfutil to the same <br>
    file twice but it would be a good test for the tool.<br></div></blockquote><div><br>It'd be a good stress test, but not necessarily something that would need to perform the best because it wouldn't be a common use case.<br> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div>Generally, I think we should not assume that inter-CU references
    would be used in a limited way.<br>
    <br>
    Anyway, if this scheme:  <br>
    <br>
    1. analyse all CUs.<br>
    2. clone all CUs.<br>
    <p>would work slow then we would need to continue with one-pass
      solution and not support complex closely coupled inputs.<br></p></div></blockquote><div><br></div><div>yeah, certainly seeing the data/experiments will be interesting, if you end up implementing some different strategies, etc.<br><br>I guess one possibility for parallel generation could be something more like Microsoft's approach with a central debug info server that compilers communicate with - not that exact model, I mean, but if you've got parallel threads generating reduced DWARF into separate object files - they could communicate with a single thread responsible for type emission - the type emitter would be given types from the separate threads and compute their size, queue them up to be streamed out to the type CU (& keep the source CU alive until that work was done) - such a central type emitter could quickly determine the size of the type to be emitted and compute future type offsets (eg: if 5 types were in the queue, it could've figured out the offset of those types already) to answer type offset queries quickly and unblock the parallel threads to continue emitting their CUs containing type references.<br><br>- Dave </div></div></div>