<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, 6 Nov 2020 at 19:40, Fāng-ruì Sòng <<a href="mailto:maskray@google.com">maskray@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, Nov 6, 2020 at 2:32 AM James Henderson via llvm-dev<br>

<<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>

><br>

> Hi Alexey,<br>

><br>

> On Thu, 5 Nov 2020 at 21:02, Alexey Lapshin <<a href="mailto:avl.lapshin@gmail.com" target="_blank">avl.lapshin@gmail.com</a>> wrote:<br>

>><br>

>> Hi James,<br>

>><br>

>> On 05.11.2020 17:59, James Henderson wrote:<br>

>><br>

>> (Resending with history trimmed to avoid it getting stuck in moderator queue).<br>

>><br>

>> Hi Alexey,<br>

>><br>

>> Just an update - I identified the cause of the "Generated debug info is broken" error message when I tried to build things locally: the `outStreamer` instance is initialised with the host Triple, instead of whatever the target's triple is. For example, I build and run LLD on Windows, which means that a Windows triple will be generated, and consequently a COFF-emitting streamer will be created, rather than the ELF-emitting one I'd expect were the triple information to somehow be derived from the linker flavor/input objects etc. Hard-coding in my target triple resolved the issue (although I still got the other warnings mentioned from my game link).<br>

>><br>

>>    Thank you for the details. I have not tested this on Windows yet, but I will do so and update the patch.<br>

>><br>

>><br>

>><br>

>> I measured the performance figures using LLD patched as described, and using the same methodology as my earlier results, and got the following:<br>

>><br>

>> Link-time speed (s):<br>

>> +-----------------------------+---------------+<br>

>> | Package variant             | GC 1 (normal) |<br>

>> +-----------------------------+---------------+<br>

>> | Game (DWARF linker)         |  53.6         |<br>

>> | Game (DWARF linker, no ODR) |  63.6         |<br>

>> | Clang (DWARF linker)        | 200.6         |<br>

>> +-----------------------------+---------------+<br>

>><br>

>> Output size - Game package (MB):<br>

>> +-----------------------------+------+<br>

>> | Category                    | GC 1 |<br>

>> +-----------------------------+------+<br>

>> | DWARFLinker (total)         |  696 |<br>

>> | DWARFLinker (DWARF*)        |  429 |<br>

>> | DWARFLinker (other)         |  267 |<br>

>> | DWARFLinker no ODR (total)  |  753 |<br>

>> | DWARFLinker no ODR (DWARF*) |  485 |<br>

>> | DWARFLinker no ODR (other)  |  268 |<br>

>> +-----------------------------+------+<br>

>><br>

>> Output size - Clang (MB):<br>

>> +-----------------------------+------+<br>

>> | Category                    | GC 1 |<br>

>> +-----------------------------+------+<br>

>> | DWARFLinker (total)         | 1294 |<br>

>> | DWARFLinker (DWARF*)        |  743 |<br>

>> | DWARFLinker (other)         |  551 |<br>

>> | DWARFLinker no ODR (total)  | 1294 |<br>

>> | DWARFLinker no ODR (DWARF*) |  743 |<br>

>> | DWARFLinker no ODR (other)  |  551 |<br>

>> +-----------------------------+------+<br>

>><br>

>> *DWARF = just .debug_info, .debug_line, .debug_loc, .debug_aranges, .debug_ranges.<br>

>><br>

>> Peak Working Set Memory usage (GB):<br>

>> +-----------------------------+------+<br>

>> | Package variant             | GC 1 |<br>

>> +-----------------------------+------+<br>

>> | Game (DWARFLinker)          |  5.7 |<br>

>> | Game (DWARFLinker, no ODR)  |  5.8 |<br>

>> | Clang (DWARFLinker)         | 22.4 |<br>

>> | Clang (DWARFLinker, no ODR) | 22.5 |<br>

>> +-----------------------------+------+<br>

>><br>

>> My opinion is that, in the current state of affairs, the time costs of the DWARF Linker approach are not really practical for larger packages except on build servers: clang takes 8.8x as long as the fragmented approach and 11.2x as long as the plain approach (without the no ODR option). The size saving is certainly good: with my version of clang, the DWARF linker output is 51% of the total output size of the plain approach and 55% of that of the fragmented approach (though it is likely that further size savings are possible for the latter). The game produced reasonable size savings too, 62% and 74% respectively, but I'd be surprised if these gains were enough for people to want to use the approach in day-to-day situations, which presumably is the main use-case for smaller DWARF, due to improved debugger load times.<br>

>><br>

>> Interesting to note is that the GCC 7.5 build of clang I used for these figures produced no difference in size results between the two variants, unlike other packages. Consequently, a significant amount of time is saved for no penalty.<br>

>><br>

>> I'll be interested to see what the time results of the DWARF linker are once further improvements to it have been made.<br>

>><br>

>> Yep, the current time costs of the DWARFLinker are too high. One of the reasons is that lld handles sections in parallel, while DWARFLinker handles data sequentially. The DWARFLinker numbers could probably be improved if it were taught to handle data in parallel. Thank you for the comparison!<br>

><br>

> No problem! It was useful for me to gather the numbers for internal investigations too. Parallelisation would hopefully help, but at this point it's hard to say by how much. There are likely going to be additional time costs for fragmented DWARF too, once I fix the remaining deficiencies, as they'll require more relocations.<br>

><br>

>><br>

>> Speaking of the "Fragmented DWARF" solution, how do you estimate the memory requirements to support fragmented object files?<br>

><br>

> I'm not sure if you're referring to the memory usage at link time or the disk space required for the inputs, but I posted both those figures in my original post in this thread. If it's something else, please let me know. Based on those figures, it's clear the cost depends on the input code base, but object files were roughly 25% to 75% bigger and memory usage was roughly 50% to 100% higher. Again, both are likely to go up when I get around to fixing the remaining issues.<br>

>><br>

>> In the comments on your Lightning Talk you mentioned that it would be necessary to "update DebugInfo library to treat the fragmented sections as one continuous section". Do you think it would be cheap to implement?<br>

><br>

> I think so. I'd hope it would be possible to replace the data buffer underlying the DWARF section parsing to be able to "jump" to the next fragment (section) when it gets to the end of the previous one. I haven't experimented with this, but I wouldn't expect it to be costly in terms of code quality or performance, at least in comparison to parsing the DWARF itself.<br>

<br>

sizeof(InputSection) is 208 (sizeof(Elf64_Shdr)=64) so there is indeed<br>

a significant overhead on fragmented sections.<br>

A MergeInputSection can be split into SectionPiece, which is indeed<br>

lightweight and MarkLive can mark liveness on these pieces. However,<br>

in InputFiles.cpp we<br>

change MergeInputSection to regular if it has a relocation<br>

(toRegularSection). Using more lightweight data structures for<br>

.debug_* fragments is still challenging.<br></blockquote><div><br></div><div>Right, the overhead of additional sections is certainly a potential problem with Fragmented DWARF. I suspect this is where the majority of the cost compared to a "plain" link comes from. Indeed, I suspect it will get worse if I continue developing this concept, as I'll need to switch to debug data fragments being placed in groups together with their corresponding function/data piece, which adds yet more overhead. For other tools like llvm-dwarfdump, I doubt this cost is as significant, or at least as important, but I haven't experimented with them to confirm. I'm currently considering ways to mitigate the section header overhead in the linker, inspired by how eh_frame and mergeable sections work in LLD. One idea I had, which might also help with the overhead of -ffunction-sections/-fdata-sections, was to have a separate section that indicates the split points; the linker would then internally fragment the sections as dictated by this split-point section. I haven't explored this idea beyond that high-level concept, but it would at least save on I/O to some degree, if not on memory cost.<br></div></div></div>
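<div>To make the split-point idea a little more concrete, here is a minimal sketch, not taken from LLD or from any prototype. It assumes the split-point section is simply a list of strictly increasing offsets into one large section. The names <code>Fragment</code>, <code>splitSection</code> and <code>liveSize</code> are hypothetical, and <code>Fragment</code> is only loosely modelled on the role SectionPiece plays for mergeable sections.</div>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical lightweight record for one fragment of a larger section.
// This is an illustration only, not LLD's actual SectionPiece type.
struct Fragment {
  uint64_t offset; // start of the fragment within the parent section
  uint64_t size;   // length of the fragment in bytes
  bool live;       // would be set by a GC marking pass; false by default
};

// Splits a section of `sectionSize` bytes into fragments.
// `splitPoints` stands in for the contents of the proposed split-point
// section: strictly increasing offsets, each inside (0, sectionSize).
std::vector<Fragment> splitSection(uint64_t sectionSize,
                                   const std::vector<uint64_t> &splitPoints) {
  std::vector<Fragment> fragments;
  if (sectionSize == 0)
    return fragments;
  fragments.reserve(splitPoints.size() + 1);
  uint64_t start = 0;
  for (uint64_t point : splitPoints) {
    assert(point > start && point < sectionSize && "bad split point");
    fragments.push_back({start, point - start, false});
    start = point;
  }
  // The final fragment runs from the last split point to the section end.
  fragments.push_back({start, sectionSize - start, false});
  return fragments;
}

// Sums the sizes of the fragments that have been marked live.
uint64_t liveSize(const std::vector<Fragment> &fragments) {
  uint64_t total = 0;
  for (const Fragment &f : fragments)
    if (f.live)
      total += f.size;
  return total;
}
```

<div>For example, <code>splitSection(100, {10, 40})</code> yields three fragments covering bytes 0-9, 10-39 and 40-99. On a typical 64-bit target each record would be a few tens of bytes, compared with the 208-byte InputSection quoted above, although a real implementation would also need to track relocations per fragment, which this sketch ignores.</div>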