[llvm-dev] Fragmented DWARF

Mon Oct 12 06:41:21 PDT 2020

Hi all,

At the recent LLVM developers' meeting, I gave a lightning talk on an
approach to reducing the amount of dead debug data left in an executable
following operations such as --gc-sections and duplicate COMDAT removal. In
that talk, I showed some figures based on linking a game that had been
built by our downstream clang port and fragmented using the described
approach. Since recording the talk, I have run the same experiment on a
clang package (this time built with a GCC version). The figures for both
packages are below:

Link time (s):
+--------------------+-------+---------------+------+------+------+------+------+
| Package variant    | No GC | GC 1 (normal) | GC 2 | GC 3 | GC 4 | GC 5 | GC 6 |
+--------------------+-------+---------------+------+------+------+------+------+
| Game (plain)       |  4.5  |  4.9          |  4.2 |  3.6 |  3.4 |  3.3 |  3.2 |
| Game (fragmented)  | 11.1  | 11.8          |  9.7 |  8.6 |  7.9 |  7.7 |  7.5 |
| Clang (plain)      | 13.9  | 17.9          | 17.0 | 16.7 | 16.3 | 16.2 | 16.1 |
| Clang (fragmented) | 18.6  | 22.8          | 21.6 | 21.1 | 20.8 | 20.5 | 20.2 |
+--------------------+-------+---------------+------+------+------+------+------+

Output size - Game package (MB):
+---------------------+-------+------+------+------+------+------+------+
| Category            | No GC | GC 1 | GC 2 | GC 3 | GC 4 | GC 5 | GC 6 |
+---------------------+-------+------+------+------+------+------+------+
| Plain (total)       | 1149  | 1121 | 1017 |  965 |  938 |  930 |  928 |
| Plain (DWARF*)      |  845  |  845 |  845 |  845 |  845 |  845 |  845 |
| Plain (other)       |  304  |  276 |  172 |  120 |   93 |   85 |   82 |
| Fragmented (total)  | 1044  |  940 |  556 |  373 |  287 |  263 |  255 |
| Fragmented (DWARF*) |  740  |  664 |  384 |  253 |  194 |  178 |  173 |
| Fragmented (other)  |  304  |  276 |  172 |  120 |   93 |   85 |   82 |
+---------------------+-------+------+------+------+------+------+------+

Output size - Clang (MB):
+---------------------+-------+------+------+------+------+------+------+
| Category            | No GC | GC 1 | GC 2 | GC 3 | GC 4 | GC 5 | GC 6 |
+---------------------+-------+------+------+------+------+------+------+
| Plain (total)       | 2596  | 2546 | 2406 | 2332 | 2293 | 2273 | 2251 |
| Plain (DWARF*)      | 1979  | 1979 | 1979 | 1979 | 1979 | 1979 | 1979 |
| Plain (other)       |  616  |  567 |  426 |  353 |  314 |  294 |  272 |
| Fragmented (total)  | 2397  | 2346 | 2164 | 2069 | 2017 | 1990 | 1963 |
| Fragmented (DWARF*) | 1780  | 1780 | 1738 | 1716 | 1703 | 1696 | 1691 |
| Fragmented (other)  |  616  |  567 |  426 |  353 |  314 |  294 |  272 |
+---------------------+-------+------+------+------+------+------+------+

*DWARF size == total size of .debug_info + .debug_line + .debug_ranges +
.debug_aranges + .debug_loc
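
To make the size tables easier to compare, here is a small sketch that works
out the percentage by which the fragmented DWARF is smaller than the plain
DWARF at each GC level. The numbers are copied from the tables above; the
names DWARF_MB and dwarf_saving_percent are just illustrative and are not
part of the script in the review.

```python
# DWARF sizes in MB, copied from the two output-size tables above.
GC_LEVELS = ["No GC", "GC 1", "GC 2", "GC 3", "GC 4", "GC 5", "GC 6"]
DWARF_MB = {
    "Game": {
        "plain": [845] * 7,
        "fragmented": [740, 664, 384, 253, 194, 178, 173],
    },
    "Clang": {
        "plain": [1979] * 7,
        "fragmented": [1780, 1780, 1738, 1716, 1703, 1696, 1691],
    },
}


def dwarf_saving_percent(package):
    """Return the % reduction in DWARF size (fragmented vs plain) per GC level."""
    sizes = DWARF_MB[package]
    return [
        round(100.0 * (plain - frag) / plain, 1)
        for plain, frag in zip(sizes["plain"], sizes["fragmented"])
    ]


for package in DWARF_MB:
    for level, pct in zip(GC_LEVELS, dwarf_saving_percent(package)):
        print(f"{package:5} {level:5} {pct:5.1f}% smaller DWARF")
```

For the game, the saving grows from roughly 12% with no GC to roughly 80% at
GC 6, whereas for clang it stays between roughly 10% and 15%.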

Additionally, I have posted https://reviews.llvm.org/D89229, which provides
the Python script and linker patches used to reproduce the above results on
my machine. The GC 1 to GC 6 columns correspond to values of 1, 0.8, 0.6,
0.4, 0.2 and 0 respectively for the --mark-live-pc linker option added in
that patch.

During the conference, I was asked what the impact on memory usage and
input size was. I've summarised these below:

Input file size total (GB):
+--------------------+------------+
| Package variant    | Total Size |
+--------------------+------------+
| Game (plain)       |     2.9    |
| Game (fragmented)  |     4.2    |
| Clang (plain)      |    10.9    |
| Clang (fragmented) |    12.3    |
+--------------------+------------+

Peak Working Set Memory usage (GB):
+--------------------+-------+------+
| Package variant    | No GC | GC 1 |
+--------------------+-------+------+
| Game (plain)       |  4.3  |  4.7 |
| Game (fragmented)  |  8.9  |  8.6 |
| Clang (plain)      | 15.7  | 15.6 |
| Clang (fragmented) | 19.4  | 19.2 |
+--------------------+-------+------+
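
As a rough way of reading the cost side, the sketch below computes the
fragmented-to-plain ratio for link time, input size and peak memory. It uses
the "No GC" column where a table has more than one column. The figures are
copied from the tables above; COSTS and overhead_ratio are illustrative names
only.

```python
# (plain, fragmented) pairs copied from the tables above.
# Link time and peak memory use the "No GC" column.
COSTS = {
    "Game": {
        "link time (s)": (4.5, 11.1),
        "input size (GB)": (2.9, 4.2),
        "peak memory (GB)": (4.3, 8.9),
    },
    "Clang": {
        "link time (s)": (13.9, 18.6),
        "input size (GB)": (10.9, 12.3),
        "peak memory (GB)": (15.7, 19.4),
    },
}


def overhead_ratio(plain, fragmented):
    """Return how many times larger the fragmented figure is than the plain one."""
    return fragmented / plain


for package, metrics in COSTS.items():
    for metric, (plain, fragmented) in metrics.items():
        ratio = overhead_ratio(plain, fragmented)
        print(f"{package:5} {metric:16} x{ratio:.2f}")
```

On these numbers the game pays proportionally more than clang on every
measure, with both link time and peak memory more than doubling.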

I'm keen to hear what people's feedback is, and also interested to see what
results others might see by running this experiment on other input
packages. Also, if anybody has any alternative ideas that meet the goals
listed below, I'd love to hear them!

To reiterate some key goals of fragmented DWARF, similar to what I said in
the presentation:
1) Devise a scheme that gives significant size savings without being too
costly. It's clear from just the two packages I've tried this on that there
is a fairly hefty link time performance cost, although the exact cost
depends on the nature of the input package. On the other hand, depending on
the nature of the input package, there can also be some big gains.
2) Devise a scheme that doesn't require any linker knowledge of DWARF. The
current approach doesn't quite achieve this properly due to the slight
misuse of SHF_LINK_ORDER, but I expect that a pivot to using non-COMDAT
group sections should solve this problem.
3) Provide some kind of halfway house between simply writing tombstone
values into dead DWARF and fully parsing the DWARF to reoptimise it and
discard the dead bits.

I'm hopeful that changes could be made to the linker to improve the
link-time cost. A significant amount of the link time seems to be spent
creating the input sections. An alternative would be to devise a scheme
that avoids literally splitting the data into separate sections with their
own section headers, in favour of some sort of list of split points that
the linker uses to split things up (a bit like it already does for
.eh_frame or mergeable sections).
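
To illustrate the split-point idea only, here is a minimal sketch that treats
a section as a single blob plus a list of byte offsets, splits the blob at
those offsets, and keeps only the pieces marked live. This is not how the
linker or the patch in the review works: split_section and keep_live are
hypothetical names, the liveness flags are supplied by hand, and relocations
and cross-references between pieces are ignored entirely.

```python
from typing import List, Sequence


def split_section(data: bytes, split_points: Sequence[int]) -> List[bytes]:
    """Split raw section contents at the given byte offsets."""
    points = sorted(split_points)
    if any(p <= 0 or p >= len(data) for p in points) or len(set(points)) != len(points):
        raise ValueError("split points must be unique offsets strictly inside the section")
    bounds = [0, *points, len(data)]
    return [data[start:end] for start, end in zip(bounds, bounds[1:])]


def keep_live(pieces: Sequence[bytes], live: Sequence[bool]) -> bytes:
    """Concatenate only the pieces marked live, discarding the rest."""
    if len(pieces) != len(live):
        raise ValueError("need exactly one liveness flag per piece")
    return b"".join(piece for piece, is_live in zip(pieces, live) if is_live)


# Hypothetical example: a 12-byte section made of three pieces,
# where the middle piece describes dead code and is dropped.
section = b"AAAABBBBBBCC"
pieces = split_section(section, [4, 10])
output = keep_live(pieces, [True, False, True])
print(pieces, output)
```

The point of the sketch is that the input stays as one section with a small
side table of offsets, so the cost of creating one input section per
fragment would be avoided, at the price of the linker needing to understand
the side table.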