[PATCH] D117853: [ELF] Parallelize --compress-debug-sections=zlib

Fangrui Song via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Mon Jan 24 17:32:33 PST 2022


MaskRay added a comment.

In D117853#3267965 <https://reviews.llvm.org/D117853#3267965>, @dblaikie wrote:

> In D117853#3261870 <https://reviews.llvm.org/D117853#3261870>, @MaskRay wrote:
>
>> In D117853#3261856 <https://reviews.llvm.org/D117853#3261856>, @dblaikie wrote:
>>
>>> Is there any chance to avoid buffering the compressed output? (I guess probably not, because you need to know how large it is before you write it to the output file (if you want to parallelize writing sections, which is important no doubt))
>>
>> I have asked myself this question... Unfortunately, no. To get an accurate estimate of the sizes, we have to buffer all of the compressed output.
>> The sizes are needed to compute the sh_offset and sh_size fields of the .debug_* sections. To know a section's size we need to compress it first (or estimate it, but the compression ratio is hard to predict).
>>
>> I think pigz uses an approach that keeps only `concurrency` shards in memory, but it does not have the requirement of knowing the output size beforehand.
>
> Yeah, I guess out of scope for this change - but maybe another time. It'd break parallelism, but you could stream out a section at a time (at least for the compressed sections) and then seek back to write the sh* offset fields based on how the compression actually worked out.
>
> I guess for Split DWARF the memory savings wouldn't be that significant, though? Do you have a sense of how much memory it'd take.

The savings are still large because of .debug_line.

Here is lld from a `-DCMAKE_BUILD_TYPE=Debug -DLLVM_TARGETS_TO_BUILD=X86 -DCMAKE_CXX_FLAGS='-gdwarf-5 -gsplit-dwarf'` build of llvm-project:

  % ~/projects/bloaty/Release/bloaty lld
      FILE SIZE        VM SIZE    
   --------------  -------------- 
    38.0%   368Mi   0.0%       0    .debug_gnu_pubnames
    13.3%   129Mi  62.0%   129Mi    .text
    12.7%   123Mi   0.0%       0    .debug_line
    11.5%   111Mi   0.0%       0    .debug_gnu_pubtypes
    10.9%   105Mi   0.0%       0    .strtab
     2.8%  27.3Mi  13.1%  27.3Mi    .eh_frame
     2.4%  22.9Mi  11.0%  22.9Mi    .rodata
     2.2%  21.6Mi   0.0%       0    .debug_addr
     2.2%  21.0Mi   0.0%       0    .symtab
     1.3%  12.3Mi   5.9%  12.3Mi    .dynstr
     1.0%  9.37Mi   0.0%       0    .debug_rnglists
     0.7%  6.83Mi   3.3%  6.83Mi    .eh_frame_hdr
     0.4%  4.15Mi   2.0%  4.15Mi    .data.rel.ro
     0.3%  3.06Mi   1.5%  3.06Mi    .dynsym
     0.1%  1.02Mi   0.5%  1.02Mi    .hash
     0.1%   995Ki   0.0%       0    .debug_info
     0.1%   907Ki   0.4%   907Ki    .gnu.hash
     0.1%   558Ki   0.1%   249Ki    [24 Others]
     0.0%   364Ki   0.0%       0    .debug_str
     0.0%       0   0.2%   363Ki    .bss
     0.0%   261Ki   0.1%   261Ki    .gnu.version
   100.0%   970Mi 100.0%   208Mi    TOTAL

With --compress-debug-sections=zlib but without --gdb-index (so the huge, not-so-useful .debug_gnu_pubnames is compressed):

  % hyperfine --warmup 2 --min-runs 10 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib"
  Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib
    Time (mean ± σ):     10.756 s ±  0.025 s    [User: 10.797 s, System: 1.852 s]
    Range (min … max):   10.712 s … 10.791 s    10 runs
   
  Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib
    Time (mean ± σ):      5.487 s ±  0.047 s    [User: 10.964 s, System: 1.830 s]
    Range (min … max):    5.403 s …  5.559 s    10 runs
   
  Summary
    'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib' ran
      1.96 ± 0.02 times faster than 'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib'

With --gdb-index:

  % hyperfine --warmup 2 --min-runs 10 "numactl -C 20-27 "{/tmp/c/0,/tmp/c/1}" -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index"
  Benchmark 1: numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index
    Time (mean ± σ):      6.981 s ±  0.020 s    [User: 9.516 s, System: 1.979 s]
    Range (min … max):    6.945 s …  7.015 s    10 runs
   
  Benchmark 2: numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index
    Time (mean ± σ):      5.350 s ±  0.037 s    [User: 9.623 s, System: 1.935 s]
    Range (min … max):    5.293 s …  5.399 s    10 runs
   
  Summary
    'numactl -C 20-27 /tmp/c/1 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index' ran
      1.30 ± 0.01 times faster than 'numactl -C 20-27 /tmp/c/0 -flavor gnu @response.txt --threads=8 -o lld --compress-debug-sections=zlib --gdb-index'



> Another direction to go could be to do compressed data concatenation - if the compression algorithm supports concatenation, you could lose some size benefits and gain speed (like lld's sliding scale of string deduplication) by just concatenating the compressed sections together - predictable size and you could write the updated compressed section header based on the input sections headers.

The concatenation approach is what is used here :)
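
For readers curious about the mechanics, here is a minimal serial sketch of the concatenation scheme, assuming zlib. The helper names (`deflateShard`, `zlibCompressSharded`) are made up for illustration; the actual patch compresses the shards in parallel and, I believe, merges per-shard checksums with adler32_combine rather than computing one serially as done here:

  #include <zlib.h>
  #include <algorithm>
  #include <cassert>
  #include <cstdint>
  #include <vector>
  
  // Hypothetical helper (not lld's actual code): compress one shard as a raw
  // deflate stream (windowBits = -15, i.e. no zlib header/trailer). Non-last
  // shards end with Z_SYNC_FLUSH, which pads the stream to a byte boundary so
  // shards can simply be concatenated; the last shard ends with Z_FINISH.
  static std::vector<uint8_t> deflateShard(const uint8_t *in, size_t size,
                                           int flush) {
    z_stream s = {};
    int rc = deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED, -15, 8,
                          Z_DEFAULT_STRATEGY);
    assert(rc == Z_OK);
    // deflateBound assumes a single-call Z_FINISH; add slack for the
    // sync-flush marker.
    std::vector<uint8_t> out(deflateBound(&s, size) + 16);
    s.next_in = const_cast<uint8_t *>(in);
    s.avail_in = static_cast<uInt>(size);
    s.next_out = out.data();
    s.avail_out = static_cast<uInt>(out.size());
    rc = deflate(&s, flush);
    assert(rc == (flush == Z_FINISH ? Z_STREAM_END : Z_OK));
    out.resize(out.size() - s.avail_out);
    deflateEnd(&s);
    return out;
  }
  
  // Serial for brevity; the patch compresses the shards in parallel, and the
  // final buffer size is the sum of the shard sizes, known before any byte is
  // written to the output file.
  std::vector<uint8_t> zlibCompressSharded(const uint8_t *data, size_t size,
                                           size_t numShards) {
    size_t shardSize = (size + numShards - 1) / numShards;
    uLong checksum = adler32(0, nullptr, 0);
    std::vector<uint8_t> out = {0x78, 0x9c}; // zlib header (CMF/FLG)
    for (size_t i = 0; i < size; i += shardSize) {
      size_t len = std::min(shardSize, size - i);
      bool last = i + len == size;
      std::vector<uint8_t> shard =
          deflateShard(data + i, len, last ? Z_FINISH : Z_SYNC_FLUSH);
      out.insert(out.end(), shard.begin(), shard.end());
      checksum = adler32(checksum, data + i, static_cast<uInt>(len));
    }
    for (int i = 3; i >= 0; --i) // big-endian Adler-32 trailer
      out.push_back(static_cast<uint8_t>(checksum >> (8 * i)));
    return out;
  }

The key property is that each non-final raw stream ends byte-aligned without a BFINAL block, so a decompressor reads the concatenation as one continuous deflate stream; the total output size is just the sum of the shard sizes plus 6 bytes of header and trailer.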

> Though I guess most of the DWARF sections remaining in the objects/linked binary when using Split DWARF require relocations to be applied, so that requires decompressing/recompressing anyway... :/

The end of https://maskray.me/blog/2022-01-23-compressed-debug-sections#linkers discusses why avoiding the buffer is tricky and not generic enough.
Updating section headers afterwards has the problem that the output file size is unknown beforehand, so we cannot mmap the output in read-write mode.
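
To make the layout dependency concrete: each section's sh_offset depends on the sizes of everything placed before it, and the running total is the file size needed for the mmap. A hypothetical sketch (`OutputSection`, `alignTo`, and `assignOffsets` are illustrative names, not lld's exact types):

  #include <cstdint>
  #include <vector>
  
  struct OutputSection {
    uint64_t addralign = 1; // sh_addralign; assumed to be a power of two
    uint64_t size = 0;      // sh_size; for SHF_COMPRESSED, the compressed size
    uint64_t offset = 0;    // sh_offset
  };
  
  static uint64_t alignTo(uint64_t v, uint64_t align) {
    return (v + align - 1) & ~(align - 1);
  }
  
  // Assign file offsets. Every offset depends on the sizes of all earlier
  // sections, so compressed sizes must be known (i.e. buffered) before any
  // section data can be placed. The returned value is the total file size,
  // which must be known up front to mmap the output read-write.
  uint64_t assignOffsets(std::vector<OutputSection> &sections, uint64_t off) {
    for (OutputSection &sec : sections) {
      sec.offset = off = alignTo(off, sec.addralign);
      off += sec.size;
    }
    return off;
  }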


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117853/new/

https://reviews.llvm.org/D117853


