[PATCH] D117853: [ELF] Parallelize --compress-debug-sections=zlib

David Blaikie via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jan 24 18:14:30 PST 2022


dblaikie added a comment.

In D117853#3268050 <https://reviews.llvm.org/D117853#3268050>, @MaskRay wrote:

> In D117853#3268030 <https://reviews.llvm.org/D117853#3268030>, @dblaikie wrote:
>
>> In D117853#3268012 <https://reviews.llvm.org/D117853#3268012>, @MaskRay wrote:
>>
>>> In D117853#3267965 <https://reviews.llvm.org/D117853#3267965>, @dblaikie wrote:
>>>
>>>> In D117853#3261870 <https://reviews.llvm.org/D117853#3261870>, @MaskRay wrote:
>>>>
>>>>> In D117853#3261856 <https://reviews.llvm.org/D117853#3261856>, @dblaikie wrote:
>>>>>
>>>>>> Is there any chance to avoid buffering the compressed output? (I guess probably not, because you need to know how large it is before you write it to the output file (if you want to parallelize writing sections, which is important no doubt))
>>>>>
>>>>> I have asked myself this question... Unfortunately no. To have an accurate estimate of the sizes, we have to buffer all compressed output.
>>>>> It's needed to compute the sh_offset and sh_size fields of a .debug_* section. To know the size we need to compress the section first (or estimate, but the compression ratio is not easy to estimate).
>>>>>
>>>>> I think pigz uses an approach to only keep `concurrency` shards, but it does not have the requirement to know the output size beforehand.
>>>>
>>>> Yeah, I guess out of scope for this change - but maybe another time. It'd break parallelism, but you could stream out a section at a time (at least for the compressed sections) and then seek back to write the sh* offset fields based on how the compression actually worked out.
>>>>
>>>> I guess for Split DWARF the memory savings wouldn't be that significant, though? Do you have a sense of how much memory it'd take?
>>>
>>> The saving is still large because of .debug_line.
>>
>> I mostly meant the memory savings that might be available if we could avoid caching compressed debug info output sections. Looking at the numbers you posted - assuming lld's internal data structures don't use much memory compared to the output size, and assuming you're writing to tmpfs so the output counts as memory usage - that's still around half the output file size again as memory usage for compressed output section buffers, so a possible 30% reduction in memory usage or so... which seems pretty valuable, but hard to achieve for sure.
>
> There will be some memory savings but I am speculating that it is small.
> My rationale is that `zlib::compress` allocates a compressed buffer whose size is a bit larger than the input size (zlib `deflateBound`).
> (This is actually a saving that many projects do not realize - jdk, ffmpeg, etc.)
> This patch switches to half by default but **I see a very small memory usage decrease** (I don't remember clearly, but definitely less than 2%).
> So I speculate that even if I drop the output buffer entirely, the saving won't be large.
> The likely reason is that the memory just overlaps some data structures allocated by previous passes.
> I haven't used a heap profiler to look into it more deeply.

Yeah, might be interesting to know where peak linker memory usage is - if this isn't at the peak point, that's fair - less to worry about.
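As background for the deflateBound point quoted above: a compression scratch buffer has to be sized slightly *larger* than the input, because DEFLATE output can exceed the input size for incompressible data. A minimal Python check of that worst case (Python's `zlib` module wraps the same library lld uses; this is an illustration, not lld's code):

```python
import os
import zlib

# Incompressible input: DEFLATE falls back to stored blocks, so the
# output is a bit larger than the input (stored-block headers plus the
# 2-byte zlib header and 4-byte Adler-32 trailer). This is why a
# deflateBound-sized buffer must exceed the input size.
data = os.urandom(1 << 16)
compressed = zlib.compress(data, 9)
assert len(compressed) > len(data)
```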

>>>> Another direction to go could be to do compressed data concatenation - if the compression algorithm supports concatenation, you could lose some size benefits and gain speed (like lld's sliding scale of string deduplication) by just concatenating the compressed sections together - predictable size and you could write the updated compressed section header based on the input sections headers.
>>>
>>> The concatenation approach is what is used here :)
>>
>> Ah, sorry, I meant concatenation of the input sections - no need to decompress or recompress, but that only applies if there are no relocations or other changes to apply to the data.
>
> Oh, you mean compressing input sections individually and then concatenating them.
> I've thought about this.
> One big issue is that initializing zlib data structures takes time.
> If we create one z_stream for every input section, the overhead may be too high.

Ah, sorry, no, I meant taking the already-compressed input sections and writing them straight to the output, without the linker ever decompressing or recompressing the data. Which, yeah, only applies if there are no relocations or other changes to apply - so it's more relevant to dwp (which is what I mostly have in mind) than to lld when using Split DWARF. (If you're not using Split DWARF but are using DWARFv5, there might be more opportunities for DWARF sections that have no relocations.) Though some sections have no relocations even with Split DWARF - .debug_rnglists, for instance.
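The shard-and-concatenate scheme discussed earlier in the thread (compress pieces independently, then splice the raw DEFLATE fragments into one valid stream, as pigz does) can be sketched in Python; all names here are illustrative, and this is a sketch of the idea, not lld's implementation:

```python
import zlib

def compress_shards(shards):
    """Compress each shard independently and concatenate the raw
    DEFLATE fragments into one stream a single inflater can decode."""
    out = []
    last = len(shards) - 1
    for i, shard in enumerate(shards):
        # wbits=-15: raw DEFLATE, no per-fragment zlib header/trailer.
        c = zlib.compressobj(level=6, wbits=-15)
        out.append(c.compress(shard))
        # Z_SYNC_FLUSH ends the fragment on a byte boundary with
        # BFINAL=0, so the next fragment can follow immediately;
        # only the last fragment terminates the stream (Z_FINISH).
        out.append(c.flush(zlib.Z_FINISH if i == last else zlib.Z_SYNC_FLUSH))
    return b"".join(out)

shards = [b"alpha " * 64, b"beta " * 64, b"gamma " * 64]
blob = compress_shards(shards)
d = zlib.decompressobj(wbits=-15)
assert d.decompress(blob) + d.flush() == b"".join(shards)
```

The per-shard `compressobj` here also shows the overhead MaskRay mentions: each fragment pays for a fresh z_stream initialization, which is why very small shards (e.g. one per input section) may not pay off.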


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117853/new/

https://reviews.llvm.org/D117853
