[PATCH] D117250: [lld-macho] Mention string literal deduplication as a difference from ld64

Thu Jan 13 17:09:57 PST 2022

MaskRay added a comment.

Non-string constant deduplication isn't that useful.
Tail string merge has very little benefit.

In ELF land, I have some brief notes at https://maskray.me/blog/2021-12-19-why-isnt-ld.lld-faster#shf_merge-duplicate-elimination on SHF_MERGE duplicate elimination and compressed debug info.
ld.lld -O1 (default) performs SHF_MERGE|SHF_STRINGS deduplication.  It has a huge impact on the size of `.debug_str`.

  % ld.lld @response.txt -o clang.0 -O0
  % ld.lld @response.txt -o clang.1 -O1
  % stat -c %s clang.0 clang.1
  2126774248
  1389546048

  % ~/projects/bloaty/Release/bloaty clang.0 -- clang.1
      FILE SIZE        VM SIZE    
   --------------  -------------- 
    +286%  +661Mi  [ = ]       0    .debug_str
     +87% +41.3Mi   +87% +41.3Mi    .rodata
   +18e4%  +266Ki  [ = ]       0    .comment
    +0.0%      +8  [ = ]       0    .eh_frame
    -0.0%      -5  [ = ]       0    .debug_line
     +53%  +703Mi   +16% +41.3Mi    TOTAL

  % hyperfine --warmup 2 --min-runs 10 "numactl -C 20-27 /tmp/out/custom2/bin/ld.lld "{-O0,-O1}" @response.txt --threads=8 -o clang"
  Benchmark 1: numactl -C 20-27 /tmp/out/custom2/bin/ld.lld -O0 @response.txt --threads=8 -o clang
    Time (mean ± σ):      5.006 s ±  0.032 s    [User: 5.289 s, System: 3.048 s]
    Range (min … max):    4.958 s …  5.079 s    10 runs

  Benchmark 2: numactl -C 20-27 /tmp/out/custom2/bin/ld.lld -O1 @response.txt --threads=8 -o clang
    Time (mean ± σ):      6.030 s ±  0.044 s    [User: 11.633 s, System: 2.822 s]
    Range (min … max):    5.936 s …  6.066 s    10 runs

  Summary
    'numactl -C 20-27 /tmp/out/custom2/bin/ld.lld -O0 @response.txt --threads=8 -o clang' ran
      1.20 ± 0.01 times faster than 'numactl -C 20-27 /tmp/out/custom2/bin/ld.lld -O1 @response.txt --threads=8 -o clang'

`.debug_str` is ~3.86x as large (1 + 286% = 3.86) if you suppress deduplication.
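
To make concrete what the -O1 run above saves, here is a minimal single-threaded sketch of string deduplication: each distinct null-terminated string is stored once and every input string is mapped to the offset of its single copy. This is not lld's code. `MergedStrings` and `dedupStrings` are hypothetical names made up for illustration, and the sketch assumes the input has already been split into individual strings.

```cpp
#include <cstdint>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical helper types, not part of lld.
struct MergedStrings {
  std::string data;              // contents of the merged output section
  std::vector<uint64_t> offsets; // output offset for each input string
};

// Store each distinct string once (with its terminating NUL) and record,
// for every input string, where its single copy lives in the output.
// The string_view keys point into the caller's storage, which must outlive
// this call.
MergedStrings dedupStrings(const std::vector<std::string_view> &inputs) {
  MergedStrings out;
  std::unordered_map<std::string_view, uint64_t> seen;
  for (std::string_view s : inputs) {
    // try_emplace only inserts if the string has not been seen before.
    auto [it, inserted] = seen.try_emplace(s, out.data.size());
    if (inserted) {
      out.data.append(s);
      out.data.push_back('\0');
    }
    out.offsets.push_back(it->second);
  }
  return out;
}
```

With the inputs "foo", "bar", "foo", the output holds "foo\0bar\0" and the third string reuses offset 0. The hash map lookup per string is the extra CPU work that -O0 skips.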

Some users prefer smaller output and others prefer faster links.
If you parallelize string deduplication, the speed difference may be small.
(I tried a poor man's concurrent hash map <https://gist.github.com/MaskRay/4f274c978df684c870aec0254f844487>, but didn't find a noticeable improvement.)
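
For reference, one way to parallelize deduplication without a concurrent hash map is to shard strings by hash so that each thread owns a private map. The sketch below illustrates that idea only. It is not the code from the gist above and not lld's implementation; `parallelDedupOffsets` and its parameters are hypothetical, and it assumes `numShards > 0` and one thread per shard.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: returns the output offset of each input string and
// stores the merged section size in totalSize. Assumes numShards > 0.
std::vector<uint64_t>
parallelDedupOffsets(const std::vector<std::string_view> &inputs,
                     unsigned numShards, uint64_t &totalSize) {
  std::hash<std::string_view> hasher;
  std::vector<size_t> hashes(inputs.size());
  for (size_t i = 0; i < inputs.size(); ++i)
    hashes[i] = hasher(inputs[i]);

  std::vector<std::unordered_map<std::string_view, uint64_t>> maps(numShards);
  std::vector<uint64_t> shardSize(numShards, 0);
  std::vector<uint64_t> offsets(inputs.size());

  // Phase 1: each thread scans all strings but only handles those whose
  // hash falls in its shard, so no map or counter is shared between threads.
  std::vector<std::thread> workers;
  for (unsigned shard = 0; shard < numShards; ++shard)
    workers.emplace_back([&, shard] {
      for (size_t i = 0; i < inputs.size(); ++i) {
        if (hashes[i] % numShards != shard)
          continue;
        auto [it, inserted] =
            maps[shard].try_emplace(inputs[i], shardSize[shard]);
        if (inserted)
          shardSize[shard] += inputs[i].size() + 1; // +1 for the NUL
        offsets[i] = it->second; // shard-relative for now
      }
    });
  for (std::thread &t : workers)
    t.join();

  // Phase 2: lay the shards out back to back and rebase the offsets.
  std::vector<uint64_t> base(numShards, 0);
  totalSize = 0;
  for (unsigned s = 0; s < numShards; ++s) {
    base[s] = totalSize;
    totalSize += shardSize[s];
  }
  for (size_t i = 0; i < inputs.size(); ++i)
    offsets[i] += base[hashes[i] % numShards];
  return offsets;
}
```

The trade-off is that every thread still walks the whole input, so total CPU time goes up even if wall time goes down. Whether that pays off depends on the input, which is why it needs benchmarking.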

---

Perhaps I can contribute to parallelizing DeduplicatedCStringSection::finalizeContents? :)
That is, if I can get cbdr to work (https://reviews.llvm.org/D114735#3236110). Currently it seems to always print the help message:

  % cbdr -V  
  cbdr 0.2.3
  Tools for comparitive benchmarking

  USAGE:
      cbdr <SUBCOMMAND>

  FLAGS:
      -h, --help       Prints help information
      -V, --version    Prints version information

  SUBCOMMANDS:
      analyze    For each pair of benchmarks (x and y), shows, for each metric (x̄ and ȳ), the CI of (ȳ - x̄) / x̄
      help       Prints this message or the help of the given subcommand(s)
      plot       Takes CSV data on stdin and produces a vega-lite plot specification on stdout
      sample     Repeatedly runs benchmarks chosen at random and prints results as CSV

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D117250/new/

https://reviews.llvm.org/D117250