[PATCH] D96035: [WIP][dsymutil][DWARFlinker] implement separate multi-thread processing for compile units.

Tue Sep 7 03:16:15 PDT 2021

avl added a comment.

>>   Current size/performance results are(compared with llvm upstream dsymutil):
>>   
>>       .debug_info table of 40% less in size.
>
> Out of curiosity - what's the total size of the clang dsym file with/without this patch/feature?

dsymutil --use-dlnext clang
ls -lh clang.dSYM/Contents/Resources/DWARF/clang
758M

dsymutil clang
ls -lh clang.dSYM/Contents/Resources/DWARF/clang
955M

i.e. the total size of the clang dSYM file becomes about 20% smaller with the --use-dlnext option.
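As a quick arithmetic check of the figure above, here is a minimal sketch that recomputes the reduction from the two `ls -lh` sizes. The variable and function names are illustrative only and are not part of the patch.

```python
# Sizes reported by `ls -lh` above, in MB.
upstream_mb = 955  # dsymutil clang
dlnext_mb = 758    # dsymutil --use-dlnext clang


def percent_smaller(baseline, candidate):
    """Return how much smaller `candidate` is than `baseline`, in percent."""
    return (baseline - candidate) / baseline * 100.0


reduction = percent_smaller(upstream_mb, dlnext_mb)
print(f"dSYM size reduction: {reduction:.1f}%")
```

The result rounds to the ~20% quoted above.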

>>   single-threaded mode works 1.7x slower.
>>   multi-thread mode works up to 2x faster.
>
> With how many threads? (& what sort of setup are the speed measurements done with - writing to (& maybe reading from) a fast SSD or temporary/RAM filesystem?)

Measurements were done on a Darwin 24-core, 64 GB system using a regular disk (not an SSD and not a temporary/RAM filesystem). The ~2x improvement is for 16 cores.

  |---------------------------------------------------------------------|
  |       |           dsymutil           |     dsymutil --use-dlnext    |
  |-------|------------------------------|------------------------------|
  |       |exec time|  memory  | DWARF(*)|exec time|  memory  |  DWARF  |
  |       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
  |-------|------------------------------|------------------------------|
  |threads|         |          |         |         |          |         |
  |-------|------------------------------|------------------------------|
  |   1   |   155   |   15.8   |   465   |   269   |   16.1   |   273   |
  |-------|------------------------------|------------------------------|
  |   2   |    99   |   17.5   |   465   |   154   |   16.1   |   273   |
  |-------|------------------------------|------------------------------|
  |   4   |    99   |   17.5   |   465   |    96   |   16.5   |   273   |
  |-------|------------------------------|------------------------------|
  |   8   |    99   |   17.5   |   465   |    65   |   16.5   |   273   |
  |-------|------------------------------|------------------------------|
  |  16   |    99   |   17.5   |   465   |    52   |   16.5   |   273   |
  |---------------------------------------------------------------------|
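The "1.7x slower" and "up to 2x faster" figures can be read off the exec time columns of the table above. The following is a small sketch of that arithmetic, with the rows copied by hand from the table; the names are hypothetical and nothing here is produced by dsymutil itself.

```python
# (threads, upstream exec time in sec, --use-dlnext exec time in sec),
# copied from the table above.
ODR_TIMES = [(1, 155, 269), (2, 99, 154), (4, 99, 96), (8, 99, 65), (16, 99, 52)]

# DWARF column of the same table, in MB (identical for every thread count).
UPSTREAM_DWARF_MB = 465
DLNEXT_DWARF_MB = 273


def relative_speed(upstream_sec, dlnext_sec):
    """Ratio > 1 means --use-dlnext is faster than upstream; < 1 means slower."""
    return upstream_sec / dlnext_sec


ratios = {threads: relative_speed(up, dl) for threads, up, dl in ODR_TIMES}
for threads, ratio in ratios.items():
    print(f"{threads:>2} threads: --use-dlnext runs at {ratio:.2f}x upstream speed")

dwarf_reduction = (UPSTREAM_DWARF_MB - DLNEXT_DWARF_MB) / UPSTREAM_DWARF_MB * 100.0
print(f"DWARF column reduction: {dwarf_reduction:.1f}%")
```

With these numbers the crossover falls between 2 and 4 threads: --use-dlnext is still slower than upstream at 2 threads and slightly faster at 4.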

> Alternatively: perhaps the new architecture could present a different tradeoff: what's the performance of parallelism (on a fast storage) without any deduplication? Can it run faster but produce a larger output (or is the output such a bottleneck that that isn't the case)?

When type deduplication is not done, the performance numbers (for regular storage) look better. Compared with upstream dsymutil in the no-ODR case, the improvement is seen starting from 2 cores:

  |---------------------------------------------------------------------|
  |       |           dsymutil           |     dsymutil --use-dlnext    |
  |-------|------------------------------|------------------------------|
  |       |exec time|  memory  |  DWARF  |exec time|  memory  |  DWARF  |
  |       |   sec   |    GB    |    MB   |   sec   |    GB    |   MB    |
  |-------|------------------------------|------------------------------|
  |threads|         |          |         |         |          |         |
  |-------|------------------------------|------------------------------|
  |   1   |   224   |   15.9   |   1400  |   250   |   19.5   |   1400  |
  |-------|------------------------------|------------------------------|
  |   2   |   214   |   17.7   |   1400  |   144   |   19.5   |   1400  |
  |-------|------------------------------|------------------------------|
  |   4   |   214   |   17.7   |   1400  |    90   |   19.5   |   1400  |
  |-------|------------------------------|------------------------------|
  |   8   |   214   |   17.7   |   1400  |    62   |    20    |   1400  |
  |-------|------------------------------|------------------------------|
  |  16   |   214   |   17.7   |   1400  |    51   |    20    |   1400  |
  |---------------------------------------------------------------------|

Output is not a bottleneck here. The execution time depends directly on the amount of source DWARF that has to be analyzed and cloned. Upstream dsymutil in ODR deduplication mode skips analyzing/cloning some DIEs, so it works faster. When it does not skip types, as in the no-ODR case, it has similar single-thread performance.
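To put the no-deduplication table in terms of scaling, here is a rough sketch that computes the speedup and parallel efficiency of --use-dlnext relative to its own single-thread run. The rows are copied by hand from the table, and the helper name is an assumption made for illustration.

```python
# (threads, --use-dlnext exec time in sec) without type deduplication,
# copied from the table above.
NO_ODR_DLNEXT = [(1, 250), (2, 144), (4, 90), (8, 62), (16, 51)]


def scaling(times):
    """Return {threads: (speedup, efficiency)} relative to the first row."""
    base = times[0][1]
    result = {}
    for threads, seconds in times:
        speedup = base / seconds
        # Efficiency is the speedup divided by the number of threads used.
        result[threads] = (speedup, speedup / threads)
    return result


table = scaling(NO_ODR_DLNEXT)
for threads, (speedup, efficiency) in table.items():
    print(f"{threads:>2} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

For comparison, the upstream dsymutil column in the same table stays at 214 sec from 2 threads onward.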

Anyway, there is an additional set of things which might improve performance, memory requirements, and output size for all modes (single-thread, multi-thread, ODR, no-ODR).

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D96035/new/

https://reviews.llvm.org/D96035