[all-commits] [llvm/llvm-project] 3b4d80: [ELF] Parallelize writes of different OutputSections

Wed Aug 24 09:40:21 PDT 2022

  Branch: refs/heads/main
  Home:   https://github.com/llvm/llvm-project
  Commit: 3b4d800911b52ae23da1a1e3f9105f53d8053397
      https://github.com/llvm/llvm-project/commit/3b4d800911b52ae23da1a1e3f9105f53d8053397
  Author: Fangrui Song <i at maskray.me>
  Date:   2022-08-24 (Wed, 24 Aug 2022)

  Changed paths:
    M lld/ELF/OutputSections.cpp
    M lld/ELF/OutputSections.h
    M lld/ELF/Writer.cpp
    M lld/test/ELF/arm-thumb-interwork-notfunc.s
    M lld/test/ELF/hexagon-jump-error.s
    M lld/test/ELF/linkerscript/overlapping-sections.s
    M llvm/include/llvm/Support/Parallel.h
    M llvm/lib/Support/Parallel.cpp

  Log Message:
  -----------
  [ELF] Parallelize writes of different OutputSections

We currently process one OutputSection at a time and for each OutputSection
write contained input sections in parallel. This strategy does not leverage
multi-threading well. Instead, parallelize writes of different OutputSections.

The default TaskSize for parallelFor often leads to inferior sharding, so we
prepare the tasks in the caller instead.

* Move llvm::parallel::detail::TaskGroup to llvm::parallel::TaskGroup
* Add llvm::parallel::TaskGroup::execute.
* Change writeSections to declare TaskGroup and pass it to writeTo.
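
The shape of the change can be illustrated with a standalone sketch. This is
not the lld code: SimpleTaskGroup, InputChunk, OutSection, and the
taskSizeLimit parameter are hypothetical names invented for illustration, and
the task group here naively starts one std::thread per task rather than using
LLVM's thread pool. It only shows the idea of a single task group shared
across all output sections, with the caller deciding how input sections are
batched into tasks.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a task group: each task runs on its own thread
// and all tasks are joined when the group is destroyed.
class SimpleTaskGroup {
  std::vector<std::thread> threads;

public:
  void execute(std::function<void()> fn) {
    threads.emplace_back(std::move(fn));
  }
  ~SimpleTaskGroup() {
    for (std::thread &t : threads)
      t.join();
  }
};

// Hypothetical, simplified data model: a chunk is placed at an offset
// relative to its output section; a section is placed at a file offset.
struct InputChunk {
  std::size_t offset;
  std::vector<std::uint8_t> data;
};
struct OutSection {
  std::size_t offset;
  std::vector<InputChunk> chunks;
};

// Queue the copies for one output section on the shared task group. The
// caller-side batching closes a batch once it holds at least taskSizeLimit
// bytes, instead of relying on a generic parallel-for task size.
void writeTo(const OutSection &sec, std::uint8_t *buf, SimpleTaskGroup &tg,
             std::size_t taskSizeLimit) {
  std::size_t begin = 0, bytes = 0;
  auto flush = [&](std::size_t end) {
    if (begin == end)
      return;
    // begin and end are copied into the task; sec must outlive the group.
    tg.execute([&sec, buf, begin, end] {
      for (std::size_t i = begin; i != end; ++i) {
        const InputChunk &c = sec.chunks[i];
        std::copy(c.data.begin(), c.data.end(), buf + sec.offset + c.offset);
      }
    });
    begin = end;
    bytes = 0;
  };
  for (std::size_t i = 0; i != sec.chunks.size(); ++i) {
    bytes += sec.chunks[i].data.size();
    if (bytes >= taskSizeLimit)
      flush(i + 1);
  }
  flush(sec.chunks.size()); // remaining chunks form the last batch
}

// One task group for all sections, so tasks from different output sections
// can run concurrently. The group's destructor waits for every task.
void writeSections(const std::vector<OutSection> &secs,
                   std::vector<std::uint8_t> &buf,
                   std::size_t taskSizeLimit) {
  SimpleTaskGroup tg;
  for (const OutSection &sec : secs)
    writeTo(sec, buf.data(), tg, taskSizeLimit);
}
```

Because the tasks write to disjoint byte ranges of the output buffer, no
locking is needed in this sketch; the only synchronization point is the join
at the end of writeSections.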

Speed-up with --threads=8:

* clang -DCMAKE_BUILD_TYPE=Release: 1.11x as fast
* clang -DCMAKE_BUILD_TYPE=Debug: 1.10x as fast
* chrome -DCMAKE_BUILD_TYPE=Release: 1.04x as fast
* scylladb build/release: 1.09x as fast

On M1, many benchmarks are only a small fraction of a percent faster. Mozilla showed the largest difference, with the patched linker being about 1.03x as fast.

Differential Revision: https://reviews.llvm.org/D131247