[PATCH] D136701: [LinkerWrapper] Perform device linking steps in parallel

Tue Oct 25 11:15:28 PDT 2022

jhuber6 added a comment.

In D136701#3883218 <https://reviews.llvm.org/D136701#3883218>, @tra wrote:

> I would argue that parallel compilation and linking may need to be disabled by default. I believe similar patches were discussed in the past regarding sub-compilations, but they are relevant for parallel linking, too.
> Google search shows D52193 <https://reviews.llvm.org/D52193>, but I believe there were other attempts in the past. 
> @yaxunl - I vaguely recall that we did discuss parallel HIP/CUDA compilation in the past, but I can't find the details.

I think parallel compilation might be desirable as well, but in my opinion it's a harder sell than parallel linking. As an opt-in feature, however, it would be very helpful in some cases. For example, consider someone creating a static library that supports every GPU architecture LLVM supports; it would be nice to be able to optionally turn on parallelism in the driver.

  clang lib.c -fopenmp -O3 -fvisibility=hidden -foffload-lto -nostdlib --offload-arch=gfx700,gfx701,gfx801,gfx803,gfx900,gfx902,gfx906,gfx908,gfx90a,gfx90c,gfx940,gfx1010,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103,sm_35,sm_37,sm_50,sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86

This is something we might be doing more often as we start trying to provide standard library features on the GPU via static libraries. It might be wasteful to compile for every architecture, but I think it's the soundest approach if we want compatibility.

> These days most of the builds are parallel already and it's very likely that the build system already launches as many jobs as there are CPUs available. Making each compilation launch multiple parallel subcompilations would likely result in way too many simultaneously running processes.
> Granted, linking is done less often than compilation, so having parallel linking may be lucky to be the last remaining process in the parallel build, but it's not unusual to have multiple linker processes running simultaneously during the build either. Linking is often the most resource-heavy part of the build, so I would not be surprised if even a few linker instances would cause problems if they spawn parallel sub-linking jobs.

`lld` already uses all available threads for its parallel linking, and the linker wrapper runs before the host linker invocation, so it shouldn't interfere with that either. My only concern is that in the future we may try to support faster LTO linking via thin-LTO or some other parallel implementation. I think there's already reasonable precedent for parallel linking.

> Having parallel subcompilations may be useful in some cases -- e.g. distributed compilation with one compilation per remote worker w/ multiple CPUs available on the worker, but that's unlikely to be a common scenario. 
> Having deterministic output is also very important, both for the build repeatability/provenance tracking and for the build system's cache hit rates. Reliably cached slow repeatable compilation will be a net win over fast, but unstable compilation that causes cache churn and triggers more things to be rebuilt.

The only non-determinism here is in the order of the linking jobs across multiple targets and architectures. If the user links only a single architecture, it should behave as before. The average case is still probably one or two architectures at once, in which case this change won't make much of a difference.
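
If the job ordering ever turns out to matter for caching, one option would be to collect results in input order rather than completion order. The Python sketch below only illustrates that idea and is not what this patch does; `link_device` and `link_all_in_order` are hypothetical names standing in for the real device link step.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-in for one device link step; not the linker wrapper's API.
def link_device(job):
    triple, arch = job
    return f"{triple}-{arch}.img"


def link_all_in_order(jobs, max_workers=None):
    """Run the jobs concurrently but return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of `jobs`, regardless of
        # which worker happens to finish first.
        return list(pool.map(link_device, jobs))


jobs = [
    ("amdgcn-amd-amdhsa", "gfx908"),
    ("amdgcn-amd-amdhsa", "gfx90a"),
    ("nvptx64-nvidia-cuda", "sm_70"),
]
images = link_all_in_order(jobs)
```

The jobs still run in parallel; only the collection step is ordered, so the list of produced images is the same from run to run. Passing a small `max_workers` would also cap how many jobs run at once if oversubscription under a parallel build is a concern.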

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136701/new/

https://reviews.llvm.org/D136701