[PATCH] D136701: [LinkerWrapper] Perform device linking steps in parallel

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Oct 25 11:54:50 PDT 2022


tra added a comment.

In D136701#3883300 <https://reviews.llvm.org/D136701#3883300>, @jhuber6 wrote:

> However, as an opt-in feature it would be very helpful in some cases.

I'm OK with the explicit opt-in.

> Like consider someone creating a static library that supports every GPU architecture LLVM supports, it would be nice to be able to optionally turn on parallelism in the driver.

Yes, but the implicit assumption here is that you have sufficient resources. If you create N libraries, each for M architectures, your build machine may not have enough memory for N*M linkers.
Having N*M processes may or may not be an issue by itself, but if each of those linkers is an `lld` that wants to run its own K parallel threads, that will not help matters.
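
To put rough numbers on it (purely illustrative): with a command like the one quoted below, M is about 36, so a parallel build that reaches eight such link steps at once is already looking at roughly 8 * 36 = 288 device link jobs, before counting any threads the individual linkers start on their own. Something as ordinary as

  # illustrative only: eight multi-architecture libraries linked concurrently
  make -j8 liba.so libb.so libc.so libd.so libe.so libf.so libg.so libh.so

is enough to get there.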

In other words, I agree that it may be helpful in some cases, but I can also see how it may actually hurt the build, possibly catastrophically.

>   clang lib.c -fopenmp -O3 -fvisibility=hidden -foffload-lto -nostdlib --offload-arch=gfx700,gfx701,gfx801,gfx803,gfx900,gfx902,gfx906,gfx908,gfx90a,gfx90c,gfx940,gfx1010,gfx1030,gfx1031,gfx1032,gfx1033,gfx1034,gfx1035,gfx1036,gfx1100,gfx1101,gfx1102,gfx1103,sm_35,sm_37,sm_50,sm_52,sm_53,sm_60,sm_61,sm_62,sm_70,sm_72,sm_75,sm_80,sm_86
>
> This is something we might be doing more often as we start trying to provide standard library features on the GPU via static libraries. It might be wasteful to compile for every architecture but I think it's the soundest approach if we want compatibility.

My point is that grabbing resources will likely break the build system's assumptions about their availability. How that would affect the build is anyone's guess. With infinite resources, parallel-everything would win, but in practice it's a big maybe: likely a win for small builds, and probably a wash or a regression for a larger build with multiple such targets.

Ideally there would be a way to cooperate with the build system and let it manage the scheduling, but I don't think we have a good way of doing that.
E.g. for CUDA compilation I was thinking of exposing per-GPU sub-compilations (well, we already do with --cuda-device-only/--cuda-host-only), providing a way to create a combined object from them, and then letting the build system manage how those per-GPU compilations are launched. The problem there is that the build system would need to know our under-the-hood implementation details, so such an approach would be very fragile. The way the new driver does things may be a bit more suitable for this, but I suspect it would still be hard to do.
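
Roughly the shape I have in mind, using flags that exist today; the step that folds the per-GPU objects back into a single object is exactly the missing piece, so it's only a placeholder here:

  # the build system schedules these jobs however it likes
  clang -x cuda lib.cu --cuda-device-only --offload-arch=sm_70 -c -o lib.sm_70.o
  clang -x cuda lib.cu --cuda-device-only --offload-arch=sm_80 -c -o lib.sm_80.o
  clang -x cuda lib.cu --cuda-host-only -c -o lib.host.o
  # <combine lib.host.o + lib.sm_*.o into one object>  -- the part that would need to be provided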

> `lld` already uses all available threads for its parallel linking, the linker wrapper runs before the host linker invocation so it shouldn't interfere either.

You do have a point here. As long as we don't end up with too many threads (e.g. we guarantee that each per-offload linker instance does not run its own parallel threads), offload linking may be similar to parallel lld.
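
For the AMDGPU targets, where the device linker is lld, that guarantee can at least be approximated from the command line today (a sketch; it assumes `-Xoffload-linker` forwards the flag to the device link step, and the nvlink-based NVPTX links have no equivalent of `--threads`):

  # keep each device lld instance single-threaded while the wrapper
  # fans out across architectures
  clang main.c -fopenmp --offload-arch=gfx908,gfx90a \
        -Xoffload-linker --threads=1 -o main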

> This is only non-deterministic for the order of linking jobs between several targets and architectures. If the user only links a single architecture it should behave as before.

I'm not sure what you mean. Are you saying that linking with `--offload-arch=gfx700` is repeatable, but with `--offload-arch=gfx700,gfx701` it's not? That would still be a problem.
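
For what it's worth, that is easy to check directly: run the same multi-architecture link twice and compare the outputs byte for byte, e.g.:

  # sketch: two identical multi-arch links should produce identical bits
  clang main.c -fopenmp --offload-arch=gfx700,gfx701 -o run1
  clang main.c -fopenmp --offload-arch=gfx700,gfx701 -o run2
  cmp run1 run2 && echo identical || echo differ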

> The average case is still probably going to be one or two architectures at once, in which case this change won't make much of a difference.

Any difference is a difference, as far as content-based caching and provenance tracking are concerned.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136701/new/

https://reviews.llvm.org/D136701


