[PATCH] D68820: win: Move Parallel.h off concrt to cross-platform code

Alexandre Ganea via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu May 14 14:44:23 PDT 2020


aganea added inline comments.


================
Comment at: llvm/lib/Support/Parallel.cpp:73
       for (size_t i = 1; i < ThreadCount; ++i) {
         std::thread([=] { work(); }).detach();
       }
----------------
rnk wrote:
> I've belatedly realized that this means that LLVM is doing thread management on its own, i.e. every linker invocation spawns `hardware_concurrency()` threads. My understanding is that ConCRT is built on the system worker thread pool, which helps prevent oversubscription of CPU resources.
> 
> While @aganea measured that this change improved benchmarks, this change could lead to bad throughput when multiple link jobs run concurrently. Today, LLD is not very parallel, but this may become more of an issue as we use more and more parallelism for debug info merging. At some point in the future, we should try measuring the impact of this change on the performance of three links running in parallel, and see if using the NT worker pool gives benefits in that case. For now, though, let's not get ahead of ourselves with unmeasured concerns and leave this as is.
One cheap alternative is to always use `heavyweight_hardware_concurrency()` by default, and let the user pass `--threads=%NUMBER_OF_PROCESSORS%` if they want `hardware_concurrency()`.
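To illustrate what I mean — just a sketch with made-up names, assuming a 2-way SMT machine (the real `heavyweight_hardware_concurrency()` queries the physical core count rather than halving):

  #include <thread>

  // Sketch: pick a ThreadPool size. Default to one thread per physical
  // core (the "heavyweight" policy); only use every hardware thread when
  // the user explicitly asks, e.g. --threads=%NUMBER_OF_PROCESSORS%.
  // `requestedThreads` is 0 when --threads was not passed.
  unsigned pickThreadCount(unsigned requestedThreads) {
    if (requestedThreads != 0)
      return requestedThreads;          // explicit --threads=N wins
    unsigned HW = std::thread::hardware_concurrency();
    return HW > 1 ? HW / 2 : 1;         // assume 2-way SMT here
  }

That keeps a plain link invocation friendly to other jobs on the machine, while still letting a user opt back into full oversubscription.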

In the absence of a global decision-maker, `heavyweight_hardware_concurrency()` is a bit of a hack. Letting an external build system like Ninja do that through static flags, i.e. `--threads` or `/opt:lldltojobs`, doesn't work too well either. You can end up with large spans of time where nothing is happening, because that part of the application (LLD) isn't multi-threaded, or because the `ThreadPool`'s jobs are cooling down, as below at time 100:

{F11923724}

I've tried increasing the number of threads, to see how it would react. It seems every extra ThinLTO thread above my hardware thread count adds roughly 150 ms to the execution. For example, running an input on 72 threads takes 108 sec, while the same input on 100 threads takes 113 sec. I don't know if the relation is linear, but it gives an idea. Context-switching between applications would probably be even more costly; I assume two lld-link processes running side by side, each using 72 threads, would cost even more.
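(Sanity-checking from those two data points alone: (113 s - 108 s) / (100 - 72) extra threads ≈ 180 ms per extra thread, roughly consistent with the 150 ms estimate.)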

I think a platform-independent solution is needed here. If we have several LLDs running, we could dynamically throttle the number of threads in each `ThreadPool`, through some kind of IPC. We "just" need to ensure there aren't more than N threads running at any one time, while taking into account: affinity, hyper-threading/cache affinity, core-local memory, and multi-socket machines.
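As a strawman for that IPC — purely a sketch, with a hypothetical semaphore name, and ignoring the affinity/NUMA points above — the simplest machine-wide throttle is a named semaphore acting as a token pool shared by all lld-link processes:

  #include <windows.h>

  // Sketch: machine-wide pool of N thread tokens, shared across all
  // lld-link processes via a named semaphore. The name is hypothetical.
  class ThreadTokenPool {
    HANDLE Sem;
  public:
    explicit ThreadTokenPool(long MaxThreads) {
      // First process creates the pool; later ones open the same one.
      Sem = CreateSemaphoreA(/*lpSemaphoreAttributes=*/nullptr,
                             /*lInitialCount=*/MaxThreads,
                             /*lMaximumCount=*/MaxThreads,
                             "Local\\LLDThreadTokens");
    }
    ~ThreadTokenPool() { CloseHandle(Sem); }
    // Each ThreadPool worker takes a token before running a task...
    void acquire() { WaitForSingleObject(Sem, INFINITE); }
    // ...and returns it afterwards, so idle processes lend capacity.
    void release() { ReleaseSemaphore(Sem, 1, /*lpPreviousCount=*/nullptr); }
  };

Each worker would wrap a task in acquire()/release(), so when one link's `ThreadPool` cools down, its tokens immediately flow to the other links. A real solution would also need to place threads (affinity, NUMA), which a plain counter can't express.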

How would we interface with Ninja? LLD wouldn't know how many free "lanes" Ninja has. Should we retain, increase, or remove `LLVM_PARALLEL_LINK_JOBS`? We could build some kind of generic IPC API to be used by Ninja, but then what happens with build systems that don't implement it: make, Fastbuild, MSBuild, etc.?

Another way would be to embed the compiler & the linker into the build system (not necessarily in the way I was showing last year). There's value in doing so; one example is the usage of clang-scan-deps I was showing: it lets the build system extract dependency information very quickly, instead of invoking thousands of processes, while doing memoization as much as possible. The same thing can be achieved for preprocessing, compilation, or linking. Lots of things to be done, not enough time :-)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D68820/new/

https://reviews.llvm.org/D68820





More information about the llvm-commits mailing list