[llvm-dev] Multi-Threading Compilers

Thu Mar 26 11:55:51 PDT 2020

Hello everyone,

Just to add a bit of spice to the “Multi-Threading Compilers” discussion (sorry for bringing only high-level ideas):

We are heavy users of unity files (aka blobs or jumbos).

Unity files are a big pain and add extra complexity, but at the same time they provide tremendous build-time reductions, 10x or more. Our game projects typically read more than 50,000 files during the full build of a single target, of which 20,000 are .CPPs. The same target built with unity files compiles only 600 unity .CPPs, which together aggregate all 20,000 original .CPPs. Building the 20,000 TUs locally on a modern 3.7 GHz 6-core PC takes more than 2 hours 30 min. With unity files, it takes 20 minutes. Distributing it remotely on a pool of machines takes 5 min. Rebuilding with everything cached takes 45 sec.
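For readers unfamiliar with unity builds, here is a minimal sketch of the kind of pre-build step described above: it groups source files into batches and emits one .cpp per batch that #includes the originals. The function name, the fixed batch size and the unity file naming are assumptions made for illustration, not our actual tool.

```python
def make_unity_files(sources, batch_size):
    """Return {unity_name: unity_text}.

    Each generated unity .cpp simply #includes up to `batch_size` of the
    original sources, so the compiler sees them as a single TU.
    """
    unities = {}
    ordered = sorted(sources)  # deterministic order for a given file set
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        name = f"unity_{start // batch_size:04d}.cpp"  # hypothetical naming
        unities[name] = "".join(f'#include "{src}"\n' for src in batch)
    return unities


if __name__ == "__main__":
    for name, text in make_unity_files(["b.cpp", "a.cpp", "c.cpp"], 2).items():
        print(f"// {name}")
        print(text)
```

With the numbers above, 20,000 sources in 600 unities works out to roughly 33 sources per unity.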

However, we now depend on the order of files in the unities. If files or folders are added to or removed from the codebase, the contents of a unity can change, which invalidates the cache for that unity CPP. That happens quite often in production.
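To make the invalidation problem concrete, here is a small sketch. It assumes unities are formed by cutting the sorted file list into fixed-size chunks and that a unity's cache key depends on its member list; both are simplifications for illustration, and all names are hypothetical.

```python
import hashlib


def chunk(files, size):
    """Cut the sorted file list into consecutive unities of `size` files."""
    ordered = sorted(files)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def unity_keys(files, size):
    """One cache key per unity, derived from its ordered member list.

    File contents are assumed unchanged, so only membership matters here.
    """
    return [hashlib.sha256("\n".join(members).encode()).hexdigest()
            for members in chunk(files, size)]


def invalidated(before, after, size):
    """Count unities in `after` whose key was not in the cache for `before`."""
    cached = set(unity_keys(before, size))
    return sum(1 for key in unity_keys(after, size) if key not in cached)


if __name__ == "__main__":
    files = [f"src/file_{i:03d}.cpp" for i in range(100)]
    grown = files + ["src/aaa_new.cpp"]  # sorts before every existing file
    print(invalidated(files, grown, 10), "of", len(chunk(grown, 10)),
          "unities need a rebuild")
```

In this toy setup, a single new file that sorts first shifts every chunk boundary, so every unity misses the cache, whereas a file that sorts last only adds one new unity.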

Unities also induce higher build times in some cases (spikes), as I showed in a previous post in this thread. Without inspecting the AST, it is hard to determine an optimal “cutting” point when building the unity .CPPs. We can end up with unities that include template-heavy .CPPs and take a lot longer to compile than other unity files.
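As an illustration of what a better “cutting” could look like if per-file cost estimates were available, here is a sketch of a greedy longest-first assignment. The cost numbers are an assumption (they could come from a previous build's timings, for instance); obtaining them reliably is exactly the hard part mentioned above, and this is not what our tooling does today.

```python
import heapq


def balance_unities(costs, num_unities):
    """Greedy balancing: most expensive files first, each into the
    currently cheapest unity.

    costs: {source: estimated compile cost} (hypothetical estimates).
    Returns a list of (total_cost, [sources]), one entry per unity.
    """
    # The unique index breaks ties so the member lists are never compared.
    heap = [(0.0, idx, []) for idx in range(num_unities)]
    heapq.heapify(heap)
    by_cost = sorted(costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for src, cost in by_cost:
        total, idx, members = heapq.heappop(heap)
        members.append(src)
        heapq.heappush(heap, (total + cost, idx, members))
    ordered = sorted(heap, key=lambda entry: entry[1])
    return [(total, members) for total, _, members in ordered]


if __name__ == "__main__":
    costs = {"templates.cpp": 9.0, "a.cpp": 1.0, "b.cpp": 1.0,
             "c.cpp": 1.0, "d.cpp": 1.0}
    for total, members in balance_unities(costs, 2):
        print(total, members)
```

Here the template-heavy file ends up alone in its own unity, while the four cheap files share the other one.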

If we are to discuss multi-threading, this means we are discussing compile-time performance and how compilation would scale in the future. I think we should consider supporting the functionality of unity files in the compiler itself (maybe behind a flag if it’s non-conformant).

While I don't know exactly how that fits into this (multi-threading) discussion, efficiently coalescing the compilation of several TUs should be the compiler's responsibility, and would likely be more efficient than doing it with a pre-build tool, as we do today.

In essence, we would provide a large number of files to Clang, let's say with the same options (the /MP flag is still WIP, I'll get back to that soon, [1]):

                clang-cl /c a.cpp b.cpp c.cpp d.cpp ... /MP

We would then expect the compiler to (somehow) share tokenization, lexing, file caching, preprocessing, compilation, optimizations, computations, etc. across TUs, preferably in a lock-free manner. Overlapped or duplicated computations across threads, in the manner of transactions, would probably be fine if the computations are small and if we want to avoid locks (but this needs to be profiled). Also, the recent trend of NUMA processor “tiles”, as well as on-chip HBM2 memory per “tile”, could change the way multi-threaded code is written. Perhaps state would need to be duplicated in local NUMA memory for maximum performance. Additionally, I’m not sure how (or if) lock-based programming will scale past a few hundred or a few thousand cores in a single image without major contention. Maybe it will, as long as locks don’t cross NUMA boundaries. This needs to be considered in the design.
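A rough sketch of the “duplicated computations in the manner of transactions” idea: each thread computes a missing result on its own and then tries to publish it, and whichever thread publishes first wins. This assumes CPython, where `dict.setdefault` with string keys stands in for an atomic insert-if-absent; a real compiler would use a concurrent hash table instead. The class, the `tokenize` stand-in and the header names are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class SharedCache:
    """Publish-once cache shared by all TUs.

    No lock is taken around the computation: two threads may both compute
    the value for the same key, but only the first published one is kept
    and returned to everyone.
    """

    def __init__(self):
        self._data = {}

    def get_or_compute(self, key, compute):
        hit = self._data.get(key)
        if hit is not None:
            return hit
        value = compute(key)  # possibly duplicated work, by design
        return self._data.setdefault(key, value)  # first publisher wins


def tokenize(header):
    """Stand-in for real lexing: split a header name into pieces."""
    return tuple(header.replace(".", " . ").split())


def compile_tu(cache, headers):
    """Pretend to compile one TU by fetching its tokenized headers."""
    return [cache.get_or_compute(h, tokenize) for h in headers]


if __name__ == "__main__":
    cache = SharedCache()
    tus = [["a.h", "b.h"], ["b.h", "c.h"], ["a.h", "c.h"]]
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(lambda hs: compile_tu(cache, hs), tus))
    print(results)
```

Whether the wasted duplicate work costs less than the contention a lock would introduce is what would need to be profiled.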

So while the discussion seems to revolve around multi-threading single TUs, it’d be nice to also consider the possibility of sharing state between TUs, which may mean retaining runtime state in global hash table(s), and possibly persisting that state on disk, or in a DB, after the compilation. We could maybe draw a parallel with the work done by SN Systems (Program Repo, see [2]) or zapcc [3].
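To illustrate the idea of persisting state in a DB after the compilation, here is a minimal sketch of a content-addressed store backed by SQLite. It is not how Program Repo or zapcc work internally; the table layout, the key derivation and the class name are assumptions made purely for illustration.

```python
import hashlib
import os
import sqlite3
import tempfile


class PersistentStore:
    """Tiny on-disk key/value store for compilation results."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS artifacts "
            "(key TEXT PRIMARY KEY, value BLOB)")

    @staticmethod
    def key_for(source_text, options):
        """Key = hash of the options plus the source text (hypothetical scheme)."""
        digest = hashlib.sha256()
        digest.update(options.encode())
        digest.update(b"\0")
        digest.update(source_text.encode())
        return digest.hexdigest()

    def get(self, key):
        row = self.db.execute(
            "SELECT value FROM artifacts WHERE key = ?", (key,)).fetchone()
        return None if row is None else row[0]

    def put(self, key, value):
        # Keep the first value stored for a key; later duplicates are ignored.
        self.db.execute(
            "INSERT OR IGNORE INTO artifacts VALUES (?, ?)", (key, value))
        self.db.commit()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "state.db")
        first = PersistentStore(path)
        key = PersistentStore.key_for("int f() { return 1; }", "/O2")
        first.put(key, b"compiled-bytes")
        first.db.close()
        second = PersistentStore(path)  # simulates a later compiler run
        print(second.get(key))
        second.db.close()
```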

Thanks!

Alex.

[1] https://reviews.llvm.org/D52193

[2] https://github.com/SNSystems/llvm-project-prepo

[3] https://github.com/yrnkrn/zapcc
