[PATCH] D71775: [ThreadPool] On Windows, extend usage to all CPU sockets and all NUMA groups

Alexandre Ganea via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Dec 20 09:20:26 PST 2019


aganea created this revision.
aganea added reviewers: mehdi_amini, rnk, tejohnson, russell.gallop, dexonsmith.
Herald added subscribers: llvm-commits, cfe-commits, usaxena95, dang, jfb, kadircet, arphaman, steven_wu, jkorous, MaskRay, javed.absar, hiraditya, kristof.beyls, arichardson, emaste.
Herald added a reviewer: JDevlieghere.
Herald added a reviewer: espindola.
Herald added projects: clang, LLVM.

**TL;DR:** This patch ensures that, on Windows, all CPU sockets and all NUMA nodes are used by the `ThreadPool`. The goal is to have LLD/ThinLTO use all hardware threads in the system, which currently isn't the case on multi-socket or large-core-count systems.

(this could possibly be split into a few patches, but I just wanted an overall opinion)

Background
----------

Windows doesn't have a flat `cpu_set_t` like Linux. Instead, it exposes hardware CPUs (or NUMA nodes) to applications through the concept of "processor groups". A "processor" is the smallest unit of execution on a CPU: a hyper-thread if SMT is active, a core otherwise. There was a limit of 32 processors on older 32-bit versions of Windows, which was later raised to 64 processors on 64-bit versions of Windows. This limit comes from the affinity mask, which historically has been `sizeof(void*)` bytes wide (and still is). Consequently, the concept of "processor groups" was introduced to deal with systems that have more than 64 hyper-threads.
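
As an illustration (not part of the patch), the Win32 API exposes this structure through `GetActiveProcessorGroupCount()` and `GetActiveProcessorCount()`, available since Windows 7; a minimal sketch:

```cpp
// Sketch: enumerate processor groups on Windows (Windows 7+).
// Illustration only, not part of the patch.
#include <windows.h>
#include <cstdio>

int main() {
  WORD GroupCount = GetActiveProcessorGroupCount();
  DWORD Total = 0;
  for (WORD Group = 0; Group < GroupCount; ++Group) {
    // Number of logical processors in this group (at most 64).
    DWORD N = GetActiveProcessorCount(Group);
    std::printf("Group %u: %lu logical processors\n", (unsigned)Group, N);
    Total += N;
  }
  // std::thread::hardware_concurrency() typically reports only the
  // processors of the group the process was assigned to, not Total.
  std::printf("Total: %lu logical processors\n", Total);
  return 0;
}
```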

By default, Windows assigns only one "processor group" to each starting application, in a round-robin manner. If the application wants to use more processors, it has to do so programmatically, by assigning its threads to other "processor groups". This also means that affinity cannot cross "processor group" boundaries: one can only specify a "preferred" group on startup, but the application is free to use more groups if it wants to.
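
A minimal sketch of what "assigning threads to other processor groups" means at the Win32 level, using `SetThreadGroupAffinity()`; the group number and mask below are placeholders and error handling is omitted:

```cpp
// Sketch: move the current thread to a given processor group.
// Illustration only; the group number and mask are placeholders.
#include <windows.h>

bool MoveCurrentThreadToGroup(WORD Group) {
  GROUP_AFFINITY Affinity = {};
  Affinity.Group = Group;
  // Allow all active processors of that group.
  DWORD ProcessorsInGroup = GetActiveProcessorCount(Group);
  Affinity.Mask = (ProcessorsInGroup >= 64)
                      ? ~KAFFINITY(0)
                      : (KAFFINITY(1) << ProcessorsInGroup) - 1;
  return SetThreadGroupAffinity(GetCurrentThread(), &Affinity,
                                /*PreviousGroupAffinity=*/nullptr) != 0;
}
```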

This creates a peculiar situation where newer CPUs like the AMD EPYC 7702P (64 cores, 128 hyper-threads) are exposed by the OS as two (2) "processor groups". This means that, by default, an application can only use half of the cores. This situation will only get worse in the years to come, as dies with more cores appear on the market.

The changes in this patch
-------------------------

The problem is that the `heavyweight_hardware_concurrency()` API was introduced so that only one hardware thread per core would be used. Once that API returns, //that original intention is lost//. Consider a Windows system with 2 CPU sockets, 18 cores each, each core having 2 hyper-threads, for a total of 72 hyper-threads. Both `heavyweight_hardware_concurrency()` and `hardware_concurrency()` currently return 36, because on Windows they are simply wrappers over `std::thread::hardware_concurrency()` -- which only returns the processors of the current "processor group".
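
A simplified sketch of the pre-patch behavior described above (not the actual LLVM code):

```cpp
// Simplified sketch of the current (pre-patch) behavior on Windows;
// not the actual LLVM implementation.
#include <thread>

unsigned hardware_concurrency_sketch() {
  // On Windows this only counts the processors of the process's
  // "processor group": 36 in the 2 x 18-core SMT example, not 72.
  return std::thread::hardware_concurrency();
}

unsigned heavyweight_hardware_concurrency_sketch() {
  // The "one thread per physical core" intention is flattened into a
  // plain number and lost by the time threads are actually spawned.
  return hardware_concurrency_sketch();
}
```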

What if we wanted to use all "processor groups"? Even if we properly implemented `heavyweight_hardware_concurrency()` to return one thread per core, what should it then return: 18 (the current group) or 36 (the whole system)?
What if the user specified `/opt:lldltojobs=36`? Should we create all 36 threads in the current "processor group", or should we dispatch the extra threads to the second "processor group"?

To solve this, we capture (and retain) the initial intention until the point of usage, through a new `ThreadPoolStrategy` class. The decision about the number of threads to use is deferred as late as possible, to the moment the `std::thread`s are created (`ThreadPool` in the case of ThinLTO).
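
To make the intent concrete, here is a hypothetical sketch of how such a strategy could flow from the caller down to the pool; the names and signatures are illustrative of the idea above and may differ from the patch:

```cpp
// Hypothetical sketch; names and signatures are illustrative only.
#include "llvm/Support/ThreadPool.h"
#include "llvm/Support/Threading.h"

void runThinLTOBackends(unsigned UserJobs /* e.g. from /opt:lldltojobs= */) {
  // Capture the intention ("one thread per core", "cap at UserJobs")
  // instead of a flat thread count...
  llvm::ThreadPoolStrategy S = llvm::heavyweight_hardware_concurrency(UserJobs);
  // ...and let the pool resolve it only when the std::threads are created,
  // at which point they can be spread over all "processor groups".
  llvm::ThreadPool Pool(S);
  // Pool.async(...); Pool.wait();
}
```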

Discussion
----------

Ideally, we should consider all "processors" (on Windows) or all "CPUs" (Linux) as equal, in which case `heavyweight_hardware_concurrency()` wouldn't be needed. I'm not sure how micro-managing threads, cores and NUMA nodes will scale in the years to come (probably not well). Will it still make sense to say "I don't want hyper-threads"? Or to pass `/opt:lldltojobs=whatever` on a system with thousands of cores? How would that work with NUMA affinity? For example, the Fujitsu A64FX <https://hexus.net/tech/news/cpu/121382-fujitsu-reveals-a64fx-arm-based-supercomputer-cpu/> has 4x 12-core "tiles" on the same die, each tile connected to its own 8 GB of HBM2 memory located on the CPU die. How would we dispatch threads in that case? The AMD EPYC uses the same concept of "tiles"; it doesn't have on-die memory yet, but most likely EPYC v3 will use a similar architecture.

@tejohnson : Teresa, since you added `heavyweight_hardware_concurrency()`, do you have a benchmark comparing ThinLTO running with `heavyweight_hardware_concurrency()` versus `hardware_concurrency()`? (I haven't run that test yet.)
It would make things a lot simpler if we didn't have that API, and in general considered that we can use all hardware threads in the system, and that they perform equally.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D71775

Files:
  clang-tools-extra/clang-doc/tool/ClangDocMain.cpp
  clang-tools-extra/clangd/TUScheduler.cpp
  clang-tools-extra/clangd/index/Background.cpp
  clang-tools-extra/clangd/index/Background.h
  clang-tools-extra/clangd/index/BackgroundRebuild.h
  clang/lib/Tooling/AllTUsExecution.cpp
  clang/lib/Tooling/DependencyScanning/DependencyScanningFilesystem.cpp
  clang/tools/clang-scan-deps/ClangScanDeps.cpp
  lld/ELF/SyntheticSections.cpp
  llvm/include/llvm/ADT/BitVector.h
  llvm/include/llvm/ADT/SmallBitVector.h
  llvm/include/llvm/LTO/LTO.h
  llvm/include/llvm/Support/ThreadPool.h
  llvm/include/llvm/Support/Threading.h
  llvm/lib/CodeGen/ParallelCG.cpp
  llvm/lib/ExecutionEngine/Orc/LLJIT.cpp
  llvm/lib/LTO/LTO.cpp
  llvm/lib/LTO/LTOBackend.cpp
  llvm/lib/LTO/ThinLTOCodeGenerator.cpp
  llvm/lib/Support/Host.cpp
  llvm/lib/Support/Parallel.cpp
  llvm/lib/Support/ThreadPool.cpp
  llvm/lib/Support/Threading.cpp
  llvm/lib/Support/Unix/Threading.inc
  llvm/lib/Support/Windows/Threading.inc
  llvm/tools/dsymutil/DwarfLinker.cpp
  llvm/tools/dsymutil/dsymutil.cpp
  llvm/tools/gold/gold-plugin.cpp
  llvm/tools/llvm-cov/CodeCoverage.cpp
  llvm/tools/llvm-cov/CoverageExporterJson.cpp
  llvm/tools/llvm-cov/CoverageReport.cpp
  llvm/tools/llvm-lto2/llvm-lto2.cpp
  llvm/tools/llvm-profdata/llvm-profdata.cpp
  llvm/unittests/ADT/BitVectorTest.cpp
  llvm/unittests/Support/Host.cpp
  llvm/unittests/Support/TaskQueueTest.cpp
  llvm/unittests/Support/ThreadPool.cpp
  llvm/unittests/Support/Threading.cpp

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D71775.234910.patch
Type: text/x-patch
Size: 50659 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/cfe-commits/attachments/20191220/d66d32aa/attachment-0001.bin>

