[PATCH] D71786: RFC: [Support] On Windows, add optional support for rpmalloc

Alexandre Ganea via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Dec 20 14:18:52 PST 2019


aganea created this revision.
aganea added reviewers: rnk, tejohnson, russell.gallop, hans.
Herald added subscribers: llvm-commits, jfb, dexonsmith, hiraditya, krytarowski, arichardson, mehdi_amini, mgorny, emaste.
Herald added a reviewer: jfb.
Herald added a project: LLVM.

This patch optionally replaces the default Windows heap allocator with rpmalloc <https://github.com/mjansson/rpmalloc> (public domain licence).

The Windows heap is thread-safe by default, and the ThinLTO codegen does a lot of allocations in each thread:

F11111915: lld-link-thinlto-crt-alloc2.PNG <https://reviews.llvm.org/F11111915>

On many-core systems, this effectively blocks other threads to a point where only a very small fraction of the CPU time is used:

Before patch (here on Windows 10, build 1709):
F11111904: lld-link-thinlto.PNG <https://reviews.llvm.org/F11111904>

We can see that a whopping 80% of the CPU time is spent waiting (blocked) on other threads (780 sec / 72 cores ≈ 11 sec total) (graph with D71775 <https://reviews.llvm.org/D71775> applied):

F11111970: 6140_etw_linklto_clang10mt_spintime.PNG <https://reviews.llvm.org/F11111970>

Threads are blocked waiting on the heap lock:

F11111972: 6140_etw_linklto_clang10mt_spintime2.PNG <https://reviews.llvm.org/F11111972>

The thread above is woken when another thread releases the heap lock:

F11111973: 6140_etw_linklto_clang10mt_spintime3.PNG <https://reviews.llvm.org/F11111973>
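
To make the contention described above easier to reproduce in isolation, here is a minimal sketch of a multi-threaded allocation stress loop. It is not part of the patch: `runAllocStress` and its parameters are hypothetical names chosen for illustration, and the timings it reports depend entirely on the allocator, OS build and core count.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper (not from the patch): each of `numThreads` threads
// performs `iterations` malloc/free pairs of `blockSize` bytes.
// Returns the elapsed wall-clock time in seconds.
double runAllocStress(unsigned numThreads, std::size_t iterations,
                      std::size_t blockSize) {
  assert(numThreads > 0 && blockSize > 0);

  auto worker = [iterations, blockSize] {
    for (std::size_t i = 0; i < iterations; ++i) {
      void *p = std::malloc(blockSize);
      if (!p)
        std::abort();
      // Write through a volatile pointer so the allocation is really used.
      *static_cast<volatile char *>(p) = 1;
      std::free(p);
    }
  };

  auto start = std::chrono::steady_clock::now();
  std::vector<std::thread> threads;
  threads.reserve(numThreads);
  for (unsigned t = 0; t < numThreads; ++t)
    threads.emplace_back(worker);
  for (auto &th : threads)
    th.join();
  std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count();
}
```

Calling it with the same per-thread work for 1 thread and then for as many threads as there are cores gives a rough idea of how well the allocator scales: if every allocation takes a shared lock, the wall time should grow with the thread count instead of staying roughly flat.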

----

After patch:
F11111918: 6140_link_ThinLTO_clang10_MT.PNG <https://reviews.llvm.org/F11111918>

After patch with D71775 <https://reviews.llvm.org/D71775> applied, all CPU sockets are used:
(the dark blue part of the graph represents the time spent in the kernel, see below why)
F11111922: 6140_link_ThinLTO_clang10_MT_rpmalloc_AllCores.PNG <https://reviews.llvm.org/F11111922>

In addition to the heap lock, there's a kernel bug <https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc> in some versions of Windows 10: accessing newly allocated virtual pages triggers the page zero-out mechanism, which is itself protected by a global lock and further blocks memory allocations.
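
The access pattern that exercises this path, several threads touching freshly allocated pages at the same time, can be sketched as follows. This is only an illustration: `touchFreshPages` is a hypothetical helper, the 4 KiB page size is an assumption, and whether the buffers really come from fresh virtual pages depends on the allocator in use.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Assumed page size; 4 KiB is typical on x86-64 but not guaranteed.
constexpr std::size_t kAssumedPageSize = 4096;

// Hypothetical helper (not from the patch): every thread allocates
// `bytesPerThread` bytes and writes one byte per assumed page, so that the
// OS has to back each page on first access.
// Returns the total number of page-sized steps written across all threads.
std::size_t touchFreshPages(unsigned numThreads, std::size_t bytesPerThread) {
  assert(numThreads > 0);

  std::vector<std::size_t> touched(numThreads, 0);
  std::vector<std::thread> threads;
  threads.reserve(numThreads);
  for (unsigned t = 0; t < numThreads; ++t) {
    threads.emplace_back([&touched, t, bytesPerThread] {
      // Uninitialized on purpose: the first write happens in the loop below.
      std::unique_ptr<char[]> buf(new char[bytesPerThread]);
      volatile char *p = buf.get();
      std::size_t count = 0;
      for (std::size_t off = 0; off < bytesPerThread; off += kAssumedPageSize) {
        p[off] = 1;
        ++count;
      }
      touched[t] = count; // each thread writes only its own slot
    });
  }
  for (auto &th : threads)
    th.join();

  std::size_t total = 0;
  for (std::size_t n : touched)
    total += n;
  return total;
}
```

Timing this for a growing number of threads with large buffers is one way to check whether first-touch page faults scale on a given Windows build; no particular result is implied here.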

If we dig deeper, we do see `ExpWaitForSpinLockExclusiveAndAcquire` taking far too much time (the blue vertical lines show where it is called on the timeline):
(patch is applied)

F11111954: 6140_etw_linklto_rpmalloc_2.PNG <https://reviews.llvm.org/F11111954>

Comparing different builds of Windows: on the latest build, 1909, with this patch and D71775 <https://reviews.llvm.org/D71775> applied, ThinLTO linking now reaches almost 100% CPU usage:

F11112073: 6140_ThinLTO_1709_vs_1909.PNG <https://reviews.llvm.org/F11112073>

The feature can be enabled with the CMake flags `-DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT`. It is currently available only on Windows, but rpmalloc already supports Darwin, FreeBSD and Linux, so it would be easy to enable it on Unix as well. It currently requires /MT because that is easier, and I'm not sure the allocator can be overridden with /MD without patching code at runtime (I could investigate that later, but the DLL thunks slow things down).
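
For readers unfamiliar with link-time allocator replacement, the general idea can be illustrated with the standard C++ replaceable allocation functions. This is a sketch only and not the mechanism used by this patch, which the description above suggests works at the CRT level; `backendAlloc`, `backendFree` and `gAllocCount` are hypothetical names, and `std::malloc` merely stands in for a real allocator entry point.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical counter, present only to make the replacement observable.
std::atomic<std::size_t> gAllocCount{0};

// Stand-ins for a custom allocator's entry points (assumption: a real
// replacement would call into the allocator here instead of malloc/free).
static void *backendAlloc(std::size_t size) {
  return std::malloc(size ? size : 1);
}
static void backendFree(void *p) { std::free(p); }

// Replacing the global operator new routes every ordinary C++ allocation in
// the program through the backend once this definition is linked in.
void *operator new(std::size_t size) {
  void *p = backendAlloc(size);
  if (!p)
    throw std::bad_alloc();
  // Sanity check: the backend must return suitably aligned memory.
  assert(reinterpret_cast<std::uintptr_t>(p) % alignof(std::max_align_t) == 0);
  gAllocCount.fetch_add(1, std::memory_order_relaxed);
  return p;
}

// Matching deallocation functions (unsized and sized forms).
void operator delete(void *p) noexcept { backendFree(p); }
void operator delete(void *p, std::size_t) noexcept { backendFree(p); }
```

The array forms `operator new[]` and `operator delete[]` forward to these by default, so they do not need separate definitions in this sketch. C code and direct `malloc` calls are not affected by it, which is one reason a CRT-level override is the more complete approach.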

---

Overall, this patch along with D71775 <https://reviews.llvm.org/D71775> gives better link times with ThinLTO, on Windows at least. The link times below are for Ubisoft's Rainbow 6: Siege PC Final LTO build. Times are for a full link (no ThinLTO cache).
Clang 10 is a two-stage build.
In case [1] the second stage uses `-DLLVM_USE_CRT_RELEASE=MT`.
In case [2] the second stage uses `-DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT`.
In case [3] the second stage uses `-DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LTO=Thin`, with both `-DCMAKE_C_FLAGS` and `-DCMAKE_CXX_FLAGS` set to `"/GS- -Xclang -O3 -Xclang -fwhole-program-vtables -fstrict-aliasing -march=skylake-avx512"`.
The Clang tests link with ThinLTO, while the MSVC tests evidently run full LTO.
I tested this on about 6 different systems; I will post more results later.

F11112116: ThinLTO_rpmalloc_1.png <https://reviews.llvm.org/F11112116>

F11112117: ThinLTO_rpmalloc_2.png <https://reviews.llvm.org/F11112117>


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D71786

Files:
  llvm/CMakeLists.txt
  llvm/include/llvm/Config/config.h.cmake
  llvm/lib/Support/CMakeLists.txt
  llvm/lib/Support/Windows/Memory.inc
  llvm/lib/Support/Windows/rpmalloc/LICENSE
  llvm/lib/Support/Windows/rpmalloc/malloc.c
  llvm/lib/Support/Windows/rpmalloc/rpmalloc.c
  llvm/lib/Support/Windows/rpmalloc/rpmalloc.h
  llvm/unittests/Support/DynamicLibrary/CMakeLists.txt

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D71786.234949.patch
Type: text/x-patch
Size: 125609 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191220/c99236fb/attachment-0001.bin>

