[PATCH] D71786: RFC: [Support] On Windows, add optional support for rpmalloc
Alexandre Ganea via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Dec 20 14:18:52 PST 2019
aganea created this revision.
aganea added reviewers: rnk, tejohnson, russell.gallop, hans.
Herald added subscribers: llvm-commits, jfb, dexonsmith, hiraditya, krytarowski, arichardson, mehdi_amini, mgorny, emaste.
Herald added a reviewer: jfb.
Herald added a project: LLVM.
This patch optionally replaces the default Windows heap allocator with rpmalloc <https://github.com/mjansson/rpmalloc> (public domain licence).
The Windows heap is thread-safe by default, and the ThinLTO codegen does a lot of allocations in each thread:
F11111915: lld-link-thinlto-crt-alloc2.PNG <https://reviews.llvm.org/F11111915>
On many-core systems, this effectively blocks other threads to a point where only a very small fraction of the CPU time is used:
Before patch (here on Windows 10, build 1709):
F11111904: lld-link-thinlto.PNG <https://reviews.llvm.org/F11111904>
We can see that a whopping 80% of the CPU time is spent waiting (blocked) on other threads (780 sec / 72 cores ≈ 10.8 sec per core) (graph with D71775 <https://reviews.llvm.org/D71775> applied):
F11111970: 6140_etw_linklto_clang10mt_spintime.PNG <https://reviews.llvm.org/F11111970>
Threads are blocked waiting on the heap lock:
F11111972: 6140_etw_linklto_clang10mt_spintime2.PNG <https://reviews.llvm.org/F11111972>
The thread above is awakened by the heap lock being released in another thread:
F11111973: 6140_etw_linklto_clang10mt_spintime3.PNG <https://reviews.llvm.org/F11111973>
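The contention pattern described above can be reproduced in miniature. The following is only a sketch, not code from this patch: `run_alloc_workers` is a hypothetical helper that starts several threads, each doing many small `malloc`/`free` pairs, which is roughly the allocation behaviour attributed to the ThinLTO codegen threads. With an allocator that takes a global lock, wall-clock time should stop improving (or get worse) as the thread count grows. The allocation sizes and the iteration count are arbitrary assumptions.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Hypothetical helper (not part of the patch): every worker performs
// `iterations` small malloc/free pairs. Returns the total number of
// successful allocations across all workers.
std::size_t run_alloc_workers(unsigned num_threads, std::size_t iterations) {
  std::atomic<std::size_t> total{0};
  std::vector<std::thread> workers;
  workers.reserve(num_threads);

  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&total, iterations] {
      std::size_t done = 0;
      for (std::size_t i = 0; i < iterations; ++i) {
        // Arbitrary small sizes (64..176 bytes), chosen for illustration.
        void *p = std::malloc(64 + (i % 8) * 16);
        if (!p)
          break;
        // Write through a volatile pointer so the allocation is not
        // optimized away.
        static_cast<volatile char *>(p)[0] = 1;
        std::free(p);
        ++done;
      }
      total.fetch_add(done, std::memory_order_relaxed);
    });
  }

  for (auto &w : workers)
    w.join();
  return total.load();
}
```

Timing `run_alloc_workers(N, 1000000)` for increasing `N` under a profiler should show whether the threads spend their time allocating or waiting on a lock; the actual numbers depend on the CRT, the Windows build and the machine.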
----
After patch:
F11111918: 6140_link_ThinLTO_clang10_MT.PNG <https://reviews.llvm.org/F11111918>
After patch with D71775 <https://reviews.llvm.org/D71775> applied, all CPU sockets are used:
(the dark blue part of the graph represents the time spent in the kernel, see below why)
F11111922: 6140_link_ThinLTO_clang10_MT_rpmalloc_AllCores.PNG <https://reviews.llvm.org/F11111922>
In addition to the heap lock, there's a kernel bug <https://stackoverflow.com/questions/45024029/windows-10-poor-performance-compared-to-windows-7-page-fault-handling-is-not-sc> in some versions of Windows 10: accessing newly allocated virtual pages triggers the page zero-out mechanism, which is itself protected by a global lock, and this further blocks memory allocations.
If we dig deeper, we effectively see `ExpWaitForSpinLockExclusiveAndAcquire` taking way too much time (the blue vertical lines show where it is called on the timeline):
(patch is applied)
F11111954: 6140_etw_linklto_rpmalloc_2.PNG <https://reviews.llvm.org/F11111954>
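The page-fault side of the problem can be exercised separately from the heap lock. The sketch below is an illustration under assumptions, not code from this patch: `touch_fresh_pages` is a hypothetical helper in which each thread allocates its own buffer and writes one byte per page, so every first write may fault in a fresh page. The 4096-byte page size is an assumption, and whether a given allocation is really backed by newly committed pages depends on the allocator and on the buffer size.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical helper (not part of the patch): each worker allocates an
// uninitialized buffer and writes one byte per page. Returns the total
// number of pages written across all workers.
std::size_t touch_fresh_pages(unsigned num_threads,
                              std::size_t pages_per_thread,
                              std::size_t page_size = 4096) {
  std::atomic<std::size_t> total{0};
  std::vector<std::thread> workers;
  workers.reserve(num_threads);

  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&total, pages_per_thread, page_size] {
      // new[] without an initializer leaves the bytes untouched, so the
      // first write below is the first access to each page.
      std::unique_ptr<unsigned char[]> buf(
          new unsigned char[pages_per_thread * page_size]);
      volatile unsigned char *p = buf.get();
      std::size_t touched = 0;
      for (std::size_t i = 0; i < pages_per_thread; ++i) {
        p[i * page_size] = 1; // first touch of this page
        ++touched;
      }
      total.fetch_add(touched, std::memory_order_relaxed);
    });
  }

  for (auto &w : workers)
    w.join();
  return total.load();
}
```

Running this with large per-thread buffers on an affected Windows 10 build, under an ETW trace, would be one way to check whether kernel time grows with the thread count as in the graph above.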
Comparing different builds of Windows: on the latest build, 1909, with this patch and D71775 <https://reviews.llvm.org/D71775> applied, ThinLTO linking now reaches almost 100% CPU usage:
F11112073: 6140_ThinLTO_1709_vs_1909.PNG <https://reviews.llvm.org/F11112073>
The feature can be enabled with the cmake flags `-DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT`. It is currently available only for Windows, but rpmalloc already supports Darwin, FreeBSD and Linux, so it would be easy to enable it for Unix as well. It currently requires /MT because that is the simplest way to replace the allocator, and I'm not sure the allocator can be overridden under /MD without code patching at runtime (I could investigate that later, but the DLL thunks slow things down).
---
Overall, this patch along with D71775 <https://reviews.llvm.org/D71775> gives more interesting link times with ThinLTO, on Windows at least. The link times below are for Ubisoft's Rainbow 6: Siege PC Final LTO build. Times are for a full link (no ThinLTO cache).
Clang 10 is a two-stage build.
In case [1] the second stage uses `-DLLVM_USE_CRT_RELEASE=MT`.
In case [2] the second stage uses `-DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT`.
In case [3] the second stage uses `-DLLVM_ENABLE_RPMALLOC=ON -DLLVM_USE_CRT_RELEASE=MT -DLLVM_ENABLE_LTO=Thin`, with both `-DCMAKE_C_FLAGS` and `-DCMAKE_CXX_FLAGS` set to `"/GS- -Xclang -O3 -Xclang -fwhole-program-vtables -fstrict-aliasing -march=skylake-avx512"`.
The Clang tests link with ThinLTO, while the MSVC tests evidently run full LTO.
I tested this on about 6 different systems; I will post more results later.
F11112116: ThinLTO_rpmalloc_1.png <https://reviews.llvm.org/F11112116>
F11112117: ThinLTO_rpmalloc_2.png <https://reviews.llvm.org/F11112117>
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D71786
Files:
llvm/CMakeLists.txt
llvm/include/llvm/Config/config.h.cmake
llvm/lib/Support/CMakeLists.txt
llvm/lib/Support/Windows/Memory.inc
llvm/lib/Support/Windows/rpmalloc/LICENSE
llvm/lib/Support/Windows/rpmalloc/malloc.c
llvm/lib/Support/Windows/rpmalloc/rpmalloc.c
llvm/lib/Support/Windows/rpmalloc/rpmalloc.h
llvm/unittests/Support/DynamicLibrary/CMakeLists.txt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D71786.234949.patch
Type: text/x-patch
Size: 125609 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20191220/c99236fb/attachment-0001.bin>