[llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows

Wed Jul 8 09:40:03 PDT 2020

> To be clear, we're talking about replacing the runtime allocator for clang/LLD/etc., right?

This is my understanding. I want to ensure that the CRT debug allocator remains optionally available, and on by default for debug builds, so that I can use it to troubleshoot memory corruption issues in clang/LLVM/etc. itself. The alternative would be instrumenting debug builds of LLVM with ASan to provide similar benefits.

If I’m reading downthread correctly, it takes something like 40 minutes to link clang.exe with LLD using LTO if LLD is using the CRT allocator, and something like 3 minutes if LLD is using some other allocator. Assuming these numbers are correct, and nothing was wrong with the LLD built with the CRT allocator, this certainly seems like a compelling reason to switch allocators. However, I doubt anybody is trying to use an LLD built in debug mode on Windows to link clang.exe with LTO. I imagine it’d take an actual day to finish that build. The main use for a clang.exe built in debug mode on Windows is to build small test programs, lit tests, and such with a debugger attached. For this use case, I believe that the CRT debug allocator is the correct choice.

As a side note, these numbers seem very fishy to me. While it’s tempting to say “malloc is a black box. I ask for a pointer, I get a pointer. I shouldn’t have to know what it does internally” and just replace the allocator, I feel like this merits investigation. Why are we allocating so much? Perhaps we should try to find ways to reduce the number of allocations? Are we doing something silly like creating a new std::vector in every iteration of an inner loop somewhere? If we have tons of unnecessary allocations, we could potentially speed up LLD on all platforms. 3 minutes is still a really long time. If we could get that down to 30 seconds, that would be amazing. I keep hearing that each new version of LLVM takes longer to compile than the last. Perhaps it is time for us to figure out why? Maybe it’s lots of unnecessary allocations?
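To make the std::vector question concrete, here is a minimal sketch of the pattern being asked about. It is not taken from LLD or LLVM; the function names and the workload (summing squares of each row) are hypothetical and chosen only for illustration.

```cpp
#include <vector>

// Hypothetical workload, for illustration only.

// Variant 1: a fresh vector is constructed in every iteration, so each
// non-empty row costs at least one trip to the heap allocator.
long long sumSquaresFreshBuffer(const std::vector<std::vector<int>> &Rows) {
  long long Total = 0;
  for (const std::vector<int> &Row : Rows) {
    std::vector<long long> Scratch; // allocated and freed per iteration
    for (int V : Row)
      Scratch.push_back(static_cast<long long>(V) * V);
    for (long long S : Scratch)
      Total += S;
  }
  return Total;
}

// Variant 2: the buffer is hoisted out of the loop. clear() drops the
// elements but, in practice, keeps the capacity, so later iterations
// usually reuse the existing storage instead of allocating again.
long long sumSquaresReusedBuffer(const std::vector<std::vector<int>> &Rows) {
  long long Total = 0;
  std::vector<long long> Scratch;
  for (const std::vector<int> &Row : Rows) {
    Scratch.clear();
    for (int V : Row)
      Scratch.push_back(static_cast<long long>(V) * V);
    for (long long S : Scratch)
      Total += S;
  }
  return Total;
}
```

Both variants compute the same result; only the number of allocator calls differs. Whether anything like the first variant actually occurs in LLD's hot paths is exactly the open question, and a profiler would have to answer it.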

Thanks,
   Christopher Tetreault

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Mitch Phillips via llvm-dev
Sent: Tuesday, July 7, 2020 3:03 PM
To: Zachary Turner <zturner at roblox.com>
Cc: LLVM Dev <llvm-dev at lists.llvm.org>; cfe-dev at lists.llvm.org
Subject: [EXT] Re: [llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows

> If I use clang with -fsanitize=address to build my program, and then run my program, what difference does it make for the execution of my program whether the compiler itself was instrumented or not?

Yes, it doesn't make a difference to your final executable whether the compiler was built with ASan or not.

> Do you mean that ASAN runtime itself should be instrumented, since your program loads that at runtime?

Sanitizer runtimes aren't instrumented with sanitizers :).

-------

To be clear, we're talking about replacing the runtime allocator for clang/LLD/etc., right? We're not talking about replacing the default allocator for -O0 executables?

In either instance, using the ASan allocator (for either clang or executables) is possible, but won't provide any of the bug detection capabilities you describe without also ensuring that clang/your executable is built with ASan instrumentation (-fsanitize=address implies both "replace my allocator" and "instrument my code").

On Tue, Jul 7, 2020 at 2:53 PM Zachary Turner <zturner at roblox.com> wrote:
I hadn't heard this before.  If I use clang with -fsanitize=address to build my program, and then run my program, what difference does it make for the execution of my program whether the compiler itself was instrumented or not?  Do you mean that ASAN runtime itself should be instrumented, since your program loads that at runtime?

On Tue, Jul 7, 2020 at 2:04 PM Mitch Phillips <mitchp at google.com> wrote:
Bear in mind that the ASan allocator isn't particularly suited to detecting memory corruption unless you compile LLVM/Clang with ASan instrumentation as well. I don't imagine anybody would be proposing making the debug build for Windows be ASan-ified by default.

On Tue, Jul 7, 2020 at 1:49 PM Adrian McCarthy via llvm-dev <llvm-dev at lists.llvm.org> wrote:
Asan and the Debug CRT take different approaches, but the problems they cover largely overlap.

Both help with detection of errors like buffer overrun, double free, use after free, etc.  Asan generally gives you more immediate feedback on those, but you pay a higher price in performance.  Debug CRT lets you do some trade off between the performance hit and how soon it detects problems.

Asan documentation says leak detection is experimental on Windows, while the Debug CRT leak detection is mature and robust (and can be nearly automatic in debug builds).  By adding a couple calls, you can do finer grained leak detection than checking what remains when the program exits.
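The "couple calls" are not named in this message. A reasonable guess is the Debug CRT checkpoint functions from <crtdbg.h> (`_CrtMemCheckpoint`, `_CrtMemDifference`, `_CrtMemDumpStatistics`), so the sketch below assumes those. The function names `doWork` and `leaksAcrossCall` and the macro `EXAMPLE_HAS_DEBUG_CRT` are hypothetical. The Debug CRT path only compiles with MSVC-compatible toolchains in a debug configuration (`_DEBUG` defined); elsewhere the sketch falls back to a stub that reports nothing.

```cpp
#include <cstdlib>

// The Debug CRT heap is only available with the MSVC debug runtime.
#if defined(_MSC_VER) && defined(_DEBUG)
#include <crtdbg.h>
#define EXAMPLE_HAS_DEBUG_CRT 1
#else
#define EXAMPLE_HAS_DEBUG_CRT 0
#endif

// Hypothetical code under test: deliberately leaks 64 bytes when Leak is true.
void doWork(bool Leak) {
  void *P = std::malloc(64);
  if (!Leak)
    std::free(P);
}

// Returns true if the debug heap reports a significant difference between
// the state before and after doWork(). Without the Debug CRT this is a stub
// that runs doWork() and always returns false.
bool leaksAcrossCall(bool Leak) {
#if EXAMPLE_HAS_DEBUG_CRT
  _CrtMemState Before, After, Diff;
  _CrtMemCheckpoint(&Before); // snapshot of the debug heap
  doWork(Leak);
  _CrtMemCheckpoint(&After);
  if (_CrtMemDifference(&Diff, &Before, &After)) {
    _CrtMemDumpStatistics(&Diff); // report what was left allocated
    return true;
  }
  return false;
#else
  doWork(Leak);
  return false;
#endif
}
```

Bracketing a single call this way narrows a leak down to one region of code, rather than only learning at process exit that something leaked somewhere.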

Debug CRT lets you hook all of the malloc calls if you want, so you can extend it for your own types of tracking and bug detection.  But I don't think that feature is often used.

Windows's Application Verifier (AppVerifier) is cool and powerful.  I cannot remember for sure, but I think some of its features might depend on the Debug CRT.  One thing it can do is simulate allocation failures so you can test your program's recovery code, but most programs nowadays assume memory allocation never fails and will just crash if it ever does.

On Tue, Jul 7, 2020 at 10:25 AM Zachary Turner via llvm-dev <llvm-dev at lists.llvm.org> wrote:
Note that ASAN support is present on Windows now.  Does the Debug CRT provide any features that are not better served by ASAN?

On Tue, Jul 7, 2020 at 9:44 AM Chris Tetreault via llvm-dev <llvm-dev at lists.llvm.org> wrote:
For release builds, I think this is fine. However, for debug builds, the Windows allocator provides a lot of built-in functionality for debugging memory issues that I would be very sad to lose. Therefore, I would request that:

  1.  This be added as a configuration option to select either the new allocator or the Windows allocator
  2.  The Windows allocator be used by default in debug builds

Ideally, since you’re doing this work, you’d implement it in such a way that it’s fairly easy for anybody to use whatever allocator they want when building LLVM (on any platform, not just Windows), and it’s not just hardcoded to the system allocator vs. whatever allocator ends up getting added. However, as long as I can use the Windows debug allocator, I’m happy.
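One way to picture such a switch is a compile-time shim, sketched below. This is not how the demo patch works and is not an existing LLVM option: the macro `EXAMPLE_USE_CUSTOM_ALLOCATOR` and the names `toolMalloc`, `toolFree`, `example_custom_malloc`, `example_custom_free`, and `usesCustomAllocator` are all hypothetical. With the macro undefined, which is what a debug build would keep by default under the request above, the system (CRT) allocator is used.

```cpp
#include <cstddef>
#include <cstdlib>

// EXAMPLE_USE_CUSTOM_ALLOCATOR is a hypothetical macro that a build
// configuration could define to opt in to a replacement allocator.
#if defined(EXAMPLE_USE_CUSTOM_ALLOCATOR)

// Hypothetical entry points supplied by whichever allocator is linked in.
void *example_custom_malloc(std::size_t Size);
void example_custom_free(void *Ptr);

inline void *toolMalloc(std::size_t Size) { return example_custom_malloc(Size); }
inline void toolFree(void *Ptr) { example_custom_free(Ptr); }
constexpr bool usesCustomAllocator() { return true; }

#else

// Default: the system allocator, which on Windows debug builds is the
// CRT debug heap.
inline void *toolMalloc(std::size_t Size) { return std::malloc(Size); }
inline void toolFree(void *Ptr) { std::free(Ptr); }
constexpr bool usesCustomAllocator() { return false; }

#endif
```

A shim like this only covers call sites that go through it. Replacing the allocator for every malloc and operator new in a tool requires overriding the CRT symbols themselves, which is a different mechanism.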

Thanks,
   Christopher Tetreault

From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Alexandre Ganea via cfe-dev
Sent: Wednesday, July 1, 2020 9:20 PM
To: cfe-dev at lists.llvm.org; LLVM Dev <llvm-dev at lists.llvm.org>
Subject: [EXT] [cfe-dev] RFC: Replacing the default CRT allocator on Windows

Hello,

I was wondering how folks feel about replacing the default Windows CRT allocator in Clang, LLD, and possibly other LLVM tools.

The CRT heap allocator on Windows doesn’t scale well on machines with large core counts. Any multi-threaded workload in LLVM that allocates often is impacted by this. As a result, link times with ThinLTO are extremely slow on Windows. We’re observing performance inversely proportional to the number of cores: the more cores the machine has, the slower ThinLTO linking gets.

We’ve replaced the CRT heap allocator with modern lock-free thread-cache allocators such as rpmalloc (Unlicense), mimalloc (MIT licence) or snmalloc (MIT licence). The runtime performance is an order of magnitude faster.

Time to link clang.exe with LLD and -flto on 36-core:
  Windows CRT heap allocator: 38 min 47 sec
  mimalloc: 2 min 22 sec
  rpmalloc: 2 min 15 sec
  snmalloc: 2 min 19 sec
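
As a quick check of the "order of magnitude" claim, the sketch below converts the timings listed above to seconds and computes the speedup over the CRT heap allocator. The helper names are hypothetical; the timings are the ones quoted in this message.

```cpp
#include <cstdio>

// Converts a "M min S sec" timing to total seconds.
constexpr int toSeconds(int Minutes, int Seconds) {
  return Minutes * 60 + Seconds;
}

// Speedup of a candidate over the baseline: baseline time / candidate time.
constexpr double speedup(int BaselineSec, int CandidateSec) {
  return static_cast<double>(BaselineSec) / CandidateSec;
}

// Timings quoted in the message above (36-core machine).
constexpr int CrtSec = toSeconds(38, 47);
constexpr int MimallocSec = toSeconds(2, 22);
constexpr int RpmallocSec = toSeconds(2, 15);
constexpr int SnmallocSec = toSeconds(2, 19);

// Prints one line per allocator with its speedup over the CRT heap.
void printSpeedups() {
  std::printf("mimalloc: %.1fx\n", speedup(CrtSec, MimallocSec));
  std::printf("rpmalloc: %.1fx\n", speedup(CrtSec, RpmallocSec));
  std::printf("snmalloc: %.1fx\n", speedup(CrtSec, SnmallocSec));
}
```

All three replacements come out at roughly 16x to 17x faster than the CRT heap for this link, and the gap between the three is small compared with the gap to the CRT allocator.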

We’ve been running in production with a downstream fork of LLVM + rpmalloc for more than a year. However, when cross-compiling for some specific game platforms, we’re using other downstream forks of LLVM that we can’t change.

Two questions arise:

  1.  The licencing. Should we embed one of these allocators into the LLVM tree, or keep them separate, out of the tree?
  2.  If the answer to the above question is to embed one, then given the tremendous performance speedup, should we embed one of these allocators into Clang/LLD builds by default? (On Windows only, considering that Windows doesn’t have an LD_PRELOAD mechanism.)

Please see demo patch here: https://reviews.llvm.org/D71786

Thank you in advance for the feedback!
Alex.

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev