<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
{mso-style-name:msonormal;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
p.gmail-m1170423376599759767m-664316201346764609msolistparagraph, li.gmail-m1170423376599759767m-664316201346764609msolistparagraph, div.gmail-m1170423376599759767m-664316201346764609msolistparagraph
{mso-style-name:gmail-m_1170423376599759767m-664316201346764609msolistparagraph;
mso-margin-top-alt:auto;
margin-right:0cm;
mso-margin-bottom-alt:auto;
margin-left:0cm;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}
span.EmailStyle19
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}
.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
font-family:"Calibri",sans-serif;
mso-fareast-language:EN-US;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
/* List Definitions */
@list l0
{mso-list-id:93862618;
mso-list-template-ids:-39416586;}
@list l0:level1
{mso-level-tab-stop:36.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level2
{mso-level-tab-stop:72.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level3
{mso-level-tab-stop:108.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level4
{mso-level-tab-stop:144.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level5
{mso-level-tab-stop:180.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level6
{mso-level-tab-stop:216.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level7
{mso-level-tab-stop:252.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level8
{mso-level-tab-stop:288.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l0:level9
{mso-level-tab-stop:324.0pt;
mso-level-number-position:left;
text-indent:-18.0pt;}
@list l1
{mso-list-id:889608890;
mso-list-template-ids:-1487913164;}
ol
{margin-bottom:0cm;}
ul
{margin-bottom:0cm;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-CA" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Hi Adrian!<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">I completely agree with you, we should be clear on the wording. This proposal is about replacing both the MS CRT malloc layer *<b>AND</b>* the HeapAlloc layer.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal" style="margin-left:36.0pt"><span style="font-size:7.5pt;font-family:"Courier New"">C++ --> {mimalloc|rpmalloc|snmalloc} --> VirtualAlloc (Win32)</span><o:p></o:p></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">The bottom line is that libraries in LLVM allocate a lot, lots of small allocations. I think that should be solved on its own, perhaps by using BumpPtrAllocator a bit more. But this is fragile, any
new system in LLVM could later add lots of allocations, and induce heap locking further down the line.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">The MS CRT malloc layer is a very thin wrapper over HeapAlloc. But the issue we want to solve is contention at the HeapAlloc layer level. There’s only one heap by default for the entire process,
every thread allocating needs to “aquire” the heap, thus the lock. On the short term, I don’t see an easy way to solve this, except by bypassing HeapAlloc completely.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US">Reid mentioned that a Google toolchain/platform team experimented with tcmalloc3 (the one published here:
</span><a href="https://github.com/google/tcmalloc">https://github.com/google/tcmalloc</a> - not the one in gperftools) integrated into LLVM and got similar results to ones in the review below.<span style="mso-fareast-language:EN-US"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span style="mso-fareast-language:EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><b>De :</b> Adrian McCarthy <amccarth@google.com> <br>
<b>Envoyé :</b> July 6, 2020 11:52 AM<br>
<b>À :</b> Alexandre Ganea <alexandre.ganea@ubisoft.com><br>
<b>Cc :</b> James Y Knight <jyknight@google.com>; LLVM Dev <llvm-dev@lists.llvm.org>; Clang Dev <cfe-dev@lists.llvm.org><br>
<b>Objet :</b> Re: [llvm-dev] [cfe-dev] RFC: Replacing the default CRT allocator on Windows<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">Let's make sure we're all working from the set set of assumptions. There are several layers of allocators in a Windows application, and I think it's worth being explicit, since I've already seen comments referring to multiple different
"allocators."<o:p></o:p></p>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The C++ allocators and new and delete operators are generally built on malloc from the C run-time library. malloc implementations typically rely on process heaps (Win32 HeapAlloc) and/or go directly to virtual memory (Win32 VirtualAlloc).
HeapAlloc itself uses VirtualAlloc.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New"">C++ --> malloc (CRT) --> HeapAlloc (Win32) --> VirtualMalloc (Win32)</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New""> | ^</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><span style="font-size:7.5pt;font-family:"Courier New""> +---------------------------------------+</span><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">This proposal is talking about replacing the malloc layer.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The "low-fragmentation heap" and "segment heap" are features of process heaps (HeapAlloc).<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Since those are in different layers, there's some orthogonality there. If your malloc implementation uses HeapAlloc, then tweaking process heap features may affect performance assuming the bottleneck isn't in malloc itself. If you replace
the malloc implementation with one that completely bypasses the process heap by going directly to virtual memory, that's a horse of a different color.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">The only Windows app I worked on that switched malloc implementations was a cross-platform app. On every platform, tcmalloc was a win,
<i>except </i>for Windows. We kept Windows on the default Microsoft implementation because it performed better. (The app was a multithreaded real-time graphics simulation. The number of threads was low, maybe 4 or 5.)<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Thu, Jul 2, 2020 at 8:57 PM Alexandre Ganea via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0cm;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Thanks for the suggestion James, it reduces the commit by about ~900 MB (14,9 GB -> 14 GB).<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Unfortunately it does not solve the performance problem. The heap is global to the application and thread-safe, so every malloc/free locks it, which evidently doesn’t scale. We
could manually create thread-local heaps, but I didn’t want to go there. Ultimately allocated blocks need to share ownership between threads, and at that point it’s like re-writing a new allocator. I suppose most non-Windows platforms already have lock-free
thread-local arenas, which probably explains why this issue has gone (mostly) unnoticed.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><b>De :</b> James Y Knight <<a href="mailto:jyknight@google.com" target="_blank">jyknight@google.com</a>>
<br>
<b>Envoyé :</b> July 2, 2020 6:08 PM<br>
<b>À :</b> Alexandre Ganea <<a href="mailto:alexandre.ganea@ubisoft.com" target="_blank">alexandre.ganea@ubisoft.com</a>><br>
<b>Cc :</b> Clang Dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>>; LLVM Dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>
<b>Objet :</b> Re: [cfe-dev] RFC: Replacing the default CRT allocator on Windows<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Have you tried Microsoft's new "segment heap" implementation? Only apps that opt-in get it at the moment. Reportedly edge and chromium are getting large memory savings from switching,
but I haven't seen performance comparisons.<o:p></o:p></p>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">If the performance is good, seems like that might be the simplest choice <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><a href="https://docs.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype" target="_blank">https://docs.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype</a><o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"><a href="https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf" target="_blank">https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf</a><o:p></o:p></p>
</div>
</div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">On Thu, Jul 2, 2020, 12:20 AM Alexandre Ganea via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0cm 0cm 0cm 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0cm;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Hello,<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I was wondering how folks were feeling about replacing the default Windows CRT allocator in Clang, LLD and other LLVM tools possibly.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The CRT heap allocator on Windows doesn’t scale well on large core count machines. Any multi-threaded workload in LLVM that allocates often is impacted by this. As a result, link
times with ThinLTO are extremely slow on Windows. We’re observing performance inversely proportional to the number of cores. The more cores the machines has, the slower ThinLTO linking gets.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">We’ve replaced the CRT heap allocator by modern lock-free thread-cache allocators such as rpmalloc (unlicence), mimalloc (MIT licence) or snmalloc (MIT licence). The runtime performance
is an order of magnitude faster.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Time to link clang.exe with LLD and -flto on 36-core:<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> Windows CRT heap allocator: 38 min 47 sec<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> mimalloc: 2 min 22 sec<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> rpmalloc: 2 min 15 sec<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> snmalloc: 2 min 19 sec<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">We’re running in production with a downstream fork of LLVM + rpmalloc for more than a year. However when cross-compiling some specific game platforms we’re using other downstream
forks of LLVM that we can’t change.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Two questions arise:<o:p></o:p></p>
<ol start="1" type="1">
<li class="gmail-m1170423376599759767m-664316201346764609msolistparagraph" style="mso-list:l0 level1 lfo3">
The licencing. Should we embed one of these allocators into the LLVM tree, or keep them separate out-of-the-tree?
<o:p></o:p></li><li class="gmail-m1170423376599759767m-664316201346764609msolistparagraph" style="mso-list:l0 level1 lfo3">
If the answer for above question is “yes”, given the tremendous performance speedup, should we embed one of these allocators into Clang/LLD builds by default? (on Windows only) Considering that Windows doesn’t have a LD_PRELOAD mechanism.<o:p></o:p></li></ol>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Please see demo patch here:
<a href="https://reviews.llvm.org/D71786" target="_blank">https://reviews.llvm.org/D71786</a><o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Thank you in advance for the feedback!<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Alex.<o:p></o:p></p>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
</div>
<p class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">_______________________________________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><o:p></o:p></p>
</blockquote>
</div>
</div>
</div>
<p class="MsoNormal">_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></p>
</blockquote>
</div>
</div>
</body>
</html>