<div dir="ltr"><div dir="auto"><div dir="auto"><a href="https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf" rel="noreferrer" target="_blank">https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals-wp.pdf</a> seems to be the paper that goes with the slides I linked before. It says that there's some sort of adaptive mechanism that allocates per-CPU "affinity slots" if it detects lots of lock contention, which seems like it <i>ought</i> to have good multithreaded behavior.<br></div><div dir="auto"><br></div><div>I see in your initial email that the sample backtrace is in "free", not in an allocation call. Is that just an example, or is "free" where effectively all the contention is? If the latter, I wonder if we're hitting some pathological edge case, e.g. allocating on one thread and then freeing on different threads, or something along those lines.</div><div><br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jul 2, 2020, 11:56 PM Alexandre Ganea <<a href="mailto:alexandre.ganea@ubisoft.com" rel="noreferrer" target="_blank">alexandre.ganea@ubisoft.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div lang="EN-CA">

<div>

<p class="MsoNormal"><span>Thanks for the suggestion, James. It reduces committed memory by about 900 MB (14.9 GB -> 14 GB).<u></u><u></u></span></p>

<p class="MsoNormal"><span><u></u> <u></u></span></p>

<p class="MsoNormal"><span>Unfortunately it does not solve the performance problem. The heap is global to the application and thread-safe, so every malloc/free locks it, which evidently doesn’t scale. We could manually create

 thread-local heaps, but I didn’t want to go there. Ultimately allocated blocks need to share ownership between threads, and at that point it’s like re-writing a new allocator. I suppose most non-Windows platforms already have lock-free thread-local arenas,

 which probably explains why this issue has gone (mostly) unnoticed.<u></u><u></u></span></p>
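<p class="MsoNormal"><span>As a rough illustration of the difference between a single locked heap and per-thread caches, here is a toy Python model. It is only a sketch under stated assumptions: the names GlobalHeap, ThreadCachedHeap, churn and run are hypothetical, the model counts lock acquisitions rather than measuring time, and it does not reflect how the Windows CRT heap or any of the real allocators is implemented.</span></p>

<pre>
```python
import threading


class GlobalHeap:
    """Toy model (hypothetical): one shared free list, every call takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = []
        self.lock_acquisitions = 0

    def alloc(self):
        with self._lock:
            self.lock_acquisitions += 1
            return self._free.pop() if self._free else bytearray(64)

    def free(self, block):
        with self._lock:
            self.lock_acquisitions += 1
            self._free.append(block)


class ThreadCachedHeap:
    """Toy model (hypothetical): a private free list per thread in front of a shared heap."""

    def __init__(self, backing):
        self._backing = backing
        self._local = threading.local()

    def _cache(self):
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = self._local.cache = []
        return cache

    def alloc(self):
        cache = self._cache()
        # Only touch the shared, locked heap when the private cache is empty.
        return cache.pop() if cache else self._backing.alloc()

    def free(self, block):
        # The block is kept by whichever thread frees it; no lock is taken.
        self._cache().append(block)


def churn(heap, rounds):
    """Allocate and immediately free one block, `rounds` times."""
    for _ in range(rounds):
        heap.free(heap.alloc())


def run(heap, threads=4, rounds=1000):
    """Run `churn` on several threads and wait for all of them."""
    workers = [threading.Thread(target=churn, args=(heap, rounds)) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
```
</pre>

<p class="MsoNormal"><span>In this model, running the same workload against GlobalHeap takes the lock on every call, while ThreadCachedHeap only reaches the shared lock when a thread's cache is empty. The model sidesteps the hard part mentioned above: a block freed on a different thread simply stays in that thread's cache, which a real allocator cannot do so casually.</span></p>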

<p class="MsoNormal"><span><u></u> <u></u></span></p>

<p class="MsoNormal"><span><u></u> <u></u></span></p>

<p class="MsoNormal"><b><span lang="FR">From:</span></b><span lang="FR"> James Y Knight <<a href="mailto:jyknight@google.com" rel="noreferrer noreferrer" target="_blank">jyknight@google.com</a>>

<br>

<b>Sent:</b> July 2, 2020 6:08 PM<br>

<b>To:</b> Alexandre Ganea <<a href="mailto:alexandre.ganea@ubisoft.com" rel="noreferrer noreferrer" target="_blank">alexandre.ganea@ubisoft.com</a>><br>

<b>Cc:</b> Clang Dev <<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">cfe-dev@lists.llvm.org</a>>; LLVM Dev <<a href="mailto:llvm-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">llvm-dev@lists.llvm.org</a>><br>

<b>Subject:</b> Re: [cfe-dev] RFC: Replacing the default CRT allocator on Windows<u></u><u></u></span></p>

<p class="MsoNormal"><u></u> <u></u></p>

<div>

<p class="MsoNormal">Have you tried Microsoft's new "segment heap" implementation? Only apps that opt in get it at the moment. Reportedly Edge and Chromium are getting large memory savings from switching, but I haven't seen performance comparisons.<u></u><u></u></p>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal">If the performance is good, that seems like it might be the simplest choice.<u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal"><a href="https://docs.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype" rel="noreferrer noreferrer" target="_blank">https://docs.microsoft.com/en-us/windows/win32/sbscs/application-manifests#heaptype</a><u></u><u></u></p>

</div>

<div>

<p class="MsoNormal"><u></u> <u></u></p>

</div>

<div>

<p class="MsoNormal"><a href="https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf" rel="noreferrer noreferrer" target="_blank">https://www.blackhat.com/docs/us-16/materials/us-16-Yason-Windows-10-Segment-Heap-Internals.pdf</a><u></u><u></u></p>

</div>

</div>

<p class="MsoNormal"><u></u> <u></u></p>

<div>

<div>

<p class="MsoNormal">On Thu, Jul 2, 2020, 12:20 AM Alexandre Ganea via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" rel="noreferrer noreferrer" target="_blank">cfe-dev@lists.llvm.org</a>> wrote:<u></u><u></u></p>

</div>

<blockquote style="border-top:none;border-right:none;border-bottom:none;border-left:1pt solid rgb(204,204,204);padding:0cm 0cm 0cm 6pt;margin-left:4.8pt;margin-right:0cm">

<div>

<div>

<p class="MsoNormal"><span lang="FR-CA">Hello,</span><u></u><u></u></p>

<p class="MsoNormal"><span lang="FR-CA"> </span><u></u><u></u></p>

<p class="MsoNormal">I was wondering how folks were feeling about replacing the default Windows CRT allocator in Clang, LLD and other LLVM tools possibly.<u></u><u></u></p>

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">The CRT heap allocator on Windows doesn’t scale well on large core count machines. Any multi-threaded workload in LLVM that allocates often is impacted by this. As a result, link

 times with ThinLTO are extremely slow on Windows. We’re observing performance that is inversely proportional to the number of cores: the more cores the machine has, the slower ThinLTO linking gets.<u></u><u></u></p>

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">We’ve replaced the CRT heap allocator with modern lock-free thread-cache allocators such as rpmalloc (Unlicense), mimalloc (MIT licence) or snmalloc (MIT licence). Runtime performance

 improves by an order of magnitude.<u></u><u></u></p>

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">Time to link clang.exe with LLD and -flto on 36-core:<u></u><u></u></p>

<p class="MsoNormal">  Windows CRT heap allocator: 38 min 47 sec<u></u><u></u></p>

<p class="MsoNormal">  mimalloc: 2 min 22 sec<u></u><u></u></p>

<p class="MsoNormal">  rpmalloc: 2 min 15 sec<u></u><u></u></p>

<p class="MsoNormal">  snmalloc: 2 min 19 sec<u></u><u></u></p>
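<p class="MsoNormal">For reference, the speedup factors implied by these timings can be worked out with a few lines of Python. This is only arithmetic on the numbers quoted above; the helper name to_seconds is made up for the example.</p>

<pre>
```python
def to_seconds(minutes, seconds):
    """Convert a 'min sec' timing into seconds."""
    return minutes * 60 + seconds


# Link times quoted above for clang.exe with LLD and -flto on a 36-core machine.
link_times = {
    "Windows CRT heap allocator": to_seconds(38, 47),
    "mimalloc": to_seconds(2, 22),
    "rpmalloc": to_seconds(2, 15),
    "snmalloc": to_seconds(2, 19),
}

baseline = link_times["Windows CRT heap allocator"]
speedups = {
    name: baseline / seconds
    for name, seconds in link_times.items()
    if name != "Windows CRT heap allocator"
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than the CRT heap")
# mimalloc: 16.4x faster than the CRT heap
# rpmalloc: 17.2x faster than the CRT heap
# snmalloc: 16.7x faster than the CRT heap
```
</pre>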

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">We’ve been running a downstream fork of LLVM + rpmalloc in production for more than a year. However, when cross-compiling for some specific game platforms, we use other downstream

 forks of LLVM that we can’t change.<u></u><u></u></p>

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">Two questions arise:<u></u><u></u></p>

<ol start="1" type="1">

<li>

Licencing: should we embed one of these allocators into the LLVM tree, or keep them separate, out of tree?

<u></u><u></u></li><li>

If the answer to the question above is to embed one, then given the tremendous speedup, should we embed one of these allocators into Clang/LLD builds by default (on Windows only), considering that Windows doesn’t have an LD_PRELOAD mechanism?<u></u><u></u></li></ol>

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">Please see demo patch here:

<span lang="FR-CA"><a href="https://reviews.llvm.org/D71786" rel="noreferrer noreferrer" target="_blank"><span lang="EN-CA">https://reviews.llvm.org/D71786</span></a></span><u></u><u></u></p>

<p class="MsoNormal"> <u></u><u></u></p>

<p class="MsoNormal">Thank you in advance for the feedback!<u></u><u></u></p>

<p class="MsoNormal">Alex.<u></u><u></u></p>

<p class="MsoNormal"> <u></u><u></u></p>

</div>

</div>


</blockquote>

</div>

</div>

</div>

</blockquote></div>