<div dir="ltr"><div dir="ltr"></div>Hi Wenlei,<div><br></div><div>Thanks for the comments! David answered the first question, I do have some comments on the second one though.</div><div>Teresa</div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <<a href="mailto:davidxl@google.com">davidxl@google.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div dir="ltr"><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)"><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <<a href="mailto:wenlei@fb.com" target="_blank">wenlei@fb.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-US">
<div>
<p class="MsoNormal">This sounds very useful. We’ve improved and used <a href="https://www.youtube.com/watch?v=fm47XsATelI" target="_blank">
memoro</a> for memory profiling and analysis, and we are also looking for ways to leverage memory profile for PGO/FDO. I think having a common profiling infrastructure for analysis tooling as well as profile guided optimizations is good design, and having it
in LLVM is also helpful. Very interested in the tooling and optimization that comes after the profiler.
<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Two questions: <u></u><u></u></p>
<ul style="margin-top:0in" type="disc">
<li style="margin-left:0in">How does the profiling overhead look? Is that similar to ASAN overhead from what you’ve seen, which would be higher than PGO instrumentation? Asking because I’m wondering if any PGO
training setup can be used directly for the new heap profiling.</li></ul></div></div></blockquote><div><br></div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)">It is built on top of ASAN runtime, but the overhead can be made much lower by using counter update consolidation -- all fields sharing the same shadow counter can be merged, and aggressive loop sinking/hoisting can be done.</div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)">The goal is to integrate this with the PGO instrumentation. The PGO instrumentation overhead can be further reduced with sampling technique (Rong Xu has a patch to be submitted).</div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)"></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div lang="EN-US"><div><ul style="margin-top:0in" type="disc"><li style="margin-left:0in"> <u></u><u></u></li><li style="margin-left:0in">I’m not familiar with how sanitizer handles stack trace, but for getting most accurate calling context (use FP rather than dwarf), I guess frame pointer omission and tail call opt
etc. need to be turned off? Is that going to be implied by <span style="font-family:"Courier New";color:black">
-fheapprof</span>?</li></ul></div></div></blockquote><div><br></div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)">Kostya can provide detailed answers to these questions.</div></div></div></blockquote><div><br></div><div>I'm not aware that the -fsanitize=* options disable these, but I know that in our environment we disable frame pointer omission when setting up ASAN builds, and I am arranging for heap profiling builds to do the same. I'm not sure whether we want to do this within clang itself; I would be interested in Kostya's opinion. I can't see anywhere that we disable tail call optimizations for ASAN, though I might have missed it.</div><div><br></div><div>Thanks,</div><div>Teresa</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div class="gmail_quote"><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)"><br></div><div style="font-family:monospace;font-size:small;color:rgb(0,0,0)">David</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div lang="EN-US"><div><ul style="margin-top:0in" type="disc"><li style="margin-left:0in"><u></u><u></u></li></ul>
<p class="MsoNormal"><u></u> <u></u></p>
<p class="MsoNormal">Thanks,<u></u><u></u></p>
<p class="MsoNormal">Wenlei<u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<div style="border-right:none;border-bottom:none;border-left:none;border-top:1pt solid rgb(181,196,223);padding:3pt 0in 0in">
<p class="MsoNormal"><b><span style="font-size:12pt;color:black">From: </span></b><span style="font-size:12pt;color:black">llvm-dev <<a href="mailto:llvm-dev-bounces@lists.llvm.org" target="_blank">llvm-dev-bounces@lists.llvm.org</a>> on behalf of Teresa Johnson via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>><br>
<b>Reply-To: </b>Teresa Johnson <<a href="mailto:tejohnson@google.com" target="_blank">tejohnson@google.com</a>><br>
<b>Date: </b>Wednesday, June 24, 2020 at 4:58 PM<br>
<b>To: </b>llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>>, Kostya Serebryany <<a href="mailto:kcc@google.com" target="_blank">kcc@google.com</a>>, Evgenii Stepanov <<a href="mailto:eugenis@google.com" target="_blank">eugenis@google.com</a>>, Vitaly Buka <<a href="mailto:vitalybuka@google.com" target="_blank">vitalybuka@google.com</a>><br>
<b>Cc: </b>David Li <<a href="mailto:davidxl@google.com" target="_blank">davidxl@google.com</a>><br>
<b>Subject: </b>[llvm-dev] RFC: Sanitizer-based Heap Profiler<u></u><u></u></span></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Hi all,<u></u><u></u></p>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">I've included an RFC for a heap profiler design I've been working on in conjunction with David Li. Please send any questions or feedback. For sanitizer folks, one area of feedback is on refactoring some of the *ASAN shadow setup code (see
the Shadow Memory section).<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<div>
<p class="MsoNormal">Thanks,<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal">Teresa<u></u><u></u></p>
</div>
<div>
<p class="MsoNormal"> <u></u><u></u></p>
<p style="margin-right:0in;margin-bottom:3pt;margin-left:0in">
<span style="font-size:26pt;font-family:Arial,sans-serif;color:black">RFC: Sanitizer-based Heap Profiler</span><u></u><u></u></p>
<h1 style="margin-right:0in;margin-bottom:6pt;margin-left:0in">
<span style="font-size:20pt;font-family:Arial,sans-serif;color:black;font-weight:normal">Summary</span><u></u><u></u></h1>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">This document provides an overview of an LLVM Sanitizer-based heap profiler design.</span><u></u><u></u></p>
<h1 style="margin-right:0in;margin-bottom:6pt;margin-left:0in">
<span style="font-size:20pt;font-family:Arial,sans-serif;color:black;font-weight:normal">Motivation</span><u></u><u></u></h1>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The objective of heap memory profiling is to collect critical runtime information associated with heap memory references and information on heap memory allocations.
The profile information will be used first for tooling, and subsequently to guide the compiler optimizer and allocation runtime to lay out heap objects with improved spatial locality. As a result, DTLB and cache utilization will be improved, and program IPC
(performance) will be increased due to reduced TLB and cache misses. More details on the heap profile guided optimizations will be shared in the future.</span><u></u><u></u></p>
<h1 style="margin-right:0in;margin-bottom:6pt;margin-left:0in">
<span style="font-size:20pt;font-family:Arial,sans-serif;color:black;font-weight:normal">Overview</span><u></u><u></u></h1>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The profiler is based on compiler-inserted instrumentation of load and store accesses, and utilizes runtime support to monitor heap allocations and profile
data. The target consumer of the heap memory profile information is initially tooling and ultimately automatic data layout optimizations performed by the compiler and/or allocation runtime (with the support of new allocation runtime APIs).</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Each memory address is mapped to
</span><a href="https://en.wikipedia.org/wiki/Shadow_memory" target="_blank"><span style="font-family:Arial,sans-serif">Shadow
Memory</span></a><span style="font-family:Arial,sans-serif;color:black">, similar to the approach used by the
</span><a href="https://github.com/google/sanitizers/wiki/AddressSanitizer" target="_blank"><span style="font-family:Arial,sans-serif">Address Sanitizer</span></a><span style="font-family:Arial,sans-serif;color:black"> (ASAN). Unlike ASAN, which maps each 8 bytes of memory
to 1 byte of shadow, the heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow location implements the profile counter (incremented on accesses to the corresponding memory). This granularity was chosen to help avoid counter overflow, but
mapping 32 bytes to 4 bytes may also be feasible. To avoid aliasing of shadow memory for different allocations, we must choose a minimum alignment carefully. As discussed further below, we can attain a 32-byte minimum alignment, instead of a 64-byte
alignment, by storing necessary heap information for each allocation in a 32-byte header block.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The compiler instruments each load and store to increment the associated shadow memory counter, in order to determine hotness.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The heap profiler runtime is responsible for tracking allocations and deallocations, including the stack at each allocation, and information such as the allocation
size and other statistics. I have implemented a prototype built using a stripped down and modified version of ASAN; however, the final implementation will be a separate library utilizing sanitizer_common components.</span><u></u><u></u></p>
<h2 style="margin-right:0in;margin-bottom:6pt;margin-left:0in">
<span style="font-size:16pt;font-family:Arial,sans-serif;color:black;font-weight:normal">Compiler</span><u></u><u></u></h2>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">A simple HeapProfiler instrumentation pass instruments interesting memory accesses (loads, stores, atomics) with a simple load, increment, and store of the associated
shadow memory location (computed via a mask and shift to map 64 bytes to 8 bytes of shadow, plus an add of the shadow offset). The handling is very similar to, and based on, the ASAN instrumentation pass, with slightly different instrumentation code.</span><u></u><u></u></p>
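<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">As a rough illustration of the mask, shift, and add described above, the following sketch computes a shadow counter address and simulates the load, increment, store sequence on a toy shadow array. The names kBlockBytes, kShadowShift, ShadowAddress, and ToyShadow are hypothetical and are not taken from the prototype; the shadow base value is arbitrary.</span></p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical names and values for illustration; the RFC does not define them.
constexpr uint64_t kBlockBytes = 64;  // application bytes covered by one counter
constexpr uint64_t kShadowShift = 3;  // 64 bytes of memory -> 8 bytes of shadow

// Address of the 8-byte counter for `addr`: mask off the low 6 bits, shift
// right by 3, then add the base of the shadow region.
constexpr uint64_t ShadowAddress(uint64_t addr, uint64_t shadow_base) {
  return ((addr & ~(kBlockBytes - 1)) >> kShadowShift) + shadow_base;
}

// Toy stand-in for the shadow of [region_start, region_start + size).
class ToyShadow {
 public:
  ToyShadow(uint64_t region_start, uint64_t size)
      : region_start_(region_start),
        counters_((size + kBlockBytes - 1) / kBlockBytes, 0) {
    assert(region_start % kBlockBytes == 0);
  }

  // What the instrumentation conceptually does before each load or store:
  // load the counter, increment it, store it back.
  void RecordAccess(uint64_t addr) {
    uint64_t index = IndexFor(addr);
    assert(index < counters_.size());
    counters_[index] += 1;
  }

  uint64_t CounterFor(uint64_t addr) const { return counters_[IndexFor(addr)]; }

 private:
  uint64_t IndexFor(uint64_t addr) const {
    assert(addr >= region_start_);
    return ShadowAddress(addr - region_start_, 0) / sizeof(uint64_t);
  }

  uint64_t region_start_;
  std::vector<uint64_t> counters_;
};
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">In this sketch, any two addresses in the same 64-byte block yield the same counter address, and adjacent blocks yield counters 8 bytes apart.</span></p>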
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Various techniques can be used to reduce the overhead by aggressively coalescing counter updates (e.g. for accesses known to be in
the same 32-byte block, given the 32-byte alignment, or across possible aliases, since we don’t care about the dereferenced values). </span><u></u><u></u></p>
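<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">One way to picture the coalescing: if several accesses are made at known offsets from a pointer that is known to be 32-byte aligned, all accesses that land in the same 32-byte block must share a shadow counter, so they can be replaced by a single add. The sketch below only models that grouping step; CoalesceUpdates is a hypothetical helper, not part of the proposed pass.</span></p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

constexpr uint64_t kMinAlignment = 32;  // minimum allocation alignment in the RFC

// Given byte offsets of accesses relative to a 32-byte-aligned base pointer,
// return one (32-byte block index -> increment amount) pair per distinct block.
// A 32-byte-aligned 32-byte block never straddles a 64-byte boundary, so all
// accesses within it map to the same 8-byte shadow counter.
std::map<uint64_t, uint64_t> CoalesceUpdates(const std::vector<uint64_t>& offsets) {
  std::map<uint64_t, uint64_t> updates;
  for (uint64_t offset : offsets) {
    updates[offset / kMinAlignment] += 1;
  }
  return updates;
}
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">For example, accesses at offsets 0, 8, 16, and 40 would need two counter updates (adding 3 and 1) rather than four separate increments.</span></p>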
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Additionally, the Clang driver needs to be set up to link with the runtime library, much as it does with the sanitizers.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">A
</span><span style="font-family:"Courier New";color:black">-fheapprof</span><span style="font-family:Arial,sans-serif;color:black"> option is added to enable the instrumentation pass and runtime library linking. Similar to
</span><span style="font-family:"Courier New";color:black">-fprofile-generate</span><span style="font-family:Arial,sans-serif;color:black">,
</span><span style="font-family:"Courier New";color:black">-fheapprof</span><span style="font-family:Arial,sans-serif;color:black"> will accept an argument specifying the directory in which to write the profile.</span><u></u><u></u></p>
<h2 style="margin-right:0in;margin-bottom:6pt;margin-left:0in">
<span style="font-size:16pt;font-family:Arial,sans-serif;color:black;font-weight:normal">Runtime</span><u></u><u></u></h2>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The heap profiler runtime is responsible for tracking and reporting information about heap allocations and accesses, aggregated by allocation calling context,
for example the hotness, lifetime, and cpu affinity. </span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">A new heapprof library will be created within compiler-rt. It will leverage support within sanitizer_common, which already contains facilities like stack context
tracking, needed by the heap profiler.</span><u></u><u></u></p>
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">Shadow Memory</span><u></u><u></u></h3>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">There are some basic facilities in sanitizer_common for mmap’ing the shadow memory, but most of the existing setup lives in the ASAN and HWASAN libraries. In
the case of ASAN, there is support for both statically assigned shadow offsets (the default on most platforms), and for dynamically assigned shadow memory (implemented for Windows and currently also used for Android and iOS). According to kcc, recent experiments
show that the performance with a dynamic shadow is close to that with a static mapping. In fact, that is the only approach currently used by HWASAN. Given the simplicity, the heap profiler will be implemented with a dynamic shadow as well.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">There are a number of functions in ASAN and HWASAN related to setup of the shadow that are duplicated but very nearly identical, at least for Linux (which seems
to be the only OS flavor currently supported for HWASAN). Examples include ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN there is another nearly identical copy in PremapShadow, used by Android, whereas in HWASAN the premap handling is already
shared with the non-premap handling). Rather than make yet another copy of these mechanisms, I propose refactoring them into sanitizer_common versions. Like HWASAN, the initial version of the heap profiler will be supported for Linux only, but other OSes
can be added as needed, similar to ASAN.</span><u></u><u></u></p>
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">StackTrace and StackDepot</span><u></u><u></u></h3>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The sanitizer already contains support for obtaining and representing a stack trace in a StackTrace object, and storing it in the StackDepot which “efficiently
stores huge amounts of stack traces”. This is in the sanitizer_common subdirectory and the support is shared by ASAN and ThreadSanitizer. The StackDepot is essentially an unbounded hash table, where each StackTrace is assigned a unique id. ASAN stores this
id in the alloc_context_id field in each ChunkHeader (in the redzone preceding each allocation). Additionally, there is support for symbolizing and printing StackTrace objects.</span><u></u><u></u></p>
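<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">To make the id assignment concrete, here is a minimal toy model of a depot that hands out one id per distinct stack trace. It is not the sanitizer_common implementation or its API: ToyStackDepot, Put, and Get are hypothetical names, and an ordered map is used in place of a hash table purely for brevity.</span></p>

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// A stack trace modeled as a list of return addresses.
using Trace = std::vector<uintptr_t>;

// Toy model only: assigns each distinct trace a small unique id.
class ToyStackDepot {
 public:
  // Returns the id for `trace`, creating a new one the first time it is seen.
  uint32_t Put(const Trace& trace) {
    auto it = ids_.find(trace);
    if (it != ids_.end()) return it->second;
    // Ids start at 1 so that 0 can mean "no trace" in this sketch.
    uint32_t id = static_cast<uint32_t>(traces_.size()) + 1;
    ids_.emplace(trace, id);
    traces_.push_back(trace);
    return id;
  }

  // Returns the trace previously stored under `id`.
  const Trace& Get(uint32_t id) const {
    assert(id >= 1 && id <= traces_.size());
    return traces_[id - 1];
  }

 private:
  std::map<Trace, uint32_t> ids_;
  std::vector<Trace> traces_;
};
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Storing only the id in each allocation header keeps the per-allocation cost fixed regardless of stack depth, since repeated allocations from the same context reuse the same id.</span></p>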
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">ChunkHeader</span><u></u><u></u></h3>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The heap profiler needs to track several pieces of information for each allocation. Given the mapping of 64 bytes to 8 bytes of shadow, we can achieve a minimum
of 32-byte alignment by holding this information in a 32-byte header block preceding each allocation.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains information about the current allocation state, user requested size, allocation and
free thread ids, the allocation context id (representing the call stack at allocation, assigned by the StackDepot as described above), and misc other bookkeeping. For heap profiling, this will be converted to a 32-byte header block.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Note that we could instead use the metadata section, similar to other sanitizers, which is stored in a separate location. However, as described above, storing
the header block with each allocation enables 32-byte alignment without aliasing shadow counters for the same 64 bytes of memory.</span><u></u><u></u></p>
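<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The alignment argument can be checked with a small simulation. The sketch below lays out back-to-back allocations in a toy heap, each preceded by a header and with user sizes rounded up to 32 bytes, and tests whether the user memory of any two neighbours falls in the same 64-byte block (and would therefore share a shadow counter). The layout policy and the names LayOut and AnySharedCounter are assumptions made for illustration, not the prototype allocator.</span></p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct UserRange { uint64_t begin, end; };  // user memory is [begin, end)

constexpr uint64_t RoundUp(uint64_t n, uint64_t align) {
  return (n + align - 1) / align * align;
}

// Places allocations back to back starting at a 32-byte-aligned heap_start.
// Each chunk is `header_bytes` of header followed by the user memory, whose
// size is rounded up to 32 bytes. header_bytes is expected to be 0 or 32.
std::vector<UserRange> LayOut(const std::vector<uint64_t>& sizes,
                              uint64_t heap_start, uint64_t header_bytes) {
  assert(heap_start % 32 == 0 && header_bytes % 32 == 0);
  std::vector<UserRange> ranges;
  uint64_t chunk = heap_start;
  for (uint64_t size : sizes) {
    assert(size > 0);
    uint64_t user = chunk + header_bytes;
    ranges.push_back({user, user + size});
    chunk = user + RoundUp(size, 32);
  }
  return ranges;
}

// True if any two neighbouring allocations touch the same 64-byte block.
bool AnySharedCounter(const std::vector<UserRange>& ranges) {
  for (size_t i = 1; i < ranges.size(); ++i) {
    uint64_t prev_last_block = (ranges[i - 1].end - 1) / 64;
    uint64_t next_first_block = ranges[i].begin / 64;
    if (prev_last_block >= next_first_block) return true;
  }
  return false;
}
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">With a 32-byte header, at least 32 bytes of header always separate one allocation's user memory from the next, so no 64-byte block holds user data from two allocations; with no inline header, two adjacent 32-byte allocations can land in the same block.</span></p>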
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">In the prototype heap profiler implementation, the header contains the following fields:</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black">// Should be 32 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black">struct ChunkHeader {</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 1-st 4 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Carry over from ASAN (available, allocated, quarantined). Will be</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // reduced to 1 bit (available or allocated). </span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 chunk_state : 8;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Carry over from ASAN. Used to determine the start of user allocation.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 from_memalign : 1;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 23 bits available</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 2-nd 4 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Carry over from ASAN (comment copied verbatim).</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // This field is used for small sizes. For large sizes it is equal to</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // SizeClassMap::kMaxSize and the actual size is stored in the</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // SecondaryAllocator's metadata.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 user_requested_size : 29;</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 3-rd 4 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 cpu_id; // Allocation cpu id</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 4-th 4 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Allocation timestamp in ms from a baseline timestamp computed at</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // the start of profiling (to keep this within 32 bits).</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 timestamp_ms;</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 5-th and 6-th 4 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Carry over from ASAN. Used to identify allocation stack trace.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 alloc_context_id;</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // 7-th and 8-th 4 bytes</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // UNIMPLEMENTED in prototype - needs instrumentation and IR support.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 data_type_id; // hash of type name</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black">};</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:Arial,sans-serif;color:black">As noted, the chunk state can be reduced to a single bit (no need for quarantined memory in the heap profiler). The header contains a placeholder
for the data type hash, which is not yet implemented as it needs instrumentation and IR support.</span><u></u><u></u></p>
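<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Since the design depends on the header being exactly 32 bytes, a compile-time check is a cheap safeguard. The sketch below repeats the prototype's fields with the comments trimmed and adds static_asserts; the u32/u64 aliases stand in for the sanitizer typedefs, and HeaderFor is a hypothetical helper. Bit-field layout is implementation-defined, so the expected size is an assumption that holds on common 64-bit ABIs rather than a guarantee.</span></p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using u32 = uint32_t;  // stand-ins for the sanitizer integer typedefs
using u64 = uint64_t;

struct ChunkHeader {
  u32 chunk_state : 8;           // 1st 4 bytes (23 bits spare)
  u32 from_memalign : 1;
  u32 user_requested_size : 29;  // 2nd 4 bytes (does not fit in the spare bits)
  u32 cpu_id;                    // 3rd 4 bytes
  u32 timestamp_ms;              // 4th 4 bytes
  u64 alloc_context_id;          // 5th and 6th 4 bytes
  u64 data_type_id;              // 7th and 8th 4 bytes
};

// Fails to compile if padding or bit-field packing changes the layout.
static_assert(sizeof(ChunkHeader) == 32, "ChunkHeader must be 32 bytes");
static_assert(offsetof(ChunkHeader, alloc_context_id) == 16,
              "alloc_context_id expected at byte offset 16");

// Hypothetical helper: the header sits immediately before the user memory.
inline ChunkHeader* HeaderFor(void* user_ptr) {
  return reinterpret_cast<ChunkHeader*>(static_cast<unsigned char*>(user_ptr) -
                                        sizeof(ChunkHeader));
}
```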
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">Heap Info Block (HIB)</span><u></u><u></u></h3>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">On a deallocation, information from the corresponding shadow block(s) and header is recorded in a Heap Info Block (HIB) object. The access count is computed
from the shadow memory locations for the allocation, as is the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero 8-byte shadow locations for the whole allocation). Other information such as the deallocation timestamp (for lifetime
computation) and deallocation cpu id (to determine migrations) is recorded along with the information in the chunk header recorded on allocation.</span><u></u><u></u></p>
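<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The two shadow-derived values can be computed in a single pass over the allocation's counters. The following is a minimal sketch under the assumption that the counters have already been gathered into a vector; DeallocSummary and Summarize are hypothetical names, and the integer percentage (rounded down) is a choice made here, not something the RFC specifies.</span></p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct DeallocSummary {
  uint64_t access_count;      // sum of all shadow counters for the allocation
  uint32_t percent_utilized;  // percentage of counters that are non-zero
};

// `counters` holds one 8-byte shadow counter per 64-byte block that the
// allocation covers.
DeallocSummary Summarize(const std::vector<uint64_t>& counters) {
  assert(!counters.empty());
  uint64_t total = 0;
  uint64_t nonzero = 0;
  for (uint64_t counter : counters) {
    total += counter;
    if (counter != 0) ++nonzero;
  }
  return {total, static_cast<uint32_t>(nonzero * 100 / counters.size())};
}
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">For an allocation spanning four blocks with counters 0, 5, 0, and 3, this gives an access count of 8 and a utilization of 50 percent.</span></p>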
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The prototyped HIB object tracks the following:</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black">struct HeapInfoBlock {</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Total allocations at this stack context</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 alloc_count;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Access count computed from all allocated 64-byte blocks (track total</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // across all allocations, and the min and max).</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 total_access_count, min_access_count, max_access_count;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Allocated size (track total across all allocations, and the min and max).</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 total_size;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 min_size, max_size;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Lifetime (track total across all allocations, and the min and max).</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 total_lifetime;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 min_lifetime, max_lifetime;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Percent utilization of allocated 64-byte blocks (track total</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // across all allocations, and the min and max). The utilization is</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // defined as the percentage of 8-byte shadow counters corresponding to</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // the full allocation that are non-zero.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 total_percent_utilized;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 min_percent_utilized, max_percent_utilized;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Allocation and deallocation timestamps from the most recent merge into</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // the table with this stack context.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 alloc_timestamp, dealloc_timestamp;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Allocation and deallocation cpu ids from the most recent merge into</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // the table with this stack context.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 alloc_cpu_id, dealloc_cpu_id;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Count of allocations at this stack context that had a different</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // allocation and deallocation cpu id.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 num_migrated_cpu;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Number of times the lifetime of the entry being merged overlapped with</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // that of the previous entry merged with this stack context (by</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // comparing the new alloc/dealloc timestamp with the one last recorded in</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // the entry in the table).</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 num_lifetime_overlaps;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Number of times the alloc/dealloc cpu of the entry being merged was the</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // same as that of the previous entry merged with this stack context</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 num_same_alloc_cpu;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u32 num_same_dealloc_cpu;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // Hash of type name (UNIMPLEMENTED). This needs instrumentation support and</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> // possibly IR changes.</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black"> u64 data_type_id;</span><u></u><u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-size:10pt;font-family:"Courier New";color:black">};</span><u></u><u></u></p>
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">HIB Table</span><u></u><u></u></h3>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The Heap Info Block (HIB) Table is a multi-way associative cache that holds the HIBs of deallocated objects. It is indexed by the stack allocation context
id from the chunk header. The hash is currently a simple mod by a prime number close to a power of two; because of the way the stack context ids are assigned, a mod by a power of two itself performs very poorly. Thus far, only 4-way associativity
has been evaluated. </span><u></u><u></u></p>
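<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The indexing scheme described above can be sketched as follows. This is only an illustration: the set count (8191), the helper names, and the assumption that ids are spaced a power of two apart are all hypothetical and are not taken from the prototype.</span></p>

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes for illustration; not taken from the prototype.
constexpr uint32_t kNumWays = 4;     // 4-way associativity, the only one evaluated so far
constexpr uint32_t kNumSets = 8191;  // prime just below 2^13 = 8192

// Maps a stack alloc context id to the set that may hold its entry.
constexpr uint32_t SetIndex(uint32_t stack_id) { return stack_id % kNumSets; }

// Index of the first of the kNumWays slots of that set in a flat entry array.
constexpr uint32_t FirstSlot(uint32_t stack_id) {
  return SetIndex(stack_id) * kNumWays;
}

// For comparison only: a power-of-two modulus.
constexpr uint32_t PowerOfTwoIndex(uint32_t stack_id) { return stack_id % 8192u; }

// Counts how many distinct sets `count` ids spaced `stride` apart map to.
template <typename IndexFn>
uint32_t DistinctSets(IndexFn index, uint32_t stride, uint32_t count) {
  bool seen[8192] = {};
  uint32_t distinct = 0;
  for (uint32_t i = 0; i < count; ++i) {
    uint32_t set = index(i * stride);
    if (!seen[set]) {
      seen[set] = true;
      ++distinct;
    }
  }
  return distinct;
}
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Under the assumed spacing of 8192 between ids, DistinctSets(PowerOfTwoIndex, 8192, 100) lands every id in a single set, while DistinctSets(SetIndex, 8192, 100) spreads them over 100 sets.</span></p>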
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">HIB entries are added to or merged into the HIB Table on each deallocation. If an entry with a matching stack alloc context id is found in the Table, the information from the newly
deallocated chunk is merged into the existing entry. Each HIB Table entry currently tracks the min, max and total value of the various fields, from which the min, max and average are computed and reported when the Table is ultimately dumped.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">If no entry with a matching stack alloc context id is found, a new entry is created. If this causes an eviction, the evicted entry is dumped immediately (by
default to stderr, otherwise to a specified report file). Later post-processing can merge dumped entries with the same stack alloc context id.</span><u></u><u></u></p>
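<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">A minimal sketch of the merge-or-insert step for a single 4-way set is shown below. The names (Stat, Entry, HibSet, Record) and the single tracked field are hypothetical. The replacement policy is not specified above, so least-recently-used is assumed here, and evicted entries are appended to a vector rather than written to stderr.</span></p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Min, max and total of one tracked field; the average is total / count.
struct Stat {
  uint64_t min = UINT64_MAX, max = 0, total = 0;
  void Add(uint64_t v) {
    min = std::min(min, v);
    max = std::max(max, v);
    total += v;
  }
};

struct Entry {
  bool valid = false;
  uint32_t stack_id = 0;  // stack alloc context id
  uint32_t count = 0;     // number of deallocations merged in
  Stat value;             // one hypothetical tracked field
  uint64_t last_use = 0;  // for the assumed LRU policy
};

class HibSet {
 public:
  // Merges `value` into the entry for `stack_id`, creating it if needed.
  // Returns true if a valid entry was evicted (and appended to `dumped`).
  bool Record(uint32_t stack_id, uint64_t value, std::vector<Entry>* dumped) {
    ++clock_;
    for (Entry& e : ways_) {
      if (e.valid && e.stack_id == stack_id) {
        Merge(e, value);
        return false;
      }
    }
    // No match: take an empty way if there is one, else the LRU way.
    Entry* victim = &ways_[0];
    for (Entry& e : ways_) {
      if (!e.valid) { victim = &e; break; }
      if (e.last_use < victim->last_use) victim = &e;
    }
    bool evicted = victim->valid;
    if (evicted) dumped->push_back(*victim);
    *victim = Entry{};
    victim->valid = true;
    victim->stack_id = stack_id;
    Merge(*victim, value);
    return evicted;
  }

  // Returns the entry for `stack_id`, or nullptr if it is not cached.
  const Entry* Find(uint32_t stack_id) const {
    for (const Entry& e : ways_)
      if (e.valid && e.stack_id == stack_id) return &e;
    return nullptr;
  }

 private:
  void Merge(Entry& e, uint64_t value) {
    ++e.count;
    e.value.Add(value);
    e.last_use = clock_;
  }
  Entry ways_[4];
  uint64_t clock_ = 0;
};
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Because an evicted entry leaves the table with only partial totals, a later record for the same id starts a fresh entry; this is why the post-processing merge of dumped entries is needed.</span></p>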
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">Initialization</span><u></u><u></u></h3>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">For ASAN, an __asan_init function initializes the memory allocation tracking support, and the ASAN instrumentation pass in LLVM creates a global constructor
to invoke it. The heap profiler prototype adds a new __heapprof_init function, which performs heap profile specific initialization, and the heap profile instrumentation pass generates a global constructor that calls this new init function instead. __heapprof_init currently
also invokes __asan_init, since we are leveraging a modified ASAN runtime. Eventually, this should be changed to initialize refactored common support.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">Note that __asan_init is also placed in the .preinit_array when it is available, so it is invoked even earlier than global constructors. Currently, it is not
possible to do this for __heapprof_init, as it calls timespec_get in order to get a baseline timestamp, and system calls cannot be made that early (dl_init is not complete). (As described in the ChunkHeader comments, the timestamps (ms) are actually offsets from the baseline timestamp, in order to fit into 32 bits.)
Since the constructor priority is 1, it should be executed early enough that there are very few allocations before it runs, and likely the best solution is to simply ignore any allocations before initialization.</span><u></u><u></u></p>
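<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The baseline-relative timestamps can be sketched as below. The helper names (ToMs, OffsetMs, Now) are hypothetical; only the use of timespec_get and of 32-bit millisecond offsets comes from the description above. A 32-bit millisecond offset covers roughly 49.7 days of run time before it wraps.</span></p>

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Converts a timespec to whole milliseconds.
inline uint64_t ToMs(const timespec& ts) {
  return static_cast<uint64_t>(ts.tv_sec) * 1000 +
         static_cast<uint64_t>(ts.tv_nsec) / 1000000;
}

// Offset of `now` from `baseline` in milliseconds, truncated to 32 bits.
// Assumes `now` is not earlier than `baseline`.
inline uint32_t OffsetMs(const timespec& baseline, const timespec& now) {
  return static_cast<uint32_t>(ToMs(now) - ToMs(baseline));
}

// Reads the current time; this is the call that cannot be made from
// .preinit_array according to the text above.
inline timespec Now() {
  timespec ts;
  timespec_get(&ts, TIME_UTC);
  return ts;
}
```

<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">In this sketch the init function would store Now() once as the baseline, and each alloc/dealloc would record OffsetMs(baseline, Now()) in the chunk header.</span></p>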
<h3 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-size:14pt;font-family:Arial,sans-serif;color:rgb(67,67,67);font-weight:normal">Dumping</span><u></u><u></u></h3>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">For the prototype, the profile is dumped as text in a compact raw format to limit its size. Ultimately it should be dumped in a more compact binary format
(i.e. into a different section of the raw instrumentation based profile, with llvm-profdata performing post-processing); the details are TBD.</span><u></u><u></u></p>
<h4 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-family:Arial,sans-serif;color:rgb(102,102,102);font-weight:normal">HIB Dumping</span><u></u><u></u></h4>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">As noted earlier, HIB Table entries are created as memory is deallocated. At the end of the run (or whenever dumping is requested, discussed later), HIB entries
also need to be created for allocations that are still live. Conveniently, the sanitizer allocator already contains a mechanism to walk through all chunks of memory it is tracking (</span><span style="font-family:"Courier New";color:black">ForEachChunk</span><span style="font-family:Arial,sans-serif;color:black">).
The heap profiler simply looks for all chunks with a chunk state of allocated, creates a HIB for each just as would be done on deallocation, and adds it to the table.</span><u></u><u></u></p>
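<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">The live-chunk walk can be sketched with a stand-in allocator. MockAllocator, Chunk, ChunkState and LiveSummary below are hypothetical and only mimic the callback style of the sanitizer allocator's ForEachChunk; the real chunk layout and HIB creation are not reproduced here.</span></p>

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical chunk states; only "allocated" chunks are still live.
enum ChunkState : uint8_t { kAvailable, kAllocated, kQuarantined };

struct Chunk {
  ChunkState state;
  uint32_t alloc_stack_id;  // stack alloc context id from the chunk header
  uint32_t size;
};

// Stand-in for the allocator: visits every tracked chunk via a callback.
class MockAllocator {
 public:
  using Callback = void (*)(const Chunk& chunk, void* arg);
  void Add(const Chunk& c) { chunks_.push_back(c); }
  void ForEachChunk(Callback cb, void* arg) const {
    for (const Chunk& c : chunks_) cb(c, arg);
  }

 private:
  std::vector<Chunk> chunks_;
};

// Stand-in for "create a HIB and add it to the table".
struct LiveSummary {
  uint32_t num_live = 0;
  uint64_t live_bytes = 0;
};

// Callback: skips chunks that are not live, records the rest.
void CollectIfLive(const Chunk& chunk, void* arg) {
  if (chunk.state != kAllocated) return;
  LiveSummary* summary = static_cast<LiveSummary*>(arg);
  ++summary->num_live;
  summary->live_bytes += chunk.size;
}

// Walks all chunks and accounts for those still allocated.
LiveSummary FinalizeLiveChunks(const MockAllocator& allocator) {
  LiveSummary summary;
  allocator.ForEachChunk(CollectIfLive, &summary);
  return summary;
}
```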
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">A HIB Table mechanism for printing each entry is then invoked.</span><u></u><u></u></p>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">By default, the dumping occurs:</span><u></u><u></u></p>
<ul style="margin-top:0in" type="disc">
<li style="color:black;margin-top:0in;margin-bottom:0.0001pt;vertical-align:baseline;font-variant-numeric:normal;font-variant-east-asian:normal">
<span style="font-family:Arial,sans-serif">on evictions<u></u><u></u></span></li><li style="color:black;margin-top:0in;margin-bottom:0.0001pt;vertical-align:baseline;font-variant-numeric:normal;font-variant-east-asian:normal">
<span style="font-family:Arial,sans-serif">full table at exit (when the static Allocator object is destructed)<u></u><u></u></span></li></ul>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">For running in a load testing scenario, we will want to add a mechanism to trigger finalization (merging in currently live allocations) and dumping of the HIB
Table before exit. This would be similar to the __llvm_profile_dump facility used for normal PGO counter dumping.</span><u></u><u></u></p>
<h4 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-family:Arial,sans-serif;color:rgb(102,102,102);font-weight:normal">Stack Trace Dumping</span><u></u><u></u></h4>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">There is existing support for dumping symbolized StackTrace objects. A wrapper to dump all StackTrace objects in the StackDepot will be added. This new interface
will be invoked just after the HIB Table is dumped (on exit or via the dumping interface).</span><u></u><u></u></p>
<h4 style="margin-right:0in;margin-bottom:4pt;margin-left:0in">
<span style="font-family:Arial,sans-serif;color:rgb(102,102,102);font-weight:normal">Memory Map Dumping</span><u></u><u></u></h4>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">In cases where we may want to symbolize as a post processing step, we may need the memory map (from /proc/self/smaps). Specifically, this is needed to symbolize
binaries using ASLR (Address Space Layout Randomization). There is already support for reading this file and dumping it to the specified report output file (DumpProcessMap()). This is invoked when the profile output file is initialized (HIB Table construction),
so that the memory map is available at the top of the raw profile.</span><u></u><u></u></p>
<h1 style="margin-right:0in;margin-bottom:6pt;margin-left:0in">
<span style="font-size:20pt;font-family:Arial,sans-serif;color:black;font-weight:normal">Current Status and Next Steps</span><u></u><u></u></h1>
<p class="MsoNormal"><u></u> <u></u></p>
<p style="margin:0in 0in 0.0001pt"><span style="font-family:Arial,sans-serif;color:black">As mentioned earlier, I have a working prototype based on a simplified, stripped-down version of ASAN. My current plan is to do the following:</span><u></u><u></u></p>
<ol style="margin-top:0in" start="1" type="1">
<li style="color:black;margin-top:0in;margin-bottom:0.0001pt;vertical-align:baseline;font-variant-numeric:normal;font-variant-east-asian:normal">
<span style="font-family:Arial,sans-serif">Refactor out some of the shadow setup code common between ASAN and HWASAN into sanitizer_common.<u></u><u></u></span></li><li style="color:black;margin-top:0in;margin-bottom:0.0001pt;vertical-align:baseline;font-variant-numeric:normal;font-variant-east-asian:normal">
<span style="font-family:Arial,sans-serif">Rework my prototype into a separate heapprof library in compiler-rt, using sanitizer_common support where possible, and send patches for review.<u></u><u></u></span></li><li style="color:black;margin-top:0in;margin-bottom:0.0001pt;vertical-align:baseline;font-variant-numeric:normal;font-variant-east-asian:normal">
<span style="font-family:Arial,sans-serif">Send patches for the heap profiler instrumentation pass and related clang options.<u></u><u></u></span></li><li style="color:black;margin-top:0in;margin-bottom:0.0001pt;vertical-align:baseline;font-variant-numeric:normal;font-variant-east-asian:normal">
<span style="font-family:Arial,sans-serif">Design and implement the binary profile format.<u></u><u></u></span></li></ol>
<div>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
<p class="MsoNormal">-- <u></u><u></u></p>
<div>
<div>
<div>
<table border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td nowrap style="border-right:none;border-bottom:none;border-left:none;border-top:1.5pt solid rgb(213,15,37);padding:0in">
<p class="MsoNormal"><span style="font-size:12pt;font-family:Arial,sans-serif;color:rgb(85,85,85)">Teresa Johnson |<u></u><u></u></span></p>
</td>
<td nowrap style="border-right:none;border-bottom:none;border-left:none;border-top:1.5pt solid rgb(51,105,232);padding:0in">
<p class="MsoNormal"><span style="font-size:12pt;font-family:Arial,sans-serif;color:rgb(85,85,85)"> Software Engineer |<u></u><u></u></span></p>
</td>
<td nowrap style="border-right:none;border-bottom:none;border-left:none;border-top:1.5pt solid rgb(0,153,57);padding:0in">
<p class="MsoNormal"><span style="font-size:12pt;font-family:Arial,sans-serif;color:rgb(85,85,85)"> <a href="mailto:tejohnson@google.com" target="_blank">tejohnson@google.com</a> |<u></u><u></u></span></p>
</td>
<td nowrap style="border-right:none;border-bottom:none;border-left:none;border-top:1.5pt solid rgb(238,178,17);padding:0in">
</td>
</tr>
</tbody>
</table>
<p class="MsoNormal"><u></u> <u></u></p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote></div></div>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr"><div><span style="font-family:Times;font-size:medium"><table cellspacing="0" cellpadding="0"><tbody><tr style="color:rgb(85,85,85);font-family:sans-serif;font-size:small"><td nowrap style="border-top:2px solid rgb(213,15,37)">Teresa Johnson |</td><td nowrap style="border-top:2px solid rgb(51,105,232)"> Software Engineer |</td><td nowrap style="border-top:2px solid rgb(0,153,57)"> <a href="mailto:tejohnson@google.com" target="_blank">tejohnson@google.com</a> |</td><td nowrap style="border-top:2px solid rgb(238,178,17)"><br></td></tr></tbody></table></span></div></div></div></div></div>