[llvm-dev] RFC: Sanitizer-based Heap Profiler

Sun Jul 5 13:43:50 PDT 2020

On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:

> This sounds very useful. We’ve improved and used memoro
> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and
> analysis, and we are also looking for ways to leverage memory profile for
> PGO/FDO. I think having a common profiling infrastructure for analysis
> tooling as well as profile guided optimizations is good design, and having
> it in LLVM is also helpful. Very interested in the tooling and optimization
> that comes after the profiler.
>
>
>
> Two questions:
>
>    - How does the profiling overhead look? Is that similar to ASAN
>    overhead from what you’ve seen, which would be higher than PGO
>    instrumentation? Asking because I’m wondering if any PGO training setup can
>    be used directly for the new heap profiling.
>
>
It is built on top of ASAN runtime, but the overhead can be made much lower
by using counter update consolidation -- all fields sharing the same shadow
counter can be merged, and aggressive loop sinking/hoisting can be done.

The goal is to integrate this with the PGO instrumentation. The PGO
instrumentation overhead can be further reduced with sampling technique
(Rong Xu has a patch to be submitted).

>    -
>    - I’m not familiar with how sanitizer handles stack trace, but for
>    getting most accurate calling context (use FP rather than dwarf), I guess
>    frame pointer omission and tail call opt etc. need to be turned off? Is
>    that going to be implied by -fheapprof?
>
>
Kostya can provide detailed answers to these questions.

David

>
>    -
>
>
>
> Thanks,
>
> Wenlei
>
>
>
> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
> *Reply-To: *Teresa Johnson <tejohnson at google.com>
> *Date: *Wednesday, June 24, 2020 at 4:58 PM
> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <
> kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka <
> vitalybuka at google.com>
> *Cc: *David Li <davidxl at google.com>
> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>
>
>
> Hi all,
>
>
>
> I've included an RFC for a heap profiler design I've been working on in
> conjunction with David Li. Please send any questions or feedback. For
> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
> shadow setup code (see the Shadow Memory section).
>
>
>
> Thanks,
>
> Teresa
>
>
>
> RFC: Sanitizer-based Heap Profiler
> Summary
>
> This document provides an overview of an LLVM Sanitizer-based heap
> profiler design.
> Motivation
>
> The objective of heap memory profiling is to collect critical runtime
> information associated with heap memory references and information on heap
> memory allocations. The profile information will be used first for tooling,
> and subsequently to guide the compiler optimizer and allocation runtime to
> layout heap objects with improved spatial locality. As a  result, DTLB and
> cache utilization will be improved, and program IPC (performance) will be
> increased due to reduced TLB and cache misses. More details on the heap
> profile guided optimizations will be shared in the future.
> Overview
>
> The profiler is based on compiler inserted instrumentation of load and
> store accesses, and utilizes runtime support to monitor heap allocations
> and profile data. The target consumer of the heap memory profile
> information is initially tooling and ultimately automatic data layout
> optimizations performed by the compiler and/or allocation runtime (with the
> support of new allocation runtime APIs).
>
>
>
> Each memory address is mapped to Shadow Memory
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__en.wikipedia.org_wiki_Shadow-5Fmemory&d=DwMFaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=KfYo542rDdZQGClmgz-RBw&m=f45oT3WLypO1yblv9KNkPd-rl8jlBp761Hhvev27S8M&s=iIirMZSYnDlGIjY8PZjJprWckHx7QhmKUQKcb1URBFY&e=>,
> similar to the approach used by the Address Sanitizer
> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
> location implements the profile counter (incremented on accesses to the
> corresponding memory). This granularity was chosen to help avoid counter
> overflow, but it may be possible to consider mapping 32-bytes to 4 bytes.
> To avoid aliasing of shadow memory for different allocations, we must
> choose a minimum alignment carefully. As discussed further below, we can
> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
> storing necessary heap information for each allocation in a 32-byte header
> block.
>
>
>
> The compiler instruments each load and store to increment the associated
> shadow memory counter, in order to determine hotness.
>
>
>
> The heap profiler runtime is responsible for tracking allocations and
> deallocations, including the stack at each allocation, and information such
> as the allocation size and other statistics. I have implemented a prototype
> built using a stripped down and modified version of ASAN, however this will
> be a separate library utilizing sanitizer_common components.
> Compiler
>
> A simple HeapProfiler instrumentation pass instruments interesting memory
> accesses (loads, stores, atomics), with a simple load, increment, store of
> the associated shadow memory location (computed via a mask and shift to do
> the mapping of 64 bytes to 8 byte shadow, and add of the shadow offset).
> The handling is very similar to and based off of the ASAN instrumentation
> pass, with slightly different instrumentation code.
>
>
>
> Various techniques can be used to reduce the overhead, by aggressively
> coalescing counter updates (e.g. given the 32-byte alignment, accesses
> known to be in the same 32-byte block, or across possible aliases since we
> don’t care about the dereferenced values).
>
>
>
> Additionally, the Clang driver needs to set up to link with the runtime
> library, much as it does with the sanitizers.
>
>
>
> A -fheapprof option is added to enable the instrumentation pass and
> runtime library linking. Similar to -fprofile-generate, -fheapprof will
> accept an argument specifying the directory in which to write the profile.
> Runtime
>
> The heap profiler runtime is responsible for tracking and reporting
> information about heap allocations and accesses, aggregated by allocation
> calling context. For example, the hotness, lifetime, and cpu affinity.
>
>
>
> A new heapprof library will be created within compiler-rt. It will
> leverage support within sanitizer_common, which already contains facilities
> like stack context tracking, needed by the heap profiler.
> Shadow Memory
>
> There are some basic facilities in sanitizer_common for mmap’ing the
> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
> libraries. In the case of ASAN, there is support for both statically
> assigned shadow offsets (the default on most platforms), and for
> dynamically assigned shadow memory (implemented for Windows and currently
> also used for Android and iOS). According to kcc, recent experiments show
> that the performance with a dynamic shadow is close to that with a static
> mapping. In fact, that is the only approach currently used by HWASAN. Given
> the simplicity, the heap profiler will be implemented with a dynamic shadow
> as well.
>
>
>
> There are a number of functions in ASAN and HWASAN related to setup of the
> shadow that are duplicated but very nearly identical, at least for linux
> (which seems to be the only OS flavor currently supported for HWASAN). E.g.
> ReserveShadowMemoryRange, ProtectGap, and FindDynamicShadowStart (in ASAN
> there is another nearly identical copy in PremapShadow, used by Android,
> whereas in HW ASAN the premap handling is already commoned with the
> non-premap handling). Rather than make yet another copy of these
> mechanisms, I propose refactoring them into sanitizer_common versions. Like
> HWASAN, the initial version of the heap profiler will be supported for
> linux only, but other OSes can be added as needed similar to ASAN.
> StackTrace and StackDepot
>
> The sanitizer already contains support for obtaining and representing a
> stack trace in a StackTrace object, and storing it in the StackDepot which
> “efficiently stores huge amounts of stack traces”. This is in the
> sanitizer_common subdirectory and the support is shared by ASAN and
> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
> where each StackTrace is assigned a unique id. ASAN stores this id in the
> alloc_context_id field in each ChunkHeader (in the redzone preceding each
> allocation). Additionally, there is support for symbolizing and printing
> StackTrace objects.
> ChunkHeader
>
> The heap profiler needs to track several pieces of information for each
> allocation. Given the mapping of 64-bytes to 8-bytes shadow, we can achieve
> a minimum of 32-byte alignment by holding this information in a 32-byte
> header block preceding each allocation.
>
>
>
> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It contains
> information about the current allocation state, user requested size,
> allocation and free thread ids, the allocation context id (representing the
> call stack at allocation, assigned by the StackDepot as described above),
> and misc other bookkeeping. For heap profiling, this will be converted to a
> 32-byte header block.
>
>
>
> Note that we could instead use the metadata section, similar to other
> sanitizers, which is stored in a separate location. However, as described
> above, storing the header block with each allocation enables 32-byte
> alignment without aliasing shadow counters for the same 64 bytes of memory.
>
>
>
> In the prototype heap profiler implementation, the header contains the
> following fields:
>
>
>
> // Should be 32 bytes
>
> struct ChunkHeader {
>
>   // 1-st 4 bytes
>
>   // Carry over from ASAN (available, allocated, quarantined). Will be
>
>   // reduced to 1 bit (available or allocated).
>
>   u32 chunk_state       : 8;
>
>   // Carry over from ASAN. Used to determine the start of user allocation.
>
>   u32 from_memalign     : 1;
>
>   // 23 bits available
>
>
>
>   // 2-nd 4 bytes
>
>   // Carry over from ASAN (comment copied verbatim).
>
>   // This field is used for small sizes. For large sizes it is equal to
>
>   // SizeClassMap::kMaxSize and the actual size is stored in the
>
>   // SecondaryAllocator's metadata.
>
>   u32 user_requested_size : 29;
>
>
>
>   // 3-rd 4 bytes
>
>   u32 cpu_id; // Allocation cpu id
>
>
>
>   // 4-th 4 bytes
>
>   // Allocation timestamp in ms from a baseline timestamp computed at
>
>   // the start of profiling (to keep this within 32 bits).
>
>   u32 timestamp_ms;
>
>
>
>   // 5-th and 6-th 4 bytes
>
>   // Carry over from ASAN. Used to identify allocation stack trace.
>
>   u64 alloc_context_id;
>
>
>
>   // 7-th and 8-th 4 bytes
>
>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>
>   u64 data_type_id; // hash of type name
>
> };
>
> As noted, the chunk state can be reduced to a single bit (no need for
> quarantined memory in the heap profiler). The header contains a placeholder
> for the data type hash, which is not yet implemented as it needs
> instrumentation and IR support.
> Heap Info Block (HIB)
>
> On a deallocation, information from the corresponding shadow block(s) and
> header are recorded in a Heap Info Block (HIB) object. The access count is
> computed from the shadow memory locations for the allocation, as well as
> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
> 8-byte shadow locations for the whole allocation). Other information such
> as the deallocation timestamp (for lifetime computation) and deallocation
> cpu id (to determine migrations) are recorded along with the information in
> the chunk header recorded on allocation.
>
>
>
> The prototyped HIB object tracks the following:
>
>
>
> struct HeapInfoBlock {
>
>   // Total allocations at this stack context
>
>   u32 alloc_count;
>
>   // Access count computed from all allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max).
>
>   u64 total_access_count, min_access_count, max_access_count;
>
>   // Allocated size (track total across all allocations, and the min and
> max).
>
>   u64 total_size;
>
>   u32 min_size, max_size;
>
>   // Lifetime (track total across all allocations, and the min and max).
>
>   u64 total_lifetime;
>
>   u32 min_lifetime, max_lifetime;
>
>   // Percent utilization of allocated 64-byte blocks (track total
>
>   // across all allocations, and the min and max). The utilization is
>
>   // defined as the percentage of 8-byte shadow counters corresponding to
>
>   // the full allocation that are non-zero.
>
>   u64 total_percent_utilized;
>
>   u32 min_percent_utilized, max_percent_utilized;
>
>   // Allocation and deallocation timestamps from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_timestamp, dealloc_timestamp;
>
>   // Allocation and deallocation cpu ids from the most recent merge into
>
>   // the table with this stack context.
>
>   u32 alloc_cpu_id, dealloc_cpu_id;
>
>   // Count of allocations at this stack context that had a different
>
>   // allocation and deallocation cpu id.
>
>   u32 num_migrated_cpu;
>
>   // Number of times the lifetime of the entry being merged had its
> lifetime
>
>   // overlap with the previous entry merged with this stack context (by
>
>   // comparing the new alloc/dealloc timestamp with the one last recorded
> in
>
>   // the entry in the table.
>
>   u32 num_lifetime_overlaps;
>
>   // Number of times the alloc/dealloc cpu of the entry being merged was
> the
>
>   // same as that of the previous entry merged with this stack context
>
>   u32 num_same_alloc_cpu;
>
>   u32 num_same_dealloc_cpu;
>
>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
> and
>
>   // possibly IR changes.
>
>   u64 data_type_id;
>
> }
> HIB Table
>
> The Heap Info Block Table, which is a multi-way associative cache, holds
> HIB objects from deallocated objects. It is indexed by the stack allocation
> context id from the chunk header, and currently utilizes a simple mod with
> a prime number close to a power of two as the hash (because of the way the
> stack context ids are assigned, a mod of a power of two performs very
> poorly). Thus far, only 4-way associativity has been evaluated.
>
>
>
> HIB entries are added or merged into the HIB Table on each deallocation.
> If an entry with a matching stack alloc context id is found in the Table,
> the newly deallocated information is merged into the existing entry. Each
> HIB Table entry currently tracks the min, max and total value of the
> various fields for use in computing and reporting the min, max and average
> when the Table is ultimately dumped.
>
>
>
> If no entry with a matching stack alloc context id is found, a new entry
> is created. If this causes an eviction, the evicted entry is dumped
> immediately (by default to stderr, otherwise to a specified report file).
> Later post processing can merge dumped entries with the same stack alloc
> context id.
> Initialization
>
>
>
> For ASAN, an __asan_init function initializes the memory allocation
> tracking support, and the ASAN instrumentation pass in LLVM creates a
> global constructor to invoke it. The heap profiler prototype adds a new
> __heapprof_init function, which performs heap profile specific
> initialization, and the heap profile instrumentation pass calls this new
> init function instead by a generated global constructor. It currently
> additionally invokes __asan_init since we are leveraging a modified ASAN
> runtime. Eventually, this should be changed to initialize refactored common
> support.
>
>
>
> Note that __asan init is also placed in the .preinit_array when it is
> available, so it is invoked even earlier than global constructors.
> Currently, it is not possible to do this for __heapprof_init, as it calls
> timespec_get in order to get a baseline timestamp (as described in the
> ChunkHeader comments the timestamps (ms) are actually offsets from the
> baseline timestamp, in order to fit into 32 bits), and system calls cannot
> be made that early (dl_init is not complete). Since the constructor
> priority is 1, it should be executed early enough that there are very few
> allocations before it runs, and likely the best solution is to simply
> ignore any allocations before initialization.
> Dumping
>
> For the prototype, the profile is dumped as text with a compact raw format
> to limit its size. Ultimately it should be dumped in a more compact binary
> format (i.e. into a different section of the raw instrumentation based
> profile, with llvm-profdata performing post-processing) which is TBD.
> HIB Dumping
>
> As noted earlier, HIB Table entries are created as memory is deallocated.
> At the end of the run (or whenever dumping is requested, discussed later),
> HIB entries need to be created for allocations that are still live.
> Conveniently, the sanitizer allocator already contains a mechanism to walk
> through all chunks of memory it is tracking (ForEachChunk). The heap
> profiler simply looks for all chunks with a chunk state of allocated, and
> creates a HIB the same as would be done on deallocation, adding each to the
> table.
>
>
>
> A HIB Table mechanism for printing each entry is then invoked.
>
>
>
> By default, the dumping occurs:
>
>    - on evictions
>    - full table at exit (when the static Allocator object is destructed)
>
>
>
> For running in a load testing scenario, we will want to add a mechanism to
> provoke finalization (merging currently live allocations) and dumping of
> the HIB Table before exit. This would be similar to the __llvm_profile_dump
> facility used for normal PGO counter dumping.
> Stack Trace Dumping
>
> There is existing support for dumping symbolized StackTrace objects. A
> wrapper to dump all StackTrace objects in the StackDepot will be added.
> This new interface is invoked just after the HIB Table is dumped (on exit
> or via dumping interface).
> Memory Map Dumping
>
> In cases where we may want to symbolize as a post processing step, we may
> need the memory map (from /proc/self/smaps). Specifically, this is needed
> to symbolize binaries using ASLR (Address Space Layout Randomization).
> There is already support for reading this file and dumping it to the
> specified report output file (DumpProcessMap()). This is invoked when the
> profile output file is initialized (HIB Table construction), so that the
> memory map is available at the top of the raw profile.
> Current Status and Next Steps
>
>
>
> As mentioned earlier, I have a working prototype based on a simplified
> stripped down version of ASAN. My current plan is to do the following:
>
>    1. Refactor out some of the shadow setup code common between ASAN and
>    HWASAN into sanitizer_common.
>    2. Rework my prototype into a separate heapprof library in
>    compiler-rt, using sanitizer_common support where possible, and send
>    patches for review.
>    3. Send patches for the heap profiler instrumentation pass and related
>    clang options.
>    4. Design/implement binary profile format
>
>
>
> --
>
> Teresa Johnson |
>
>  Software Engineer |
>
>  tejohnson at google.com |
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200705/51c0a31c/attachment-0001.html>