[llvm-dev] RFC: Sanitizer-based Heap Profiler

Teresa Johnson via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 6 07:58:43 PDT 2020


Hi Wenlei,

Thanks for the comments! David answered the first question; I do have some
comments on the second one, though.
Teresa

On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com> wrote:

>
>
> On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:
>
>> This sounds very useful. We’ve improved and used memoro
>> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and
>> analysis, and we are also looking for ways to leverage memory profiles for
>> PGO/FDO. I think having a common profiling infrastructure for analysis
>> tooling as well as profile-guided optimizations is a good design, and having
>> it in LLVM is also helpful. Very interested in the tooling and optimizations
>> that come after the profiler.
>>
>>
>>
>> Two questions:
>>
>>    - How does the profiling overhead look? Is that similar to ASAN
>>    overhead from what you’ve seen, which would be higher than PGO
>>    instrumentation? Asking because I’m wondering if any PGO training setup can
>>    be used directly for the new heap profiling.
>>
>>
> It is built on top of the ASAN runtime, but the overhead can be made much
> lower by using counter update consolidation -- all fields sharing the same
> shadow counter can be merged, and aggressive loop sinking/hoisting can be
> done.
>
> The goal is to integrate this with the PGO instrumentation. The PGO
> instrumentation overhead can be further reduced with a sampling technique
> (Rong Xu has a patch to be submitted).
>
>
>>    - I’m not familiar with how the sanitizers handle stack traces, but for
>>    getting the most accurate calling context (using FP rather than DWARF), I
>>    guess frame pointer omission, tail call optimization, etc. need to be
>>    turned off? Is that going to be implied by -fheapprof?
>>
>>
> Kostya can provide detailed answers to these questions.
>

I'm not aware that the -fsanitize* options disable these, but I know in our
environment we do disable frame pointer omission when setting up ASAN
builds, and I am arranging for heap profiling builds to do the same. I'm not
sure whether we want to do this within clang itself; I would be interested in
Kostya's opinion. I can't see anywhere that we disable tail call
optimizations for ASAN, though I might have missed it.

Thanks,
Teresa


>
> David
>
>>
>>
>>
>>
>> Thanks,
>>
>> Wenlei
>>
>>
>>
>> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
>> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
>> *Reply-To: *Teresa Johnson <tejohnson at google.com>
>> *Date: *Wednesday, June 24, 2020 at 4:58 PM
>> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <
>> kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka <
>> vitalybuka at google.com>
>> *Cc: *David Li <davidxl at google.com>
>> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>>
>>
>>
>> Hi all,
>>
>>
>>
>> I've included an RFC for a heap profiler design I've been working on in
>> conjunction with David Li. Please send any questions or feedback. For
>> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
>> shadow setup code (see the Shadow Memory section).
>>
>>
>>
>> Thanks,
>>
>> Teresa
>>
>>
>>
>> RFC: Sanitizer-based Heap Profiler
>> Summary
>>
>> This document provides an overview of an LLVM Sanitizer-based heap
>> profiler design.
>> Motivation
>>
>> The objective of heap memory profiling is to collect critical runtime
>> information associated with heap memory references and information on heap
>> memory allocations. The profile information will be used first for tooling,
>> and subsequently to guide the compiler optimizer and allocation runtime to
>> lay out heap objects with improved spatial locality. As a result, DTLB and
>> cache utilization will be improved, and program IPC (performance) will be
>> increased due to reduced TLB and cache misses. More details on the heap
>> profile-guided optimizations will be shared in the future.
>> Overview
>>
>> The profiler is based on compiler-inserted instrumentation of load and
>> store accesses, and utilizes runtime support to monitor heap allocations
>> and profile data. The target consumer of the heap memory profile
>> information is initially tooling and ultimately automatic data layout
>> optimizations performed by the compiler and/or allocation runtime (with the
>> support of new allocation runtime APIs).
>>
>>
>>
>> Each memory address is mapped to Shadow Memory
>> <https://en.wikipedia.org/wiki/Shadow_memory>,
>> similar to the approach used by the Address Sanitizer
>> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
>> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
>> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
>> location implements the profile counter (incremented on accesses to the
>> corresponding memory). This granularity was chosen to help avoid counter
>> overflow, but it may be possible to consider mapping 32 bytes to 4 bytes.
>> To avoid aliasing of shadow memory for different allocations, we must
>> choose a minimum alignment carefully. As discussed further below, we can
>> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
>> storing necessary heap information for each allocation in a 32-byte header
>> block.
>>
>>
>>
>> The compiler instruments each load and store to increment the associated
>> shadow memory counter, in order to determine hotness.
>>
>>
>>
>> The heap profiler runtime is responsible for tracking allocations and
>> deallocations, including the stack at each allocation, and information such
>> as the allocation size and other statistics. I have implemented a prototype
>> built using a stripped-down and modified version of ASAN; however, the final
>> version will be a separate library utilizing sanitizer_common components.
>> Compiler
>>
>> A simple HeapProfiler instrumentation pass instruments interesting memory
>> accesses (loads, stores, atomics) with a simple load, increment, and store
>> of the associated shadow memory location (computed via a mask and shift that
>> map each 64 bytes of memory to its 8-byte shadow, plus an add of the shadow
>> offset). The handling is very similar to, and based on, the ASAN
>> instrumentation pass, with slightly different instrumentation code.
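>>
>> The following is a minimal sketch of that shadow computation, assuming a
>> dynamically assigned shadow base and the 64-byte-to-8-byte mapping described
>> above; the names (heapprof_shadow, heapprof_shadow_offset) are illustrative
>> and are not the actual pass output or runtime API.
>>
>> #include <cstdint>
>>
>> // Shadow base; would be chosen and published by the runtime at startup.
>> uint64_t heapprof_shadow_offset = 0;
>>
>> inline uint64_t *heapprof_shadow(const void *addr) {
>>   uint64_t a = reinterpret_cast<uint64_t>(addr);
>>   // Mask off the low 6 bits (the 64-byte granule), shift to an 8-byte
>>   // shadow index, and add the shadow offset.
>>   return reinterpret_cast<uint64_t *>(((a & ~0x3FULL) >> 3) +
>>                                       heapprof_shadow_offset);
>> }
>>
>> // Conceptual instrumentation sequence for each load/store of *addr:
>> // load the counter, increment, store it back.
>> inline void heapprof_record_access(const void *addr) {
>>   ++*heapprof_shadow(addr);
>> }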
>>
>>
>>
>> Various techniques can be used to reduce the overhead by aggressively
>> coalescing counter updates: e.g., given the 32-byte minimum alignment,
>> merging updates for accesses known to be in the same 32-byte block, or even
>> across possible aliases, since we don’t care about the dereferenced values.
>> A sketch of the idea follows.
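>>
>> For illustration only (hand-written source standing in for what the pass
>> would emit; heapprof_shadow is the illustrative helper from the sketch
>> above, and the struct is hypothetical):
>>
>> struct Point { long x, y; };  // 16 bytes, within one 32-byte block
>>
>> long sum(const Point &p) {
>>   // Naive instrumentation: one counter update per access.
>>   //   ++*heapprof_shadow(&p.x);
>>   //   ++*heapprof_shadow(&p.y);
>>   // Coalesced: p.x and p.y lie in the same 32-byte block, hence the same
>>   // 64-byte granule and the same shadow counter, so a single update
>>   // suffices:
>>   //   *heapprof_shadow(&p) += 2;
>>   return p.x + p.y;
>> }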
>>
>>
>>
>> Additionally, the Clang driver needs to be set up to link with the runtime
>> library, much as it does with the sanitizers.
>>
>>
>>
>> A -fheapprof option is added to enable the instrumentation pass and
>> runtime library linking. Similar to -fprofile-generate, -fheapprof will
>> accept an argument specifying the directory in which to write the profile.
>> Runtime
>>
>> The heap profiler runtime is responsible for tracking and reporting
>> information about heap allocations and accesses, aggregated by allocation
>> calling context: for example, the hotness, lifetime, and cpu affinity.
>>
>>
>>
>> A new heapprof library will be created within compiler-rt. It will
>> leverage support within sanitizer_common, which already contains facilities
>> like stack context tracking, needed by the heap profiler.
>> Shadow Memory
>>
>> There are some basic facilities in sanitizer_common for mmap’ing the
>> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
>> libraries. In the case of ASAN, there is support for both statically
>> assigned shadow offsets (the default on most platforms), and for
>> dynamically assigned shadow memory (implemented for Windows and currently
>> also used for Android and iOS). According to kcc, recent experiments show
>> that the performance with a dynamic shadow is close to that with a static
>> mapping. In fact, that is the only approach currently used by HWASAN. Given
>> the simplicity, the heap profiler will be implemented with a dynamic shadow
>> as well.
>>
>>
>>
>> There are a number of functions in ASAN and HWASAN related to setup of
>> the shadow that are duplicated but very nearly identical, at least for
>> Linux (which seems to be the only OS flavor currently supported for
>> HWASAN), e.g. ReserveShadowMemoryRange, ProtectGap, and
>> FindDynamicShadowStart (in ASAN there is another nearly identical copy in
>> PremapShadow, used by Android, whereas in HWASAN the premap handling is
>> already commoned with the non-premap handling). Rather than make yet
>> another copy of these mechanisms, I propose refactoring them into
>> sanitizer_common versions. Like HWASAN, the initial version of the heap
>> profiler will be supported for Linux only, but other OSes can be added as
>> needed, similar to ASAN.
>> StackTrace and StackDepot
>>
>> The sanitizer already contains support for obtaining and representing a
>> stack trace in a StackTrace object, and storing it in the StackDepot which
>> “efficiently stores huge amounts of stack traces”. This is in the
>> sanitizer_common subdirectory and the support is shared by ASAN and
>> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
>> where each StackTrace is assigned a unique id. ASAN stores this id in the
>> alloc_context_id field in each ChunkHeader (in the redzone preceding each
>> allocation). Additionally, there is support for symbolizing and printing
>> StackTrace objects.
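>>
>> Conceptually, the StackDepot behaves like the following self-contained
>> sketch (an illustration of the idea only, not the sanitizer_common API):
>>
>> #include <cstdint>
>> #include <map>
>> #include <vector>
>>
>> class StackDepotSketch {
>>   std::map<std::vector<uint64_t>, uint32_t> ids_;  // stack PCs -> unique id
>>   uint32_t next_id_ = 1;
>> public:
>>   // Returns the existing id for this trace, or assigns a new one. The id
>>   // is what the allocator stores per allocation (alloc_context_id).
>>   uint32_t Put(const std::vector<uint64_t> &pcs) {
>>     auto result = ids_.emplace(pcs, next_id_);
>>     if (result.second) ++next_id_;
>>     return result.first->second;
>>   }
>> };
>>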
>> ChunkHeader
>>
>> The heap profiler needs to track several pieces of information for each
>> allocation. Given the mapping of 64 bytes of memory to 8 bytes of shadow, we
>> can achieve a 32-byte minimum alignment by holding this information in a
>> 32-byte
>> header block preceding each allocation.
>>
>>
>>
>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
>> contains information about the current allocation state, user requested
>> size, allocation and free thread ids, the allocation context id
>> (representing the call stack at allocation, assigned by the StackDepot as
>> described above), and misc other bookkeeping. For heap profiling, this will
>> be converted to a 32-byte header block.
>>
>>
>>
>> Note that we could instead use the metadata section, similar to other
>> sanitizers, which is stored in a separate location. However, as described
>> above, storing the header block with each allocation enables 32-byte
>> alignment without aliasing shadow counters for the same 64 bytes of memory:
>> with a 32-byte header separating consecutive allocations, user data from two
>> different allocations can never land in the same 64-byte shadow granule.
>>
>>
>>
>> In the prototype heap profiler implementation, the header contains the
>> following fields:
>>
>>
>>
>> // Should be 32 bytes
>> struct ChunkHeader {
>>   // 1st 4 bytes
>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>   // reduced to 1 bit (available or allocated).
>>   u32 chunk_state       : 8;
>>   // Carry over from ASAN. Used to determine the start of user allocation.
>>   u32 from_memalign     : 1;
>>   // 23 bits available
>>
>>   // 2nd 4 bytes
>>   // Carry over from ASAN (comment copied verbatim).
>>   // This field is used for small sizes. For large sizes it is equal to
>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>   // SecondaryAllocator's metadata.
>>   u32 user_requested_size : 29;
>>
>>   // 3rd 4 bytes
>>   u32 cpu_id; // Allocation cpu id
>>
>>   // 4th 4 bytes
>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>   // the start of profiling (to keep this within 32 bits).
>>   u32 timestamp_ms;
>>
>>   // 5th and 6th 4 bytes
>>   // Carry over from ASAN. Used to identify allocation stack trace.
>>   u64 alloc_context_id;
>>
>>   // 7th and 8th 4 bytes
>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>   u64 data_type_id; // hash of type name
>> };
>>
>> As noted, the chunk state can be reduced to a single bit (no need for
>> quarantined memory in the heap profiler). The header contains a placeholder
>> for the data type hash, which is not yet implemented as it needs
>> instrumentation and IR support.
>> Heap Info Block (HIB)
>>
>> On a deallocation, information from the corresponding shadow block(s) and
>> header is recorded in a Heap Info Block (HIB) object. The access count is
>> computed from the shadow memory locations for the allocation, as is the
>> percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
>> 8-byte shadow locations across the whole allocation). Other information,
>> such as the deallocation timestamp (for lifetime computation) and the
>> deallocation cpu id (to determine migrations), is recorded along with the
>> information captured in the chunk header at allocation time. A sketch of the
>> shadow computation follows.
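>>
>> A minimal sketch of that computation, reusing the illustrative
>> heapprof_shadow helper from the instrumentation sketch above (names and
>> structure are hypothetical, not the prototype's actual code); the shadow
>> counters for the range would then be cleared for reuse:
>>
>> #include <cstdint>
>>
>> struct AccessStats {
>>   uint64_t access_count;      // sum of 8-byte shadow counters
>>   uint32_t percent_utilized;  // % of 64-byte blocks with non-zero counters
>> };
>>
>> inline AccessStats CollectStats(const void *user_beg, uint64_t user_size) {
>>   uint64_t total = 0, nonzero = 0, blocks = 0;
>>   uint64_t beg = reinterpret_cast<uint64_t>(user_beg) & ~0x3FULL;
>>   uint64_t end = reinterpret_cast<uint64_t>(user_beg) + user_size;
>>   for (uint64_t p = beg; p < end; p += 64, ++blocks) {
>>     uint64_t c = *heapprof_shadow(reinterpret_cast<const void *>(p));
>>     total += c;
>>     if (c) ++nonzero;
>>   }
>>   return {total,
>>           static_cast<uint32_t>(blocks ? nonzero * 100 / blocks : 0)};
>> }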
>>
>>
>>
>> The prototyped HIB object tracks the following:
>>
>>
>>
>> struct HeapInfoBlock {
>>   // Total allocations at this stack context
>>   u32 alloc_count;
>>   // Access count computed from all allocated 64-byte blocks (track total
>>   // across all allocations, and the min and max).
>>   u64 total_access_count, min_access_count, max_access_count;
>>   // Allocated size (track total across all allocations, and the min and max).
>>   u64 total_size;
>>   u32 min_size, max_size;
>>   // Lifetime (track total across all allocations, and the min and max).
>>   u64 total_lifetime;
>>   u32 min_lifetime, max_lifetime;
>>   // Percent utilization of allocated 64-byte blocks (track total
>>   // across all allocations, and the min and max). The utilization is
>>   // defined as the percentage of 8-byte shadow counters corresponding to
>>   // the full allocation that are non-zero.
>>   u64 total_percent_utilized;
>>   u32 min_percent_utilized, max_percent_utilized;
>>   // Allocation and deallocation timestamps from the most recent merge into
>>   // the table with this stack context.
>>   u32 alloc_timestamp, dealloc_timestamp;
>>   // Allocation and deallocation cpu ids from the most recent merge into
>>   // the table with this stack context.
>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>   // Count of allocations at this stack context that had a different
>>   // allocation and deallocation cpu id.
>>   u32 num_migrated_cpu;
>>   // Number of times the lifetime of the entry being merged overlapped with
>>   // the lifetime of the previous entry merged with this stack context (by
>>   // comparing the new alloc/dealloc timestamps with those last recorded in
>>   // the entry in the table).
>>   u32 num_lifetime_overlaps;
>>   // Number of times the alloc/dealloc cpu of the entry being merged was the
>>   // same as that of the previous entry merged with this stack context.
>>   u32 num_same_alloc_cpu;
>>   u32 num_same_dealloc_cpu;
>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
>>   // and possibly IR changes.
>>   u64 data_type_id;
>> };
>> HIB Table
>>
>> The Heap Info Block Table, which is a multi-way associative cache, holds
>> HIB objects from deallocated objects. It is indexed by the stack allocation
>> context id from the chunk header, and currently utilizes a simple mod with
>> a prime number close to a power of two as the hash (because of the way the
>> stack context ids are assigned, a mod of a power of two performs very
>> poorly). Thus far, only 4-way associativity has been evaluated.
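>>
>> A sketch of the lookup under that scheme (the constants and names here are
>> illustrative, not the prototype's actual values):
>>
>> #include <cstdint>
>>
>> constexpr uint32_t kNumSets = 8191;  // a prime close to 2^13
>> constexpr uint32_t kNumWays = 4;
>>
>> struct HIBEntry { uint64_t id = 0; /* plus the HeapInfoBlock fields */ };
>> HIBEntry Table[kNumSets][kNumWays];
>>
>> inline HIBEntry *Lookup(uint64_t alloc_context_id) {
>>   HIBEntry *set = Table[alloc_context_id % kNumSets];
>>   for (uint32_t w = 0; w < kNumWays; ++w)
>>     if (set[w].id == alloc_context_id) return &set[w];  // merge target
>>   return nullptr;  // caller inserts a new entry, possibly evicting one
>> }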
>>
>>
>>
>> HIB entries are added or merged into the HIB Table on each deallocation.
>> If an entry with a matching stack alloc context id is found in the Table,
>> the newly deallocated information is merged into the existing entry. Each
>> HIB Table entry currently tracks the min, max, and total values of the
>> various fields, for use in computing and reporting the min, max, and average
>> when the Table is ultimately dumped.
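>>
>> A sketch of that merge, assuming the HeapInfoBlock definition above (with
>> u32/u64 as the usual sanitizer fixed-width integer typedefs); the field
>> subset and helper name are illustrative:
>>
>> #include <algorithm>
>>
>> inline void MergeInto(HeapInfoBlock &e, u64 access_count, u32 size,
>>                       u32 lifetime_ms) {
>>   e.alloc_count += 1;
>>   e.total_access_count += access_count;
>>   e.min_access_count = std::min(e.min_access_count, access_count);
>>   e.max_access_count = std::max(e.max_access_count, access_count);
>>   e.total_size += size;
>>   e.min_size = std::min(e.min_size, size);
>>   e.max_size = std::max(e.max_size, size);
>>   e.total_lifetime += lifetime_ms;
>>   e.min_lifetime = std::min(e.min_lifetime, lifetime_ms);
>>   e.max_lifetime = std::max(e.max_lifetime, lifetime_ms);
>> }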
>>
>>
>>
>> If no entry with a matching stack alloc context id is found, a new entry
>> is created. If this causes an eviction, the evicted entry is dumped
>> immediately (by default to stderr, otherwise to a specified report file).
>> Later post processing can merge dumped entries with the same stack alloc
>> context id.
>> Initialization
>>
>>
>>
>> For ASAN, an __asan_init function initializes the memory allocation
>> tracking support, and the ASAN instrumentation pass in LLVM creates a
>> global constructor to invoke it. The heap profiler prototype adds a new
>> __heapprof_init function, which performs heap-profile-specific
>> initialization, and the heap profiler instrumentation pass invokes this new
>> init function instead via a generated global constructor. It currently
>> additionally invokes __asan_init, since we are leveraging a modified ASAN
>> runtime. Eventually, this should be changed to initialize the refactored
>> common support.
>>
>>
>>
>> Note that __asan_init is also placed in the .preinit_array when it is
>> available, so it is invoked even earlier than global constructors.
>> Currently, it is not possible to do this for __heapprof_init, as it calls
>> timespec_get in order to get a baseline timestamp (as described in the
>> ChunkHeader comments, the timestamps (in ms) are actually offsets from the
>> baseline timestamp, in order to fit into 32 bits), and system calls cannot
>> be made that early (dl_init is not complete). Since the constructor
>> priority is 1, it should be executed early enough that there are very few
>> allocations before it runs, and likely the best solution is to simply
>> ignore any allocations before initialization.
>> Dumping
>>
>> For the prototype, the profile is dumped as text in a compact raw
>> format to limit its size. Ultimately it should be dumped in a more compact
>> binary format (i.e. into a different section of the raw instrumentation-based
>> profile, with llvm-profdata performing post-processing), the details of which
>> are TBD.
>> HIB Dumping
>>
>> As noted earlier, HIB Table entries are created as memory is deallocated.
>> At the end of the run (or whenever dumping is requested, discussed later),
>> HIB entries need to be created for allocations that are still live.
>> Conveniently, the sanitizer allocator already contains a mechanism to walk
>> through all chunks of memory it is tracking (ForEachChunk). The heap
>> profiler simply looks for all chunks with a chunk state of allocated and
>> creates a HIB for each, just as would be done on deallocation, adding it to
>> the table.
>>
>>
>>
>> A HIB Table mechanism for printing each entry is then invoked.
>>
>>
>>
>> By default, the dumping occurs:
>>
>>    - on evictions
>>    - for the full table at exit (when the static Allocator object is
>>    destructed)
>>
>>
>>
>> For running in a load testing scenario, we will want to add a mechanism
>> to provoke finalization (merging currently live allocations) and dumping of
>> the HIB Table before exit. This would be similar to the __llvm_profile_dump
>> facility used for normal PGO counter dumping.
>> Stack Trace Dumping
>>
>> There is existing support for dumping symbolized StackTrace objects. A
>> wrapper to dump all StackTrace objects in the StackDepot will be added.
>> This new interface is invoked just after the HIB Table is dumped (on exit
>> or via the dumping interface).
>> Memory Map Dumping
>>
>> In cases where we may want to symbolize as a post-processing step, we may
>> need the memory map (from /proc/self/smaps). Specifically, this is needed
>> to symbolize binaries using ASLR (Address Space Layout Randomization).
>> There is already support for reading this file and dumping it to the
>> specified report output file (DumpProcessMap()). This is invoked when the
>> profile output file is initialized (HIB Table construction), so that the
>> memory map is available at the top of the raw profile.
>> Current Status and Next Steps
>>
>>
>>
>> As mentioned earlier, I have a working prototype based on a simplified,
>> stripped-down version of ASAN. My current plan is to do the following:
>>
>>    1. Refactor out some of the shadow setup code common between ASAN and
>>    HWASAN into sanitizer_common.
>>    2. Rework my prototype into a separate heapprof library in
>>    compiler-rt, using sanitizer_common support where possible, and send
>>    patches for review.
>>    3. Send patches for the heap profiler instrumentation pass and
>>    related clang options.
>>    4. Design and implement the binary profile format.
>>
>>
>>
>> --
>>
>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>>
>>
>>
>

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |