[llvm-dev] RFC: Sanitizer-based Heap Profiler

Mitch Phillips via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 6 11:48:49 PDT 2020


> I'm not aware that -fsanitize=* options disable these, but I know in our
> environment we do disable frame pointer omission when setting up ASAN
> builds, and I am arranging for heap profiling builds to do the same. Not
> sure whether we want to do this within clang itself, would be interested in
> Kostya's opinion. I can't see anywhere that we are disabling tail call
> optimizations for ASAN though, but I might have missed it.

We don't force frame pointers to be emitted with -fsanitize=address, at
least, although we highly recommend it
<https://clang.llvm.org/docs/AddressSanitizer.html#usage>: the frame
pointer unwinder is much faster than DWARF unwinding, which is particularly
important for stack collection on malloc/free.

On Mon, Jul 6, 2020 at 7:59 AM Teresa Johnson via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> Hi Wenlei,
>
> Thanks for the comments! David answered the first question; I do have some
> comments on the second one, though.
> Teresa
>
> On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:
>>
>>> This sounds very useful. We’ve improved and used memoro
>>> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and
>>> analysis, and we are also looking for ways to leverage memory profile for
>>> PGO/FDO. I think having a common profiling infrastructure for analysis
>>> tooling as well as profile guided optimizations is good design, and having
>>> it in LLVM is also helpful. Very interested in the tooling and optimization
>>> that comes after the profiler.
>>>
>>>
>>>
>>> Two questions:
>>>
>>>    - How does the profiling overhead look? Is that similar to ASAN
>>>    overhead from what you’ve seen, which would be higher than PGO
>>>    instrumentation? Asking because I’m wondering if any PGO training setup can
>>>    be used directly for the new heap profiling.
>>>
>>>
>> It is built on top of the ASAN runtime, but the overhead can be made much
>> lower by using counter update consolidation -- all fields sharing the same
>> shadow counter can be merged, and aggressive loop sinking/hoisting can be
>> done.
>>
>> The goal is to integrate this with the PGO instrumentation. The PGO
>> instrumentation overhead can be further reduced with a sampling technique
>> (Rong Xu has a patch to be submitted).
>>
>>
>>>    - I’m not familiar with how sanitizer handles stack trace, but for
>>>    getting most accurate calling context (use FP rather than dwarf), I guess
>>>    frame pointer omission and tail call opt etc. need to be turned off? Is
>>>    that going to be implied by -fheapprof?
>>>
>>>
>> Kostya can provide detailed answers to these questions.
>>
>
> I'm not aware that -fsanitize=* options disable these, but I know in our
> environment we do disable frame pointer omission when setting up ASAN
> builds, and I am arranging for heap profiling builds to do the same. Not
> sure whether we want to do this within clang itself, would be interested in
> Kostya's opinion. I can't see anywhere that we are disabling tail call
> optimizations for ASAN though, but I might have missed it.
>
> Thanks,
> Teresa
>
>
>>
>> David
>>
>>>
>>>
>>> Thanks,
>>>
>>> Wenlei
>>>
>>>
>>>
>>> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
>>> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
>>> *Reply-To: *Teresa Johnson <tejohnson at google.com>
>>> *Date: *Wednesday, June 24, 2020 at 4:58 PM
>>> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <
>>> kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka <
>>> vitalybuka at google.com>
>>> *Cc: *David Li <davidxl at google.com>
>>> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>>>
>>>
>>>
>>> Hi all,
>>>
>>>
>>>
>>> I've included an RFC for a heap profiler design I've been working on in
>>> conjunction with David Li. Please send any questions or feedback. For
>>> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
>>> shadow setup code (see the Shadow Memory section).
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Teresa
>>>
>>>
>>>
>>> RFC: Sanitizer-based Heap Profiler
>>> Summary
>>>
>>> This document provides an overview of an LLVM Sanitizer-based heap
>>> profiler design.
>>> Motivation
>>>
>>> The objective of heap memory profiling is to collect critical runtime
>>> information associated with heap memory references and information on heap
>>> memory allocations. The profile information will be used first for tooling,
>>> and subsequently to guide the compiler optimizer and allocation runtime to
>>> lay out heap objects with improved spatial locality. As a result, DTLB and
>>> cache utilization will be improved, and program IPC (performance) will be
>>> increased due to reduced TLB and cache misses. More details on the heap
>>> profile guided optimizations will be shared in the future.
>>> Overview
>>>
>>> The profiler is based on compiler inserted instrumentation of load and
>>> store accesses, and utilizes runtime support to monitor heap allocations
>>> and profile data. The target consumer of the heap memory profile
>>> information is initially tooling and ultimately automatic data layout
>>> optimizations performed by the compiler and/or allocation runtime (with the
>>> support of new allocation runtime APIs).
>>>
>>>
>>>
>>> Each memory address is mapped to Shadow Memory
>>> <https://en.wikipedia.org/wiki/Shadow_memory>,
>>> similar to the approach used by the Address Sanitizer
>>> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
>>> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
>>> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
>>> location implements the profile counter (incremented on accesses to the
>>> corresponding memory). This granularity was chosen to help avoid counter
>>> overflow, but it may be possible to consider mapping 32 bytes to 4 bytes.
>>> To avoid aliasing of shadow memory for different allocations, we must
>>> choose a minimum alignment carefully. As discussed further below, we can
>>> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
>>> storing necessary heap information for each allocation in a 32-byte header
>>> block.
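
A minimal sketch of the address-to-shadow mapping described above, assuming
a mask of the low 6 bits (64-byte blocks) and a shift by 3 (64 bytes of
memory to 8 bytes of shadow). The constant and function names are
hypothetical, and shadow_offset stands in for whatever base the runtime
picks:

```cpp
#include <cstdint>

// Assumed constants, derived from the 64-byte-to-8-byte mapping in the text.
constexpr std::uint64_t kBlockSize = 64;                 // memory bytes per counter
constexpr std::uint64_t kBlockMask = ~(kBlockSize - 1);  // clears the low 6 bits
constexpr unsigned kShadowShift = 3;                     // 64 / 8 == 2^3

// Maps an application address to the address of its 8-byte shadow counter.
// `shadow_offset` is a placeholder for the dynamically chosen shadow base.
constexpr std::uint64_t ShadowAddress(std::uint64_t addr,
                                      std::uint64_t shadow_offset) {
  return ((addr & kBlockMask) >> kShadowShift) + shadow_offset;
}
```

Every address inside one 64-byte block yields the same counter address, and
the next block's counter sits 8 bytes further along.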
>>>
>>>
>>>
>>> The compiler instruments each load and store to increment the associated
>>> shadow memory counter, in order to determine hotness.
>>>
>>>
>>>
>>> The heap profiler runtime is responsible for tracking allocations and
>>> deallocations, including the stack at each allocation, and information such
>>> as the allocation size and other statistics. I have implemented a prototype
>>> built using a stripped-down and modified version of ASAN; however, the
>>> final version will be a separate library utilizing sanitizer_common
>>> components.
>>> Compiler
>>>
>>> A simple HeapProfiler instrumentation pass instruments interesting
>>> memory accesses (loads, stores, atomics) with a simple load, increment, and
>>> store of the associated shadow memory location (computed via a mask and
>>> shift to map 64 bytes to an 8-byte shadow, plus an add of the shadow
>>> offset). The handling is very similar to, and based on, the ASAN
>>> instrumentation pass, with slightly different instrumentation code.
>>>
>>>
>>>
>>> Various techniques can be used to reduce the overhead, by aggressively
>>> coalescing counter updates (e.g. given the 32-byte alignment, accesses
>>> known to be in the same 32-byte block, or across possible aliases since we
>>> don’t care about the dereferenced values).
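
To illustrate the coalescing idea, here is a small sketch that models the
shadow as an array with one counter per 64-byte block of a simulated heap
starting at address 0. ToyShadow, RecordAccess and RecordAccesses are
hypothetical names, not part of the prototype. It compares one update per
access with a single merged update:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy shadow: one 8-byte counter per 64-byte block of a simulated heap that
// starts at address 0. This layout is an assumption made for illustration.
struct ToyShadow {
  std::vector<std::uint64_t> counters;
  explicit ToyShadow(std::size_t heap_bytes)
      : counters((heap_bytes + 63) / 64, 0) {}
  std::uint64_t &CounterFor(std::uint64_t addr) { return counters[addr / 64]; }
};

// Uncoalesced form: one load, increment, store per instrumented access.
inline void RecordAccess(ToyShadow &shadow, std::uint64_t addr) {
  std::uint64_t &counter = shadow.CounterFor(addr);
  counter = counter + 1;
}

// Coalesced form: n accesses known to hit the same block become one update.
inline void RecordAccesses(ToyShadow &shadow, std::uint64_t addr,
                           std::uint64_t n) {
  shadow.CounterFor(addr) += n;
}
```

Three field accesses at offsets 8, 16 and 24 of the same object leave the
same counter value either way, which is why the merged form is safe when
only hotness is being measured.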
>>>
>>>
>>>
>>> Additionally, the Clang driver needs to set up to link with the runtime
>>> library, much as it does with the sanitizers.
>>>
>>>
>>>
>>> A -fheapprof option is added to enable the instrumentation pass and
>>> runtime library linking. Similar to -fprofile-generate, -fheapprof will
>>> accept an argument specifying the directory in which to write the profile.
>>> Runtime
>>>
>>> The heap profiler runtime is responsible for tracking and reporting
>>> information about heap allocations and accesses, aggregated by allocation
>>> calling context: for example, the hotness, lifetime, and CPU affinity.
>>>
>>>
>>>
>>> A new heapprof library will be created within compiler-rt. It will
>>> leverage support within sanitizer_common, which already contains facilities
>>> like stack context tracking, needed by the heap profiler.
>>> Shadow Memory
>>>
>>> There are some basic facilities in sanitizer_common for mmap’ing the
>>> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
>>> libraries. In the case of ASAN, there is support for both statically
>>> assigned shadow offsets (the default on most platforms), and for
>>> dynamically assigned shadow memory (implemented for Windows and currently
>>> also used for Android and iOS). According to kcc, recent experiments show
>>> that the performance with a dynamic shadow is close to that with a static
>>> mapping. In fact, that is the only approach currently used by HWASAN. Given
>>> the simplicity, the heap profiler will be implemented with a dynamic shadow
>>> as well.
>>>
>>>
>>>
>>> There are a number of functions in ASAN and HWASAN related to setup of
>>> the shadow that are duplicated but very nearly identical, at least for
>>> Linux (which seems to be the only OS flavor currently supported for
>>> HWASAN), e.g. ReserveShadowMemoryRange, ProtectGap, and
>>> FindDynamicShadowStart. (In ASAN there is another nearly identical copy in
>>> PremapShadow, used by Android, whereas in HWASAN the premap handling is
>>> already shared with the non-premap handling.) Rather than make yet
>>> another copy of these mechanisms, I propose refactoring them into
>>> sanitizer_common versions. Like HWASAN, the initial version of the heap
>>> profiler will be supported for Linux only, but other OSes can be added as
>>> needed, similar to ASAN.
>>> StackTrace and StackDepot
>>>
>>> The sanitizer already contains support for obtaining and representing a
>>> stack trace in a StackTrace object, and storing it in the StackDepot which
>>> “efficiently stores huge amounts of stack traces”. This is in the
>>> sanitizer_common subdirectory and the support is shared by ASAN and
>>> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
>>> where each StackTrace is assigned a unique id. ASAN stores this id in the
>>> alloc_context_id field in each ChunkHeader (in the redzone preceding each
>>> allocation). Additionally, there is support for symbolizing and printing
>>> StackTrace objects.
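
As a rough illustration of the "one id per unique stack trace" idea, the
sketch below is a toy stand-in, not the sanitizer_common StackDepot API.
ToyStackDepot, Put and Get are hypothetical names, and an ordered map is
used here in place of the real hash table:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Toy stand-in for a stack depot; not the sanitizer_common implementation.
class ToyStackDepot {
 public:
  using Trace = std::vector<std::uintptr_t>;  // return addresses

  // Returns the id for `trace`, assigning a fresh one the first time it is
  // seen. Identical traces always map to the same id.
  std::uint32_t Put(const Trace &trace) {
    auto [it, inserted] = ids_.try_emplace(
        trace, static_cast<std::uint32_t>(traces_.size() + 1));
    if (inserted) traces_.push_back(trace);
    return it->second;
  }

  // Looks a trace up by id. Ids start at 1 so that 0 can mean "no context".
  const Trace &Get(std::uint32_t id) const { return traces_.at(id - 1); }

 private:
  std::map<Trace, std::uint32_t> ids_;
  std::vector<Trace> traces_;
};
```

An allocation header then only needs to hold the small id rather than the
full trace.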
>>> ChunkHeader
>>>
>>> The heap profiler needs to track several pieces of information for each
>>> allocation. Given the mapping of 64 bytes to 8 bytes of shadow, we can
>>> achieve a minimum of 32-byte alignment by holding this information in a
>>> 32-byte header block preceding each allocation.
>>>
>>>
>>>
>>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
>>> contains information about the current allocation state, user requested
>>> size, allocation and free thread ids, the allocation context id
>>> (representing the call stack at allocation, assigned by the StackDepot as
>>> described above), and misc other bookkeeping. For heap profiling, this will
>>> be converted to a 32-byte header block.
>>>
>>>
>>>
>>> Note that we could instead use the metadata section, similar to other
>>> sanitizers, which is stored in a separate location. However, as described
>>> above, storing the header block with each allocation enables 32-byte
>>> alignment without aliasing shadow counters for the same 64 bytes of memory.
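
The aliasing argument can be checked with a small model. The sketch below
assumes a hypothetical bump allocator (LayOut) that places a header in front
of each allocation and starts the next chunk at the next 32-byte boundary.
It also assumes instrumented code never touches the headers, so only user
bytes count toward a shadow counter. Sizes are assumed to be at least 1:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint64_t kAlign = 32;  // minimum alignment from the text
constexpr std::uint64_t kBlock = 64;  // memory covered by one shadow counter

struct UserRange {
  std::uint64_t begin, end;  // user bytes occupy [begin, end)
};

// Hypothetical bump allocator: header, then user memory, then round up to
// the next 32-byte boundary for the following chunk. Sizes must be >= 1.
inline std::vector<UserRange> LayOut(const std::vector<std::uint64_t> &sizes,
                                     std::uint64_t header_size) {
  std::vector<UserRange> out;
  std::uint64_t cursor = 0;
  for (std::uint64_t size : sizes) {
    std::uint64_t user = cursor + header_size;
    out.push_back({user, user + size});
    cursor = (user + size + kAlign - 1) / kAlign * kAlign;
  }
  return out;
}

// True if no 64-byte block holds user bytes of two consecutive allocations.
inline bool NoSharedBlocks(const std::vector<UserRange> &ranges) {
  for (std::size_t i = 1; i < ranges.size(); ++i) {
    std::uint64_t prev_last_block = (ranges[i - 1].end - 1) / kBlock;
    std::uint64_t first_block = ranges[i].begin / kBlock;
    if (first_block <= prev_last_block) return false;
  }
  return true;
}
```

With a 32-byte header, the half-block next to any user data is always
occupied by a header, so neighbouring allocations never share a counter.
With no header, two 1-byte allocations land in the same 64-byte block.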
>>>
>>>
>>>
>>> In the prototype heap profiler implementation, the header contains the
>>> following fields:
>>>
>>>
>>>
>>> // Should be 32 bytes
>>> struct ChunkHeader {
>>>   // 1-st 4 bytes
>>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>>   // reduced to 1 bit (available or allocated).
>>>   u32 chunk_state       : 8;
>>>   // Carry over from ASAN. Used to determine the start of user allocation.
>>>   u32 from_memalign     : 1;
>>>   // 23 bits available
>>>
>>>   // 2-nd 4 bytes
>>>   // Carry over from ASAN (comment copied verbatim).
>>>   // This field is used for small sizes. For large sizes it is equal to
>>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>>   // SecondaryAllocator's metadata.
>>>   u32 user_requested_size : 29;
>>>
>>>   // 3-rd 4 bytes
>>>   u32 cpu_id; // Allocation cpu id
>>>
>>>   // 4-th 4 bytes
>>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>>   // the start of profiling (to keep this within 32 bits).
>>>   u32 timestamp_ms;
>>>
>>>   // 5-th and 6-th 4 bytes
>>>   // Carry over from ASAN. Used to identify allocation stack trace.
>>>   u64 alloc_context_id;
>>>
>>>   // 7-th and 8-th 4 bytes
>>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>>   u64 data_type_id; // hash of type name
>>> };
>>>
>>> As noted, the chunk state can be reduced to a single bit (no need for
>>> quarantined memory in the heap profiler). The header contains a placeholder
>>> for the data type hash, which is not yet implemented as it needs
>>> instrumentation and IR support.
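
The "should be 32 bytes" requirement can be enforced at compile time. The
sketch below mirrors the prototype fields using standard fixed-width types
in place of u32/u64. It assumes a common ABI where the 29-bit field starts a
new 32-bit unit because it does not fit in the 23 spare bits. UserBegin is a
hypothetical helper:

```cpp
#include <cstddef>
#include <cstdint>

// Mirror of the prototype header using fixed-width standard types.
struct ChunkHeader {
  std::uint32_t chunk_state : 8;
  std::uint32_t from_memalign : 1;
  // Does not fit in the remaining 23 bits, so it occupies the next unit.
  std::uint32_t user_requested_size : 29;
  std::uint32_t cpu_id;
  std::uint32_t timestamp_ms;
  std::uint64_t alloc_context_id;
  std::uint64_t data_type_id;
};

// Fail the build if the layout drifts from the intended 32 bytes.
static_assert(sizeof(ChunkHeader) == 32, "header must stay 32 bytes");
static_assert(offsetof(ChunkHeader, cpu_id) == 8, "third 4-byte slot");
static_assert(offsetof(ChunkHeader, alloc_context_id) == 16, "fifth slot");

// Hypothetical helper: user memory starts right after the header.
inline std::uint64_t UserBegin(std::uint64_t header_addr) {
  return header_addr + sizeof(ChunkHeader);
}
```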
>>> Heap Info Block (HIB)
>>>
>>> On a deallocation, information from the corresponding shadow block(s)
>>> and header is recorded in a Heap Info Block (HIB) object. The access count
>>> is computed from the shadow memory locations for the allocation, as is
>>> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
>>> 8-byte shadow locations for the whole allocation). Other information, such
>>> as the deallocation timestamp (for lifetime computation) and deallocation
>>> cpu id (to determine migrations), is recorded along with the information in
>>> the chunk header recorded on allocation.
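
A short sketch of the two per-allocation statistics described above, given
the shadow counters that cover one allocation. AccessSummary and Summarize
are hypothetical names, and the integer percentage rounds down, which is an
assumption:

```cpp
#include <cstdint>
#include <vector>

struct AccessSummary {
  std::uint64_t access_count;      // sum of all counters for the allocation
  std::uint32_t percent_utilized;  // share of non-zero counters, 0..100
};

// Summarizes the 8-byte shadow counters covering a single allocation.
inline AccessSummary Summarize(
    const std::vector<std::uint64_t> &shadow_counters) {
  AccessSummary summary{0, 0};
  std::uint64_t nonzero = 0;
  for (std::uint64_t counter : shadow_counters) {
    summary.access_count += counter;
    if (counter != 0) ++nonzero;
  }
  if (!shadow_counters.empty()) {
    summary.percent_utilized = static_cast<std::uint32_t>(
        nonzero * 100 / shadow_counters.size());
  }
  return summary;
}
```

For a 256-byte allocation whose four counters read 5, 0, 3, 0, the access
count is 8 and the utilization is 50 percent.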
>>>
>>>
>>>
>>> The prototyped HIB object tracks the following:
>>>
>>>
>>>
>>> struct HeapInfoBlock {
>>>   // Total allocations at this stack context
>>>   u32 alloc_count;
>>>   // Access count computed from all allocated 64-byte blocks (track total
>>>   // across all allocations, and the min and max).
>>>   u64 total_access_count, min_access_count, max_access_count;
>>>   // Allocated size (track total across all allocations, and the min and
>>>   // max).
>>>   u64 total_size;
>>>   u32 min_size, max_size;
>>>   // Lifetime (track total across all allocations, and the min and max).
>>>   u64 total_lifetime;
>>>   u32 min_lifetime, max_lifetime;
>>>   // Percent utilization of allocated 64-byte blocks (track total
>>>   // across all allocations, and the min and max). The utilization is
>>>   // defined as the percentage of 8-byte shadow counters corresponding to
>>>   // the full allocation that are non-zero.
>>>   u64 total_percent_utilized;
>>>   u32 min_percent_utilized, max_percent_utilized;
>>>   // Allocation and deallocation timestamps from the most recent merge
>>>   // into the table with this stack context.
>>>   u32 alloc_timestamp, dealloc_timestamp;
>>>   // Allocation and deallocation cpu ids from the most recent merge into
>>>   // the table with this stack context.
>>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>>   // Count of allocations at this stack context that had a different
>>>   // allocation and deallocation cpu id.
>>>   u32 num_migrated_cpu;
>>>   // Number of times the lifetime of the entry being merged overlapped
>>>   // with that of the previous entry merged with this stack context
>>>   // (determined by comparing the new alloc/dealloc timestamps with the
>>>   // ones last recorded in the entry in the table).
>>>   u32 num_lifetime_overlaps;
>>>   // Number of times the alloc/dealloc cpu of the entry being merged was
>>>   // the same as that of the previous entry merged with this stack
>>>   // context.
>>>   u32 num_same_alloc_cpu;
>>>   u32 num_same_dealloc_cpu;
>>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation
>>>   // support and possibly IR changes.
>>>   u64 data_type_id;
>>> };
>>> HIB Table
>>>
>>> The Heap Info Block Table, which is a multi-way associative cache, holds
>>> HIB objects from deallocated objects. It is indexed by the stack allocation
>>> context id from the chunk header, and currently utilizes a simple mod with
>>> a prime number close to a power of two as the hash (because of the way the
>>> stack context ids are assigned, a mod of a power of two performs very
>>> poorly). Thus far, only 4-way associativity has been evaluated.
>>>
>>>
>>>
>>> HIB entries are added or merged into the HIB Table on each deallocation.
>>> If an entry with a matching stack alloc context id is found in the Table,
>>> the newly deallocated information is merged into the existing entry. Each
>>> HIB Table entry currently tracks the min, max and total value of the
>>> various fields for use in computing and reporting the min, max and average
>>> when the Table is ultimately dumped.
>>>
>>>
>>>
>>> If no entry with a matching stack alloc context id is found, a new entry
>>> is created. If this causes an eviction, the evicted entry is dumped
>>> immediately (by default to stderr, otherwise to a specified report file).
>>> Later post processing can merge dumped entries with the same stack alloc
>>> context id.
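
The table behaviour described above (index by context id modulo a prime,
merge on a hit, evict when the set is full) might look roughly like the
sketch below. It is a toy with a trimmed-down entry. The set count 8191,
the use of id 0 as "empty" and the choice to always evict way 0 are
assumptions, since the RFC does not state them:

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Trimmed-down entry; the full HeapInfoBlock tracks more fields.
struct Entry {
  std::uint64_t context_id = 0;  // 0 marks an empty way (an assumption)
  std::uint32_t alloc_count = 0;
  std::uint64_t total_size = 0;
  std::uint32_t min_size = 0, max_size = 0;
};

class ToyHibTable {
 public:
  static constexpr std::size_t kWays = 4;     // associativity from the text
  static constexpr std::size_t kSets = 8191;  // assumed prime, 2^13 - 1

  ToyHibTable() : sets_(kSets) {}

  // Records one deallocation of `size` bytes from context `context_id`.
  void Insert(std::uint64_t context_id, std::uint32_t size) {
    std::array<Entry, kWays> &set = sets_[context_id % kSets];
    Entry *slot = nullptr;
    for (Entry &e : set) {
      if (e.context_id == context_id) {  // hit: merge into existing entry
        ++e.alloc_count;
        e.total_size += size;
        e.min_size = std::min(e.min_size, size);
        e.max_size = std::max(e.max_size, size);
        return;
      }
      if (e.context_id == 0 && slot == nullptr) slot = &e;
    }
    if (slot == nullptr) {  // set full: evict way 0 (assumed policy)
      evicted_.push_back(set[0]);
      slot = &set[0];
    }
    *slot = Entry{context_id, 1, size, size, size};
  }

  // Returns the resident entry for `context_id`, or nullptr.
  const Entry *Find(std::uint64_t context_id) const {
    for (const Entry &e : sets_[context_id % kSets])
      if (e.context_id == context_id) return &e;
    return nullptr;
  }

  // Entries pushed out of the table; the prototype dumps these immediately.
  const std::vector<Entry> &evicted() const { return evicted_; }

 private:
  std::vector<std::array<Entry, kWays>> sets_;
  std::vector<Entry> evicted_;
};
```

Averages fall out at dump time as total divided by alloc_count, and a
post-processing step can merge an evicted entry with a later entry for the
same context id.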
>>> Initialization
>>>
>>>
>>>
>>> For ASAN, an __asan_init function initializes the memory allocation
>>> tracking support, and the ASAN instrumentation pass in LLVM creates a
>>> global constructor to invoke it. The heap profiler prototype adds a new
>>> __heapprof_init function, which performs heap profile specific
>>> initialization, and the heap profile instrumentation pass generates a
>>> global constructor that calls this new init function instead. The new
>>> function currently also invokes __asan_init, since we are leveraging a
>>> modified ASAN runtime. Eventually, this should be changed to initialize
>>> refactored common support.
>>>
>>>
>>>
>>> Note that __asan_init is also placed in the .preinit_array when it is
>>> available, so it is invoked even earlier than global constructors.
>>> Currently, it is not possible to do this for __heapprof_init, as it calls
>>> timespec_get in order to get a baseline timestamp (as described in the
>>> ChunkHeader comments, the timestamps (ms) are actually offsets from the
>>> baseline timestamp, in order to fit into 32 bits), and system calls cannot
>>> be made that early (dl_init is not complete). Since the constructor
>>> priority is 1, it should be executed early enough that there are very few
>>> allocations before it runs, and likely the best solution is to simply
>>> ignore any allocations before initialization.
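
For the baseline-relative timestamps, a minimal sketch using
std::timespec_get (C++17) could look like this. MillisSince and Now are
hypothetical helper names, and the result simply truncates to 32 bits:

```cpp
#include <cstdint>
#include <ctime>

// Milliseconds elapsed from `baseline` to `now`, truncated to 32 bits.
inline std::uint32_t MillisSince(const std::timespec &baseline,
                                 const std::timespec &now) {
  std::int64_t sec = static_cast<std::int64_t>(now.tv_sec) -
                     static_cast<std::int64_t>(baseline.tv_sec);
  std::int64_t nsec = static_cast<std::int64_t>(now.tv_nsec) -
                      static_cast<std::int64_t>(baseline.tv_nsec);
  std::int64_t total_ns = sec * 1000000000LL + nsec;
  return static_cast<std::uint32_t>(total_ns / 1000000);
}

// Captures the current calendar time; call once at init for the baseline.
inline std::timespec Now() {
  std::timespec ts{};
  std::timespec_get(&ts, TIME_UTC);
  return ts;
}
```

A 32-bit millisecond offset wraps after 2^32 ms, which is roughly 49.7
days, so this only suits profiling runs shorter than that.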
>>> Dumping
>>>
>>> For the prototype, the profile is dumped as text with a compact raw
>>> format to limit its size. Ultimately it should be dumped in a more compact
>>> binary format (i.e. into a different section of the raw instrumentation
>>> based profile, with llvm-profdata performing post-processing) which is TBD.
>>> HIB Dumping
>>>
>>> As noted earlier, HIB Table entries are created as memory is
>>> deallocated. At the end of the run (or whenever dumping is requested,
>>> discussed later), HIB entries need to be created for allocations that are
>>> still live. Conveniently, the sanitizer allocator already contains a
>>> mechanism to walk through all chunks of memory it is tracking
>>> (ForEachChunk). The heap profiler simply looks for all chunks with a
>>> chunk state of allocated and creates a HIB for each, just as it would on
>>> deallocation, adding each to the table.
>>>
>>>
>>>
>>> A HIB Table mechanism for printing each entry is then invoked.
>>>
>>>
>>>
>>> By default, the dumping occurs:
>>>
>>>    - on evictions
>>>    - full table at exit (when the static Allocator object is destructed)
>>>
>>>
>>>
>>> For running in a load testing scenario, we will want to add a mechanism
>>> to provoke finalization (merging currently live allocations) and dumping of
>>> the HIB Table before exit. This would be similar to the __llvm_profile_dump
>>> facility used for normal PGO counter dumping.
>>> Stack Trace Dumping
>>>
>>> There is existing support for dumping symbolized StackTrace objects. A
>>> wrapper to dump all StackTrace objects in the StackDepot will be added.
>>> This new interface is invoked just after the HIB Table is dumped (on exit
>>> or via dumping interface).
>>> Memory Map Dumping
>>>
>>> In cases where we may want to symbolize as a post processing step, we
>>> may need the memory map (from /proc/self/smaps). Specifically, this is
>>> needed to symbolize binaries using ASLR (Address Space Layout
>>> Randomization). There is already support for reading this file and dumping
>>> it to the specified report output file (DumpProcessMap()). This is invoked
>>> when the profile output file is initialized (HIB Table construction), so
>>> that the memory map is available at the top of the raw profile.
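
To show why the memory map matters for offline symbolization, the sketch
below parses the header line of one mapping (the "start-end perms offset
dev inode path" form used in /proc/self/maps and in smaps entry headers)
and rebases a runtime PC to an offset within the mapped file. Mapping,
ParseMappingLine and ToFileOffset are hypothetical names. A real
symbolizer would still have to translate the file offset to a virtual
address in the binary:

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>

struct Mapping {
  std::uint64_t start = 0, end = 0;  // [start, end) in the address space
  std::uint64_t file_offset = 0;     // offset of the mapping in the file
};

// Parses the leading "start-end perms offset" fields of a mapping line.
inline bool ParseMappingLine(const char *line, Mapping *out) {
  char perms[5] = {0};
  return std::sscanf(line, "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64,
                     &out->start, &out->end, perms,
                     &out->file_offset) == 4;
}

// Converts a runtime PC into an offset in the mapped file, undoing the
// randomized load address. Returns false if the PC is outside the mapping.
inline bool ToFileOffset(const Mapping &m, std::uint64_t pc,
                         std::uint64_t *offset) {
  if (pc < m.start || pc >= m.end) return false;
  *offset = pc - m.start + m.file_offset;
  return true;
}
```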
>>> Current Status and Next Steps
>>>
>>>
>>>
>>> As mentioned earlier, I have a working prototype based on a simplified,
>>> stripped-down version of ASAN. My current plan is to do the following:
>>>
>>>    1. Refactor out some of the shadow setup code common between ASAN
>>>    and HWASAN into sanitizer_common.
>>>    2. Rework my prototype into a separate heapprof library in
>>>    compiler-rt, using sanitizer_common support where possible, and send
>>>    patches for review.
>>>    3. Send patches for the heap profiler instrumentation pass and
>>>    related clang options.
>>>    4. Design and implement the binary profile format.
>>>
>>>
>>>
>>> --
>>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>>>
>>
>
> --
> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>