[llvm-dev] RFC: Sanitizer-based Heap Profiler

Kostya Serebryany via llvm-dev llvm-dev at lists.llvm.org
Wed Jul 8 18:21:04 PDT 2020


On Mon, Jul 6, 2020 at 11:49 AM Mitch Phillips via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> > I'm not aware that -fsanitizer* options disable these, but I know in our
> environment we do disable frame pointer omission when setting up ASAN
> builds, and I am arranging for heap profiling builds to do the same. Not
> sure whether we want to do this within clang itself, would be interested in
> Kostya's opinion. I can't see anywhere that we are disabling tail call
> optimizations for ASAN though, but I might have missed it.
>
> We don't force frame pointers to be emitted with -fsanitize=address at
> least -- although we highly recommend it
> <https://clang.llvm.org/docs/AddressSanitizer.html#usage> as the frame
> pointer unwinder is much faster than DWARF, particularly important for
> stack collection on malloc/free.
>

Correct. Frame pointers are a must for fast collection of stack traces, but
we decouple this requirement from ASAN.
It is perfectly valid to run ASAN without frame pointers, but then you either
have to use an exceptionally slow unwinder or you get garbled error messages.


>
> On Mon, Jul 6, 2020 at 7:59 AM Teresa Johnson via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Hi Wenlei,
>>
>> Thanks for the comments! David answered the first question, I do have
>> some comments on the second one though.
>> Teresa
>>
>> On Sun, Jul 5, 2020 at 1:44 PM Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>>
>>> On Sat, Jul 4, 2020 at 11:28 PM Wenlei He <wenlei at fb.com> wrote:
>>>
>>>> This sounds very useful. We’ve improved and used memoro
>>>> <https://www.youtube.com/watch?v=fm47XsATelI> for memory profiling and
>>>> analysis, and we are also looking for ways to leverage memory profiles for
>>>> PGO/FDO. I think having a common profiling infrastructure for analysis
>>>> tooling as well as profile guided optimizations is a good design, and having
>>>> it in LLVM is also helpful. Very interested in the tooling and optimizations
>>>> that come after the profiler.
>>>>
>>>>
>>>>
>>>> Two questions:
>>>>
>>>>    - How does the profiling overhead look? Is that similar to ASAN
>>>>    overhead from what you’ve seen, which would be higher than PGO
>>>>    instrumentation? Asking because I’m wondering if any PGO training setup can
>>>>    be used directly for the new heap profiling.
>>>>
>>>>
>>> It is built on top of the ASAN runtime, but the overhead can be made much
>>> lower by using counter update consolidation -- all fields sharing the same
>>> shadow counter can be merged, and aggressive loop sinking/hoisting can be
>>> done.
>>>
>>> The goal is to integrate this with the PGO instrumentation. The PGO
>>> instrumentation overhead can be further reduced with sampling technique
>>> (Rong Xu has a patch to be submitted).
>>>
>>>
>>>>    - I’m not familiar with how sanitizer handles stack trace, but for
>>>>    getting most accurate calling context (use FP rather than dwarf), I guess
>>>>    frame pointer omission and tail call opt etc. need to be turned off? Is
>>>>    that going to be implied by -fheapprof?
>>>>
>>>>
>>> Kostya can provide detailed answers to these questions.
>>>
>>
>> I'm not aware that -fsanitizer* options disable these, but I know in our
>> environment we do disable frame pointer omission when setting up ASAN
>> builds, and I am arranging for heap profiling builds to do the same. Not
>> sure whether we want to do this within clang itself, would be interested in
>> Kostya's opinion. I can't see anywhere that we are disabling tail call
>> optimizations for ASAN though, but I might have missed it.
>>
>> Thanks,
>> Teresa
>>
>>
>>>
>>> David
>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Wenlei
>>>>
>>>>
>>>>
>>>> *From: *llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Teresa
>>>> Johnson via llvm-dev <llvm-dev at lists.llvm.org>
>>>> *Reply-To: *Teresa Johnson <tejohnson at google.com>
>>>> *Date: *Wednesday, June 24, 2020 at 4:58 PM
>>>> *To: *llvm-dev <llvm-dev at lists.llvm.org>, Kostya Serebryany <
>>>> kcc at google.com>, Evgenii Stepanov <eugenis at google.com>, Vitaly Buka <
>>>> vitalybuka at google.com>
>>>> *Cc: *David Li <davidxl at google.com>
>>>> *Subject: *[llvm-dev] RFC: Sanitizer-based Heap Profiler
>>>>
>>>>
>>>>
>>>> Hi all,
>>>>
>>>>
>>>>
>>>> I've included an RFC for a heap profiler design I've been working on in
>>>> conjunction with David Li. Please send any questions or feedback. For
>>>> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
>>>> shadow setup code (see the Shadow Memory section).
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Teresa
>>>>
>>>>
>>>>
>>>> RFC: Sanitizer-based Heap Profiler
>>>> Summary
>>>>
>>>> This document provides an overview of an LLVM Sanitizer-based heap
>>>> profiler design.
>>>> Motivation
>>>>
>>>> The objective of heap memory profiling is to collect critical runtime
>>>> information associated with heap memory references and information on heap
>>>> memory allocations. The profile information will be used first for tooling,
>>>> and subsequently to guide the compiler optimizer and allocation runtime to
>>>> layout heap objects with improved spatial locality. As a result, DTLB and
>>>> cache utilization will be improved, and program IPC (performance) will be
>>>> increased due to reduced TLB and cache misses. More details on the heap
>>>> profile guided optimizations will be shared in the future.
>>>> Overview
>>>>
>>>> The profiler is based on compiler-inserted instrumentation of load and
>>>> store accesses, and utilizes runtime support to monitor heap allocations
>>>> and collect profile data. The target consumer of the heap memory profile
>>>> information is initially tooling, and ultimately automatic data layout
>>>> optimizations performed by the compiler and/or allocation runtime (with the
>>>> support of new allocation runtime APIs).
>>>>
>>>>
>>>>
>>>> Each memory address is mapped to Shadow Memory
>>>> <https://en.wikipedia.org/wiki/Shadow_memory>,
>>>> similar to the approach used by the Address Sanitizer
>>>> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
>>>> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
>>>> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
>>>> location implements the profile counter (incremented on accesses to the
>>>> corresponding memory). This granularity was chosen to help avoid counter
>>>> overflow, but it may be possible to consider mapping 32 bytes to 4 bytes.
>>>> To avoid aliasing of shadow memory for different allocations, we must
>>>> choose a minimum alignment carefully. As discussed further below, we can
>>>> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
>>>> storing necessary heap information for each allocation in a 32-byte header
>>>> block.
>>>>
>>>>
>>>>
>>>> The compiler instruments each load and store to increment the
>>>> associated shadow memory counter, in order to determine hotness.
>>>>
>>>>
>>>>
>>>> The heap profiler runtime is responsible for tracking allocations and
>>>> deallocations, including the stack at each allocation, and information such
>>>> as the allocation size and other statistics. I have implemented a prototype
>>>> built using a stripped down and modified version of ASAN; however, this will
>>>> ultimately be a separate library utilizing sanitizer_common components.
>>>> Compiler
>>>>
>>>> A simple HeapProfiler instrumentation pass instruments interesting
>>>> memory accesses (loads, stores, atomics), with a simple load, increment,
>>>> store of the associated shadow memory location (computed via a mask and
>>>> shift to do the mapping of 64 bytes to 8 byte shadow, and add of the shadow
>>>> offset). The handling is very similar to and based off of the ASAN
>>>> instrumentation pass, with slightly different instrumentation code.
>>>>
>>>>
>>>>
>>>> Various techniques can be used to reduce the overhead, by aggressively
>>>> coalescing counter updates (e.g. given the 32-byte alignment, accesses
>>>> known to be in the same 32-byte block, or across possible aliases since we
>>>> don’t care about the dereferenced values).
>>>>
>>>>
>>>>
>>>> Additionally, the Clang driver needs to be set up to link with the runtime
>>>> library, much as it does for the sanitizers.
>>>>
>>>>
>>>>
>>>> A -fheapprof option is added to enable the instrumentation pass and
>>>> runtime library linking. Similar to -fprofile-generate, -fheapprof
>>>> will accept an argument specifying the directory in which to write the
>>>> profile.
>>>> Runtime
>>>>
>>>> The heap profiler runtime is responsible for tracking and reporting
>>>> information about heap allocations and accesses, aggregated by allocation
>>>> calling context: for example, hotness, lifetime, and cpu affinity.
>>>>
>>>>
>>>>
>>>> A new heapprof library will be created within compiler-rt. It will
>>>> leverage support within sanitizer_common, which already contains facilities
>>>> like stack context tracking, needed by the heap profiler.
>>>> Shadow Memory
>>>>
>>>> There are some basic facilities in sanitizer_common for mmap’ing the
>>>> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
>>>> libraries. In the case of ASAN, there is support for both statically
>>>> assigned shadow offsets (the default on most platforms), and for
>>>> dynamically assigned shadow memory (implemented for Windows and currently
>>>> also used for Android and iOS). According to kcc, recent experiments show
>>>> that the performance with a dynamic shadow is close to that with a static
>>>> mapping. In fact, that is the only approach currently used by HWASAN. Given
>>>> the simplicity, the heap profiler will be implemented with a dynamic shadow
>>>> as well.
>>>>
>>>>
>>>>
>>>> There are a number of functions in ASAN and HWASAN related to setup of
>>>> the shadow that are duplicated but very nearly identical, at least for
>>>> Linux (which seems to be the only OS flavor currently supported for
>>>> HWASAN). E.g. ReserveShadowMemoryRange, ProtectGap, and
>>>> FindDynamicShadowStart (in ASAN there is another nearly identical copy in
>>>> PremapShadow, used by Android, whereas in HWASAN the premap handling is
>>>> already commoned with the non-premap handling). Rather than make yet
>>>> another copy of these mechanisms, I propose refactoring them into
>>>> sanitizer_common versions. Like HWASAN, the initial version of the heap
>>>> profiler will be supported on Linux only, but other OSes can be added as
>>>> needed, similar to ASAN.
>>>> StackTrace and StackDepot
>>>>
>>>> The sanitizer already contains support for obtaining and representing a
>>>> stack trace in a StackTrace object, and storing it in the StackDepot which
>>>> “efficiently stores huge amounts of stack traces”. This is in the
>>>> sanitizer_common subdirectory and the support is shared by ASAN and
>>>> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
>>>> where each StackTrace is assigned a unique id. ASAN stores this id in the
>>>> alloc_context_id field in each ChunkHeader (in the redzone preceding each
>>>> allocation). Additionally, there is support for symbolizing and printing
>>>> StackTrace objects.
>>>> ChunkHeader
>>>>
>>>> The heap profiler needs to track several pieces of information for each
>>>> allocation. Given the mapping of 64 bytes to 8 bytes of shadow, we can
>>>> achieve a minimum of 32-byte alignment by holding this information in a
>>>> 32-byte header block preceding each allocation.
>>>>
>>>>
>>>>
>>>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
>>>> contains information about the current allocation state, user requested
>>>> size, allocation and free thread ids, the allocation context id
>>>> (representing the call stack at allocation, assigned by the StackDepot as
>>>> described above), and misc other bookkeeping. For heap profiling, this will
>>>> be converted to a 32-byte header block.
>>>>
>>>>
>>>>
>>>> Note that we could instead use the metadata section, similar to other
>>>> sanitizers, which is stored in a separate location. However, as described
>>>> above, storing the header block with each allocation enables 32-byte
>>>> alignment without aliasing shadow counters for the same 64 bytes of memory.
>>>>
>>>>
>>>>
>>>> In the prototype heap profiler implementation, the header contains the
>>>> following fields:
>>>>
>>>>
>>>>
>>>> // Should be 32 bytes
>>>> struct ChunkHeader {
>>>>   // 1st 4 bytes
>>>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>>>   // reduced to 1 bit (available or allocated).
>>>>   u32 chunk_state : 8;
>>>>   // Carry over from ASAN. Used to determine the start of the user
>>>>   // allocation.
>>>>   u32 from_memalign : 1;
>>>>   // 23 bits available
>>>>
>>>>   // 2nd 4 bytes
>>>>   // Carry over from ASAN (comment copied verbatim).
>>>>   // This field is used for small sizes. For large sizes it is equal to
>>>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>>>   // SecondaryAllocator's metadata.
>>>>   u32 user_requested_size : 29;
>>>>
>>>>   // 3rd 4 bytes
>>>>   u32 cpu_id; // Allocation cpu id
>>>>
>>>>   // 4th 4 bytes
>>>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>>>   // the start of profiling (to keep this within 32 bits).
>>>>   u32 timestamp_ms;
>>>>
>>>>   // 5th and 6th 4 bytes
>>>>   // Carry over from ASAN. Used to identify the allocation stack trace.
>>>>   u64 alloc_context_id;
>>>>
>>>>   // 7th and 8th 4 bytes
>>>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>>>   u64 data_type_id; // hash of type name
>>>> };
>>>>
>>>> As noted, the chunk state can be reduced to a single bit (no need for
>>>> quarantined memory in the heap profiler). The header contains a placeholder
>>>> for the data type hash, which is not yet implemented as it needs
>>>> instrumentation and IR support.
>>>> Heap Info Block (HIB)
>>>>
>>>> On a deallocation, information from the corresponding shadow block(s)
>>>> and header is recorded in a Heap Info Block (HIB) object. The access count
>>>> is computed from the shadow memory locations for the allocation, as is
>>>> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
>>>> 8-byte shadow locations for the whole allocation). Other information, such
>>>> as the deallocation timestamp (for lifetime computation) and deallocation
>>>> cpu id (to determine migrations), is recorded along with the information
>>>> from the chunk header recorded at allocation.
>>>>
>>>>
>>>>
>>>> The prototyped HIB object tracks the following:
>>>>
>>>>
>>>>
>>>> struct HeapInfoBlock {
>>>>   // Total allocations at this stack context.
>>>>   u32 alloc_count;
>>>>   // Access count computed from all allocated 64-byte blocks (track
>>>>   // total across all allocations, and the min and max).
>>>>   u64 total_access_count, min_access_count, max_access_count;
>>>>   // Allocated size (track total across all allocations, and the min
>>>>   // and max).
>>>>   u64 total_size;
>>>>   u32 min_size, max_size;
>>>>   // Lifetime (track total across all allocations, and the min and max).
>>>>   u64 total_lifetime;
>>>>   u32 min_lifetime, max_lifetime;
>>>>   // Percent utilization of allocated 64-byte blocks (track total
>>>>   // across all allocations, and the min and max). The utilization is
>>>>   // defined as the percentage of 8-byte shadow counters corresponding
>>>>   // to the full allocation that are non-zero.
>>>>   u64 total_percent_utilized;
>>>>   u32 min_percent_utilized, max_percent_utilized;
>>>>   // Allocation and deallocation timestamps from the most recent merge
>>>>   // into the table with this stack context.
>>>>   u32 alloc_timestamp, dealloc_timestamp;
>>>>   // Allocation and deallocation cpu ids from the most recent merge
>>>>   // into the table with this stack context.
>>>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>>>   // Count of allocations at this stack context that had a different
>>>>   // allocation and deallocation cpu id.
>>>>   u32 num_migrated_cpu;
>>>>   // Number of times the lifetime of the entry being merged overlapped
>>>>   // with that of the previous entry merged with this stack context (by
>>>>   // comparing the new alloc/dealloc timestamps with those last
>>>>   // recorded in the entry in the table).
>>>>   u32 num_lifetime_overlaps;
>>>>   // Number of times the alloc/dealloc cpu of the entry being merged
>>>>   // was the same as that of the previous entry merged with this stack
>>>>   // context.
>>>>   u32 num_same_alloc_cpu;
>>>>   u32 num_same_dealloc_cpu;
>>>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation
>>>>   // support and possibly IR changes.
>>>>   u64 data_type_id;
>>>> };
>>>> HIB Table
>>>>
>>>> The Heap Info Block Table, which is a multi-way associative cache,
>>>> holds HIB objects from deallocated objects. It is indexed by the stack
>>>> allocation context id from the chunk header, and currently utilizes a
>>>> simple mod with a prime number close to a power of two as the hash (because
>>>> of the way the stack context ids are assigned, a mod of a power of two
>>>> performs very poorly). Thus far, only 4-way associativity has been
>>>> evaluated.
>>>>
>>>>
>>>>
>>>> HIB entries are added or merged into the HIB Table on each
>>>> deallocation. If an entry with a matching stack alloc context id is found
>>>> in the Table, the newly deallocated information is merged into the existing
>>>> entry. Each HIB Table entry currently tracks the min, max and total value
>>>> of the various fields for use in computing and reporting the min, max and
>>>> average when the Table is ultimately dumped.
>>>>
>>>>
>>>>
>>>> If no entry with a matching stack alloc context id is found, a new
>>>> entry is created. If this causes an eviction, the evicted entry is dumped
>>>> immediately (by default to stderr, otherwise to a specified report file).
>>>> Later post processing can merge dumped entries with the same stack alloc
>>>> context id.
>>>> Initialization
>>>>
>>>>
>>>>
>>>> For ASAN, an __asan_init function initializes the memory allocation
>>>> tracking support, and the ASAN instrumentation pass in LLVM creates a
>>>> global constructor to invoke it. The heap profiler prototype adds a new
>>>> __heapprof_init function, which performs heap profile specific
>>>> initialization; the heap profile instrumentation pass calls this new
>>>> init function instead, via a generated global constructor. It currently
>>>> also invokes __asan_init, since we are leveraging a modified ASAN
>>>> runtime. Eventually, this should be changed to initialize refactored common
>>>> support.
>>>>
>>>>
>>>>
>>>> Note that __asan_init is also placed in the .preinit_array when it is
>>>> available, so it is invoked even earlier than global constructors.
>>>> Currently, it is not possible to do this for __heapprof_init, as it calls
>>>> timespec_get in order to get a baseline timestamp (as described in the
>>>> ChunkHeader comments, the timestamps (in ms) are actually offsets from the
>>>> baseline timestamp, in order to fit into 32 bits), and system calls cannot
>>>> be made that early (dl_init is not complete). Since the constructor
>>>> priority is 1, it should be executed early enough that there are very few
>>>> allocations before it runs, and likely the best solution is to simply
>>>> ignore any allocations before initialization.
>>>> Dumping
>>>>
>>>> For the prototype, the profile is dumped as text in a compact raw
>>>> format to limit its size. Ultimately it should be dumped in a more compact
>>>> binary format (i.e. into a different section of the raw instrumentation
>>>> based profile, with llvm-profdata performing post-processing), which is TBD.
>>>> HIB Dumping
>>>>
>>>> As noted earlier, HIB Table entries are created as memory is
>>>> deallocated. At the end of the run (or whenever dumping is requested,
>>>> discussed later), HIB entries need to be created for allocations that are
>>>> still live. Conveniently, the sanitizer allocator already contains a
>>>> mechanism to walk through all chunks of memory it is tracking (
>>>> ForEachChunk). The heap profiler simply looks for all chunks with a
>>>> chunk state of allocated, creates a HIB for each just as would be done on
>>>> deallocation, and adds it to the table.
>>>>
>>>>
>>>>
>>>> A HIB Table mechanism for printing each entry is then invoked.
>>>>
>>>>
>>>>
>>>> By default, the dumping occurs:
>>>>
>>>>    - on evictions
>>>>    - full table at exit (when the static Allocator object is
>>>>    destructed)
>>>>
>>>>
>>>>
>>>> For running in a load testing scenario, we will want to add a mechanism
>>>> to provoke finalization (merging currently live allocations) and dumping of
>>>> the HIB Table before exit. This would be similar to the __llvm_profile_dump
>>>> facility used for normal PGO counter dumping.
>>>> Stack Trace Dumping
>>>>
>>>> There is existing support for dumping symbolized StackTrace objects. A
>>>> wrapper to dump all StackTrace objects in the StackDepot will be added.
>>>> This new interface is invoked just after the HIB Table is dumped (on exit
>>>> or via dumping interface).
>>>> Memory Map Dumping
>>>>
>>>> In cases where we may want to symbolize as a post processing step, we
>>>> may need the memory map (from /proc/self/smaps). Specifically, this is
>>>> needed to symbolize binaries using ASLR (Address Space Layout
>>>> Randomization). There is already support for reading this file and dumping
>>>> it to the specified report output file (DumpProcessMap()). This is invoked
>>>> when the profile output file is initialized (HIB Table construction), so
>>>> that the memory map is available at the top of the raw profile.
>>>> Current Status and Next Steps
>>>>
>>>>
>>>>
>>>> As mentioned earlier, I have a working prototype based on a simplified
>>>> stripped down version of ASAN. My current plan is to do the following:
>>>>
>>>>    1. Refactor out some of the shadow setup code common between ASAN
>>>>    and HWASAN into sanitizer_common.
>>>>    2. Rework my prototype into a separate heapprof library in
>>>>    compiler-rt, using sanitizer_common support where possible, and send
>>>>    patches for review.
>>>>    3. Send patches for the heap profiler instrumentation pass and
>>>>    related clang options.
>>>>    4. Design/implement binary profile format
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>>>>
>>>>
>>>
>>
>> --
>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>