[llvm-dev] RFC: Sanitizer-based Heap Profiler

Kostya Serebryany via llvm-dev llvm-dev at lists.llvm.org
Thu Jul 9 14:32:56 PDT 2020


On Thu, Jul 9, 2020 at 2:07 PM Teresa Johnson <tejohnson at google.com> wrote:

>
>
> On Thu, Jul 9, 2020 at 1:37 PM Kostya Serebryany <kcc at google.com> wrote:
>
>>
>>
>>
>> On Thu, Jul 9, 2020 at 12:36 PM Teresa Johnson <tejohnson at google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jul 8, 2020 at 6:30 PM Kostya Serebryany <kcc at google.com> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 24, 2020 at 4:58 PM Teresa Johnson <tejohnson at google.com>
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I've included an RFC for a heap profiler design I've been working on
>>>>> in conjunction with David Li. Please send any questions or feedback. For
>>>>> sanitizer folks, one area of feedback is on refactoring some of the *ASAN
>>>>> shadow setup code (see the Shadow Memory section).
>>>>>
>>>>
>>>> I have no strong opinion on the refactoring of the shadow setup code.
>>>> Getting such code into one common place may or may not make future
>>>> maintenance cheaper.
>>>>
>>>
>>> It was partially motivated by my looking to see which version of shadow
>>> setup to use, as they initially looked somewhat different, only to discover
>>> after carefully tracing through all the values that they were essentially
>>> identical, just structured differently (at least between ASAN and
>>> HWASAN). This is somewhat philosophical, but to me it seems better to
>>> refactor when possible.
>>>
>>>>
>>>> One question about granularity: tcmalloc's allocation granularity is 8
>>>> bytes, so by using 64- or 32-byte granules you lose some of the precision.
>>>>
>>>
>>> It depends on what you mean by precision. For a single allocation, I end up
>>> accumulating the access counts across all corresponding shadows (i.e. if
>>> >64 bytes). I also added a measure of utilization, though that is at the
>>> less precise 64-byte granularity. I'm aligning everything to 32 bytes,
>>> which with the 32-byte header means there is no aliasing of shadows between
>>> different allocations. That being said...
>>>
>>
>> I wonder if we lose something by using a malloc that is different from
>> the production one (8-byte aligned / no header vs. 32-byte aligned with a
>> 32-byte header). I mean, you will get correct results about individual
>> allocations, but you will lose the information about the cases when up to
>> 8 allocations share a cache line.
>>
>
> Just to clarify - what information are we losing? We will get individual
> access counts for each of those 8 allocations, which will now be on
> separate 64-byte lines. We aren't trying to detect the cache line sharing.
>

Yes, that's what I meant.
And it's fine to have false sharing analysis as an explicit non-goal.


> Unless one wanted to identify some info about false sharing or the like
> (which we weren't trying to measure, but I see how that could be
> interesting for other purposes).
>
>
>> (It's probably outside of your goals anyway, so probably ignore this)
>>
>>
>>
>>>
>>>
>>>> How important is that?
>>>> Another option is to use 8- or 16-byte granularity and 4-byte
>>>> (saturated?) counters.
>>>>
>>>
>>> We've discussed using smaller granularities with saturating math, along
>>> with sampling to reduce the saturation.
>>>
>>>
>>>> Lots of options to play with -- which is why my recommendation is to
>>>> start with a callback-based instrumentation (just use tsan compiler pass
>>>> for that),
>>>> figure out exactly what you need and then either implement the one best
>>>> instrumentation inline, or force the callbacks to be inlined with
>>>> (Thin)LTO.
>>>>
>>>
>>> I'm not convinced yet that using callbacks makes this difficult to play
>>> with, as it can be configured pretty easily from options, and the inline
>>> instrumentation part has been pretty trivial thus far. We'd also have to
>>> build in the counter update consolidation in either case (either with fewer
>>> counter updates or with fewer/aggregated callbacks).
>>>
>>
>> I am probably just nitpicking about callbacks. It's not hard either way.
>> It's just that I've been using the tsan instrumentation as the basis of
>> multiple similar experiments
>> and found it to be easier than hacking the compiler for the N-th time.
>>
>>
>>>
>>>
>>>> OTOH, with smaller granularity you lose most of the "counter update
>>>> consolidation" optimization.
>>>>
>>>
>>> Some of it (at least where we would consolidate updates to different
>>> fields using alignment guarantees). But we'd still be able to aggregate
>>> across potentially aliasing calls, including out of loop bodies, where the
>>> updates are to the same field.
>>>
>>>
>>>>
>>>> Eagerly looking forward to using this profiler!
>>>>
>>>
>>> Thanks!
>>> Teresa
>>>
>>>
>>>>
>>>> --kcc
>>>>
>>>>>
>>>>> Thanks,
>>>>> Teresa
>>>>>
>>>>>
>>>>> RFC: Sanitizer-based Heap Profiler
>>>>> Summary
>>>>>
>>>>> This document provides an overview of an LLVM Sanitizer-based heap
>>>>> profiler design.
>>>>> Motivation
>>>>>
>>>>> The objective of heap memory profiling is to collect critical runtime
>>>>> information associated with heap memory references, along with information
>>>>> on heap memory allocations. The profile information will be used first for
>>>>> tooling, and subsequently to guide the compiler optimizer and allocation
>>>>> runtime to lay out heap objects with improved spatial locality. As a
>>>>> result, DTLB and cache utilization will improve, and program IPC
>>>>> (performance) will increase due to reduced TLB and cache misses. More
>>>>> details on the heap-profile-guided optimizations will be shared in the
>>>>> future.
>>>>> Overview
>>>>>
>>>>> The profiler is based on compiler-inserted instrumentation of load and
>>>>> store accesses, and utilizes runtime support to monitor heap allocations
>>>>> and profile data. The target consumer of the heap memory profile
>>>>> information is initially tooling and ultimately automatic data layout
>>>>> optimizations performed by the compiler and/or allocation runtime (with the
>>>>> support of new allocation runtime APIs).
>>>>>
>>>>> Each memory address is mapped to Shadow Memory
>>>>> <https://en.wikipedia.org/wiki/Shadow_memory>, similar to the
>>>>> approach used by the Address Sanitizer
>>>>> <https://github.com/google/sanitizers/wiki/AddressSanitizer> (ASAN).
>>>>> Unlike ASAN, which maps each 8 bytes of memory to 1 byte of shadow, the
>>>>> heap profiler maps 64 bytes of memory to 8 bytes of shadow. The shadow
>>>>> location implements the profile counter (incremented on accesses to the
>>>>> corresponding memory). This granularity was chosen to help avoid counter
>>>>> overflow, but mapping 32 bytes of memory to 4 bytes of shadow may also be
>>>>> worth considering.
>>>>> To avoid aliasing of shadow memory for different allocations, we must
>>>>> choose a minimum alignment carefully. As discussed further below, we can
>>>>> attain a 32-byte minimum alignment, instead of a 64-byte alignment, by
>>>>> storing necessary heap information for each allocation in a 32-byte header
>>>>> block.
>>>>>
>>>>> The compiler instruments each load and store to increment the
>>>>> associated shadow memory counter, in order to determine hotness.
>>>>>
>>>>> The heap profiler runtime is responsible for tracking allocations and
>>>>> deallocations, including the stack at each allocation, and information such
>>>>> as the allocation size and other statistics. I have implemented a
>>>>> prototype built on a stripped-down and modified version of ASAN; however,
>>>>> the final version will be a separate library utilizing sanitizer_common
>>>>> components.
>>>>> Compiler
>>>>>
>>>>> A simple HeapProfiler instrumentation pass instruments interesting
>>>>> memory accesses (loads, stores, atomics) with a simple load, increment, and
>>>>> store of the associated shadow memory location (computed via a mask and
>>>>> shift that map 64 bytes of memory to 8 bytes of shadow, plus an add of the
>>>>> shadow offset). The handling is very similar to, and based on, the ASAN
>>>>> instrumentation pass, with slightly different instrumentation code.
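>>>>>
>>>>> For illustration, here is a minimal C++ sketch of the mapping arithmetic
>>>>> (names and constants are placeholders; the pass emits this inline rather
>>>>> than calling helpers):
>>>>>
>>>>> #include <cstdint>
>>>>>
>>>>> // Minimal sketch, not the actual pass output. The shadow base is chosen
>>>>> // dynamically by the runtime at startup; 0 here is a placeholder.
>>>>> static uintptr_t dynamic_shadow_base = 0;
>>>>>
>>>>> static inline uint64_t *MemToShadow(uintptr_t addr) {
>>>>>   // Mask to the start of the 64-byte granule, then shift right by 3:
>>>>>   // each 64 bytes of memory map to one 8-byte shadow counter.
>>>>>   return reinterpret_cast<uint64_t *>(((addr & ~uintptr_t(63)) >> 3) +
>>>>>                                       dynamic_shadow_base);
>>>>> }
>>>>>
>>>>> static inline void RecordAccess(uintptr_t addr) {
>>>>>   // The inserted instrumentation is this load, increment, store.
>>>>>   *MemToShadow(addr) += 1;
>>>>> }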
>>>>>
>>>>> Various techniques can be used to reduce the overhead by aggressively
>>>>> coalescing counter updates: e.g., given the 32-byte alignment guarantee,
>>>>> updates for accesses known to be in the same 32-byte block can be combined,
>>>>> as can updates across possibly aliasing accesses, since we don’t care about
>>>>> the dereferenced values. A small example of this opportunity follows.
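>>>>>
>>>>> For example (illustrative only, assuming p is the start of a heap
>>>>> allocation and thus 32-byte aligned):
>>>>>
>>>>> struct Point {
>>>>>   long x;  // offset 0
>>>>>   long y;  // offset 8: same 32-byte block as x
>>>>> };
>>>>>
>>>>> long Sum(const Point *p) {
>>>>>   // Naive: RecordAccess(&p->x); RecordAccess(&p->y);  (two updates)
>>>>>   // Coalesced: both fields share one shadow counter, so a single
>>>>>   // "*MemToShadow((uintptr_t)p) += 2;" suffices.
>>>>>   return p->x + p->y;
>>>>> }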
>>>>>
>>>>> Additionally, the Clang driver needs to be set up to link with the
>>>>> runtime library, much as it does with the sanitizers.
>>>>>
>>>>> A -fheapprof option is added to enable the instrumentation pass and
>>>>> runtime library linking. Similar to -fprofile-generate, -fheapprof
>>>>> will accept an argument specifying the directory in which to write the
>>>>> profile.
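>>>>>
>>>>> For example, usage would look something like the following (hypothetical
>>>>> invocation; the exact syntax may change during review):
>>>>>
>>>>>   clang++ -fheapprof=/tmp/heapprof -o myapp myapp.cpp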
>>>>> Runtime
>>>>>
>>>>> The heap profiler runtime is responsible for tracking and reporting
>>>>> information about heap allocations and accesses, aggregated by allocation
>>>>> calling context: for example, hotness, lifetime, and cpu affinity.
>>>>>
>>>>> A new heapprof library will be created within compiler-rt. It will
>>>>> leverage support within sanitizer_common, which already contains facilities
>>>>> like stack context tracking, needed by the heap profiler.
>>>>> Shadow Memory
>>>>>
>>>>> There are some basic facilities in sanitizer_common for mmap’ing the
>>>>> shadow memory, but most of the existing setup lives in the ASAN and HWASAN
>>>>> libraries. In the case of ASAN, there is support for both statically
>>>>> assigned shadow offsets (the default on most platforms), and for
>>>>> dynamically assigned shadow memory (implemented for Windows and currently
>>>>> also used for Android and iOS). According to kcc, recent experiments show
>>>>> that the performance with a dynamic shadow is close to that with a static
>>>>> mapping. In fact, that is the only approach currently used by HWASAN. Given
>>>>> the simplicity, the heap profiler will be implemented with a dynamic shadow
>>>>> as well.
>>>>>
>>>>> There are a number of functions in ASAN and HWASAN related to setup of
>>>>> the shadow that are duplicated but very nearly identical, at least for
>>>>> Linux (which seems to be the only OS flavor currently supported for
>>>>> HWASAN), e.g. ReserveShadowMemoryRange, ProtectGap, and
>>>>> FindDynamicShadowStart (in ASAN there is another nearly identical copy in
>>>>> PremapShadow, used by Android, whereas in HWASAN the premap handling is
>>>>> already unified with the non-premap handling). Rather than make yet
>>>>> another copy of these mechanisms, I propose refactoring them into
>>>>> sanitizer_common versions. Like HWASAN, the initial version of the heap
>>>>> profiler will support Linux only, but other OSes can be added as needed,
>>>>> similar to ASAN.
>>>>> StackTrace and StackDepot
>>>>>
>>>>> The sanitizer already contains support for obtaining and representing
>>>>> a stack trace in a StackTrace object, and storing it in the StackDepot
>>>>> which “efficiently stores huge amounts of stack traces”. This is in the
>>>>> sanitizer_common subdirectory and the support is shared by ASAN and
>>>>> ThreadSanitizer. The StackDepot is essentially an unbounded hash table,
>>>>> where each StackTrace is assigned a unique id. ASAN stores this id in the
>>>>> alloc_context_id field in each ChunkHeader (in the redzone preceding each
>>>>> allocation). Additionally, there is support for symbolizing and printing
>>>>> StackTrace objects.
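>>>>>
>>>>> As a minimal sketch, the heap profiler can reuse these interfaces roughly
>>>>> as follows (the wrapper names are hypothetical; the depot APIs exist
>>>>> today):
>>>>>
>>>>> #include "sanitizer_common/sanitizer_stackdepot.h"
>>>>> #include "sanitizer_common/sanitizer_stacktrace.h"
>>>>>
>>>>> using namespace __sanitizer;
>>>>>
>>>>> // At allocation: deduplicate the captured trace and return the unique id
>>>>> // stored in the chunk header's alloc_context_id field.
>>>>> u32 RecordAllocStack(StackTrace stack) {
>>>>>   return StackDepotPut(stack);
>>>>> }
>>>>>
>>>>> // At reporting time: recover and print the symbolized trace.
>>>>> void PrintAllocStack(u32 id) {
>>>>>   StackTrace stack = StackDepotGet(id);
>>>>>   stack.Print();
>>>>> }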
>>>>> ChunkHeader
>>>>>
>>>>> The heap profiler needs to track several pieces of information for
>>>>> each allocation. Given the mapping of 64 bytes of memory to 8 bytes of
>>>>> shadow, we can achieve a minimum of 32-byte alignment by holding this
>>>>> information in a 32-byte header block preceding each allocation.
>>>>>
>>>>> In ASAN, each allocation is preceded by a 16-byte ChunkHeader. It
>>>>> contains information about the current allocation state, user requested
>>>>> size, allocation and free thread ids, the allocation context id
>>>>> (representing the call stack at allocation, assigned by the StackDepot as
>>>>> described above), and misc other bookkeeping. For heap profiling, this will
>>>>> be converted to a 32-byte header block.
>>>>>
>>>>> Note that we could instead use the metadata section, similar to other
>>>>> sanitizers, which is stored in a separate location. However, as described
>>>>> above, storing the header block with each allocation enables 32-byte
>>>>> alignment without aliasing shadow counters for the same 64 bytes of memory.
>>>>>
>>>>> In the prototype heap profiler implementation, the header contains the
>>>>> following fields:
>>>>>
>>>>> // Should be 32 bytes
>>>>> struct ChunkHeader {
>>>>>   // 1st 4 bytes
>>>>>   // Carry over from ASAN (available, allocated, quarantined). Will be
>>>>>   // reduced to 1 bit (available or allocated).
>>>>>   u32 chunk_state : 8;
>>>>>   // Carry over from ASAN. Used to determine the start of the user
>>>>>   // allocation.
>>>>>   u32 from_memalign : 1;
>>>>>   // 23 bits available
>>>>>
>>>>>   // 2nd 4 bytes
>>>>>   // Carry over from ASAN (comment copied verbatim).
>>>>>   // This field is used for small sizes. For large sizes it is equal to
>>>>>   // SizeClassMap::kMaxSize and the actual size is stored in the
>>>>>   // SecondaryAllocator's metadata.
>>>>>   u32 user_requested_size : 29;
>>>>>
>>>>>   // 3rd 4 bytes
>>>>>   u32 cpu_id;  // Allocation cpu id
>>>>>
>>>>>   // 4th 4 bytes
>>>>>   // Allocation timestamp in ms from a baseline timestamp computed at
>>>>>   // the start of profiling (to keep this within 32 bits).
>>>>>   u32 timestamp_ms;
>>>>>
>>>>>   // 5th and 6th 4 bytes
>>>>>   // Carry over from ASAN. Used to identify the allocation stack trace.
>>>>>   u64 alloc_context_id;
>>>>>
>>>>>   // 7th and 8th 4 bytes
>>>>>   // UNIMPLEMENTED in prototype - needs instrumentation and IR support.
>>>>>   u64 data_type_id;  // hash of type name
>>>>> };
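>>>>>
>>>>> Since the 32-byte alignment scheme depends on the header being exactly 32
>>>>> bytes, a compile-time check may be worthwhile, e.g.:
>>>>>
>>>>> // Guard the layout assumption that the shadow scheme relies on.
>>>>> static_assert(sizeof(ChunkHeader) == 32, "header must stay 32 bytes");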
>>>>>
>>>>> As noted, the chunk state can be reduced to a single bit (no need for
>>>>> quarantined memory in the heap profiler). The header contains a placeholder
>>>>> for the data type hash, which is not yet implemented as it needs
>>>>> instrumentation and IR support.
>>>>> Heap Info Block (HIB)
>>>>>
>>>>> On a deallocation, information from the corresponding shadow block(s)
>>>>> and header are recorded in a Heap Info Block (HIB) object. The access count
>>>>> is computed from the shadow memory locations for the allocation, as well as
>>>>> the percentage of accessed 64-byte blocks (i.e. the percentage of non-zero
>>>>> 8-byte shadow locations for the whole allocation). Other information, such
>>>>> as the deallocation timestamp (for lifetime computation) and the
>>>>> deallocation cpu id (to determine migrations), is recorded along with the
>>>>> information from the chunk header captured at allocation.
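>>>>>
>>>>> A rough sketch of that readback (names are hypothetical, reusing the
>>>>> MemToShadow sketch from the Compiler section):
>>>>>
>>>>> #include <cstddef>
>>>>> #include <cstdint>
>>>>>
>>>>> uint64_t *MemToShadow(uintptr_t addr);  // from the earlier sketch
>>>>>
>>>>> struct ShadowSummary {
>>>>>   uint64_t access_count = 0;      // sum of all 8-byte shadow counters
>>>>>   uint32_t percent_utilized = 0;  // % of 64-byte granules ever accessed
>>>>> };
>>>>>
>>>>> // 'beg' is the user address (32-byte aligned), 'size' the allocation size.
>>>>> ShadowSummary SummarizeShadow(uintptr_t beg, size_t size) {
>>>>>   ShadowSummary s;
>>>>>   uint64_t granules = 0, touched = 0;
>>>>>   uintptr_t g_beg = beg & ~uintptr_t(63);
>>>>>   uintptr_t g_end = (beg + size + 63) & ~uintptr_t(63);
>>>>>   for (uintptr_t p = g_beg; p < g_end; p += 64, ++granules) {
>>>>>     uint64_t c = *MemToShadow(p);
>>>>>     s.access_count += c;
>>>>>     if (c != 0) ++touched;
>>>>>   }
>>>>>   if (granules) s.percent_utilized = (touched * 100) / granules;
>>>>>   return s;
>>>>> }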
>>>>>
>>>>> The prototyped HIB object tracks the following:
>>>>>
>>>>> struct HeapInfoBlock {
>>>>>   // Total allocations at this stack context.
>>>>>   u32 alloc_count;
>>>>>   // Access count computed from all allocated 64-byte blocks (track the
>>>>>   // total across all allocations, and the min and max).
>>>>>   u64 total_access_count, min_access_count, max_access_count;
>>>>>   // Allocated size (track the total across all allocations, and the min
>>>>>   // and max).
>>>>>   u64 total_size;
>>>>>   u32 min_size, max_size;
>>>>>   // Lifetime (track the total across all allocations, and the min and
>>>>>   // max).
>>>>>   u64 total_lifetime;
>>>>>   u32 min_lifetime, max_lifetime;
>>>>>   // Percent utilization of allocated 64-byte blocks (track the total
>>>>>   // across all allocations, and the min and max). The utilization is
>>>>>   // defined as the percentage of 8-byte shadow counters corresponding to
>>>>>   // the full allocation that are non-zero.
>>>>>   u64 total_percent_utilized;
>>>>>   u32 min_percent_utilized, max_percent_utilized;
>>>>>   // Allocation and deallocation timestamps from the most recent merge
>>>>>   // into the table with this stack context.
>>>>>   u32 alloc_timestamp, dealloc_timestamp;
>>>>>   // Allocation and deallocation cpu ids from the most recent merge into
>>>>>   // the table with this stack context.
>>>>>   u32 alloc_cpu_id, dealloc_cpu_id;
>>>>>   // Count of allocations at this stack context that had a different
>>>>>   // allocation and deallocation cpu id.
>>>>>   u32 num_migrated_cpu;
>>>>>   // Number of times the lifetime of the entry being merged overlapped
>>>>>   // with that of the previous entry merged with this stack context (by
>>>>>   // comparing the new alloc/dealloc timestamps with those last recorded
>>>>>   // in the table entry).
>>>>>   u32 num_lifetime_overlaps;
>>>>>   // Number of times the alloc/dealloc cpu of the entry being merged was
>>>>>   // the same as that of the previous entry merged with this stack context.
>>>>>   u32 num_same_alloc_cpu;
>>>>>   u32 num_same_dealloc_cpu;
>>>>>   // Hash of type name (UNIMPLEMENTED). This needs instrumentation support
>>>>>   // and possibly IR changes.
>>>>>   u64 data_type_id;
>>>>> };
>>>>> HIB Table
>>>>>
>>>>> The Heap Info Block Table, a multi-way associative cache, holds HIB
>>>>> objects for deallocated memory. It is indexed by the stack allocation
>>>>> context id from the chunk header, and currently uses a simple mod by a
>>>>> prime number close to a power of two as the hash (because of the way the
>>>>> stack context ids are assigned, a mod by a power of two performs very
>>>>> poorly). Thus far, only 4-way associativity has been evaluated.
>>>>>
>>>>> HIB entries are added or merged into the HIB Table on each
>>>>> deallocation. If an entry with a matching stack alloc context id is found
>>>>> in the Table, the newly deallocated information is merged into the existing
>>>>> entry. Each HIB Table entry currently tracks the min, max and total value
>>>>> of the various fields for use in computing and reporting the min, max and
>>>>> average when the Table is ultimately dumped.
>>>>>
>>>>> If no entry with a matching stack alloc context id is found, a new
>>>>> entry is created. If this causes an eviction, the evicted entry is dumped
>>>>> immediately (by default to stderr, otherwise to a specified report file).
>>>>> Later post-processing can merge dumped entries with the same stack alloc
>>>>> context id. A sketch of this lookup/merge flow follows.
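>>>>>
>>>>> A hypothetical sketch of that flow (the table geometry, the prime, and the
>>>>> helper names MergeInto, ChooseVictim, and DumpEntry are illustrative, not
>>>>> the prototype's actual values):
>>>>>
>>>>> #include <cstdint>
>>>>>
>>>>> // Stand-in for the HeapInfoBlock defined above (fields elided).
>>>>> struct HeapInfoBlock { /* ... stats fields ... */ };
>>>>>
>>>>> constexpr uint32_t kWays = 4;          // 4-way associative
>>>>> constexpr uint64_t kNumSets = 131071;  // prime close to 2^17
>>>>>
>>>>> HeapInfoBlock table[kNumSets * kWays];
>>>>> uint64_t table_ids[kNumSets * kWays];  // stack context id keys; 0 == empty
>>>>>
>>>>> void MergeInto(HeapInfoBlock &dst, const HeapInfoBlock &src);  // min/max/totals
>>>>> uint32_t ChooseVictim(uint64_t set);     // eviction policy within a set
>>>>> void DumpEntry(const HeapInfoBlock &e);  // immediate raw-format dump
>>>>>
>>>>> void AddOrMerge(const HeapInfoBlock &hib, uint64_t id) {
>>>>>   // Index by a simple mod with a prime close to a power of two; stack
>>>>>   // context ids hash very poorly with a power-of-two mod.
>>>>>   uint64_t set = id % kNumSets;
>>>>>   for (uint32_t w = 0; w < kWays; ++w) {  // merge if the id is present
>>>>>     uint64_t slot = set * kWays + w;
>>>>>     if (table_ids[slot] == id) { MergeInto(table[slot], hib); return; }
>>>>>   }
>>>>>   for (uint32_t w = 0; w < kWays; ++w) {  // else take a free slot
>>>>>     uint64_t slot = set * kWays + w;
>>>>>     if (table_ids[slot] == 0) { table_ids[slot] = id; table[slot] = hib; return; }
>>>>>   }
>>>>>   uint64_t slot = set * kWays + ChooseVictim(set);  // else evict
>>>>>   DumpEntry(table[slot]);  // evicted entries are dumped immediately
>>>>>   table_ids[slot] = id;
>>>>>   table[slot] = hib;
>>>>> }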
>>>>> Initialization
>>>>>
>>>>> For ASAN, an __asan_init function initializes the memory allocation
>>>>> tracking support, and the ASAN instrumentation pass in LLVM creates a
>>>>> global constructor to invoke it. The heap profiler prototype adds a new
>>>>> __heapprof_init function, which performs heap-profile-specific
>>>>> initialization; the heap profile instrumentation pass calls this new init
>>>>> function instead, via a generated global constructor. It currently also
>>>>> invokes __asan_init, since we are leveraging a modified ASAN runtime.
>>>>> Eventually, this should be changed to initialize the refactored common
>>>>> support.
>>>>>
>>>>> Note that __asan_init is also placed in the .preinit_array when that is
>>>>> available, so it is invoked even earlier than global constructors.
>>>>> Currently, it is not possible to do this for __heapprof_init, because it
>>>>> calls timespec_get to obtain a baseline timestamp (as described in the
>>>>> ChunkHeader comments, the timestamps (ms) are actually offsets from this
>>>>> baseline, in order to fit into 32 bits), and system calls cannot be made
>>>>> that early (dl_init is not complete). Since the constructor priority is 1,
>>>>> it should be executed early enough that there are very few allocations
>>>>> before it runs; likely the best solution is to simply ignore any
>>>>> allocations before initialization. A sketch of the timestamp scheme
>>>>> follows.
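>>>>>
>>>>> A minimal sketch of that scheme (names are illustrative):
>>>>>
>>>>> #include <stdint.h>
>>>>> #include <time.h>
>>>>>
>>>>> static uint64_t baseline_ms;  // captured once in __heapprof_init
>>>>>
>>>>> static uint64_t NowMs() {
>>>>>   timespec ts;
>>>>>   // This call is why init cannot run from .preinit_array.
>>>>>   timespec_get(&ts, TIME_UTC);
>>>>>   return uint64_t(ts.tv_sec) * 1000 + uint64_t(ts.tv_nsec) / 1000000;
>>>>> }
>>>>>
>>>>> void InitBaselineTimestamp() { baseline_ms = NowMs(); }
>>>>>
>>>>> // Value recorded in ChunkHeader::timestamp_ms; the offset from the
>>>>> // baseline fits in 32 bits (~49 days).
>>>>> uint32_t TimestampMs() { return uint32_t(NowMs() - baseline_ms); }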
>>>>> Dumping
>>>>>
>>>>> For the prototype, the profile is dumped as text in a compact raw
>>>>> format to limit its size. Ultimately it should be dumped in a more compact
>>>>> binary format (i.e. into a different section of the raw instrumentation-
>>>>> based profile, with llvm-profdata performing post-processing); the exact
>>>>> format is TBD.
>>>>> HIB Dumping
>>>>>
>>>>> As noted earlier, HIB Table entries are created as memory is
>>>>> deallocated. At the end of the run (or whenever dumping is requested,
>>>>> discussed later), HIB entries need to be created for allocations that are
>>>>> still live. Conveniently, the sanitizer allocator already contains a
>>>>> mechanism to walk through all chunks of memory it is tracking (
>>>>> ForEachChunk). The heap profiler simply looks for all chunks with a
>>>>> chunk state of allocated, creates a HIB for each just as would be done on
>>>>> deallocation, and adds each to the table.
>>>>>
>>>>> A HIB Table mechanism for printing each entry is then invoked.
>>>>>
>>>>> By default, the dumping occurs:
>>>>>
>>>>>    - on evictions
>>>>>    - for the full table at exit (when the static Allocator object is
>>>>>    destructed)
>>>>>
>>>>> For running in a load testing scenario, we will want to add a
>>>>> mechanism to provoke finalization (merging currently live allocations) and
>>>>> dumping of the HIB Table before exit. This would be similar to the
>>>>> __llvm_profile_dump facility used for normal PGO counter dumping.
>>>>> Stack Trace Dumping
>>>>>
>>>>> There is existing support for dumping symbolized StackTrace objects. A
>>>>> wrapper to dump all StackTrace objects in the StackDepot will be added.
>>>>> This new interface is invoked just after the HIB Table is dumped (on exit
>>>>> or via the dumping interface).
>>>>> Memory Map Dumping
>>>>>
>>>>> In cases where we may want to symbolize as a post-processing step, we
>>>>> may need the memory map (from /proc/self/smaps). Specifically, this is
>>>>> needed to symbolize binaries using ASLR (Address Space Layout
>>>>> Randomization). There is already support for reading this file and dumping
>>>>> it to the specified report output file (DumpProcessMap()). This is invoked
>>>>> when the profile output file is initialized (HIB Table construction), so
>>>>> that the memory map is available at the top of the raw profile.
>>>>> Current Status and Next Steps
>>>>>
>>>>> As mentioned earlier, I have a working prototype based on a simplified,
>>>>> stripped-down version of ASAN. My current plan is to do the following:
>>>>>
>>>>>    1.
>>>>>
>>>>>    Refactor out some of the shadow setup code common between ASAN and
>>>>>    HWASAN into sanitizer_common.
>>>>>    2.
>>>>>
>>>>>    Rework my prototype into a separate heapprof library in
>>>>>    compiler-rt, using sanitizer_common support where possible, and send
>>>>>    patches for review.
>>>>>    3.
>>>>>
>>>>>    Send patches for the heap profiler instrumentation pass and
>>>>>    related clang options.
>>>>>    4.
>>>>>
>>>>>    Design/implement binary profile format
>>>>>
>>>>>
>>>>> --
>>>>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>>>>>
>>>>
>>>
>>> --
>>> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>>>
>>
>
> --
> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>