[llvm-dev] RFC: A binary serialization format for MemProf
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Wed Sep 29 17:46:31 PDT 2021
FYI you can also view the RFC here:
https://groups.google.com/g/llvm-dev/c/h1DvHguLpxU which displays the
diagrams better (without extra-wide spacing).
David
On Wed, Sep 29, 2021 at 3:17 PM Snehasish Kumar <snehasishk at google.com>
wrote:
> This RFC contains the following:
>
> * Proposal to introduce a new raw binary serialization format for heap
> allocation profiles
>
> * Proposal to extend the PGO indexed format to hold heap allocation
> profiles
> We look forward to your feedback on the proposals.
> Authors: snehasishk at google.com, davidxl at google.com, tejohnson at google.com
>
> Introduction
>
> ------------
>
> The design of a sanitizer-based heap profiler (MemProf) was shared with
> llvm-dev in June 2020 [1]. Since then, Teresa (@tejohnson) has added a
> sanitizer-based heap profiler which can be enabled with -fmemory-profile.
> Today it writes out the data in a text format which can be inspected by
> users. We have used this to drive analyses of heap behavior at Google.
> This RFC shares the details of a binary serialization format for heap
> profiling data which can then be reused by the compiler to guide
> optimizations, similar to traditional PGO.
>
> Similar to the existing instrumentation-based PGO, the binary heap profile
> data for PGHO has two forms. One is the raw format written by the profiler
> runtime, and the other is the indexed format used by the compiler. The
> indexed profile is generated offline from the raw profile data by
> llvm-profdata. This allows a single binary profile file to hold both the
> PGO and Memprof profiling data. Fig 1 below shows the binary format
> generation and use.
>
>
> ┌──────────────────┐  Raw Profile  ┌───────────────┐   Indexed Profile (v8)
> │ Compiler Runtime ├─────────────►│ llvm-profdata ├───► with Memprof data ───► -fprofile-use
> └──────────────────┘               └───────────────┘
>
> Fig 1: Memprof binary profile lifecycle
>
> Raw Binary Format
>
> -----------------
>
> The raw binary format contains four sections:
>
> 1. Memprof raw profile header
>
> 2. Memory mapping layout
>
> 3. Memory Info Block (MIB) records
>
> 4. Call stack information
>
>             +----------------------+
>             |        Magic         |
>             +----------------------+  H
>             |       Version        |  E
>             +----------------------+  A
>             |      Total Size      |  D
>             +----------------------+  E
>   +---------|      Map Offset      |  R
>   |         +----------------------+
>   |  +------|      MIB Offset      |
>   |  |      +----------------------+
>   |  |      |  Call Stack Offset   |---------+
>   |  |      +----------------------+         |
>   +-------->|      Number of       |  M      |
>      |      |     Map Entries      |  A      |
>      |      +----------------------+  P      |
>      |      |                      |         |
>      |      |      Map Entry       |  S      |
>      |      +----------------------+  E      |
>      |      |         ...          |  C      |
>      |      +----------------------+  T      |
>      |      |                      |  I      |
>      |      |      Map Entry       |  O      |
>      |      +----------------------+  N      |
>      |                                       |
>      |      +----------------------+         |
>      +----->|      Number of       |  M      |
>             |     MIB Entries      |  I      |
>             +----------------------+  B      |
>             |                      |         |
>             |      MIB Entry       |  S      |
>             +----------------------+  E      |
>             |         ...          |  C      |
>             +----------------------+  T      |
>             |                      |  I      |
>             |      MIB Entry       |  O      |
>             +----------------------+  N      |
>                                              |
>             +----------------------+         |
>             |      Number of       |<--------+
>             |  Call Stack Entries  |  S  S
>             +----------------------+  T  E
>             |   Call Stack Entry   |  A  C
>             +----------------------+  C  T
>             |         ...          |  K  I
>             +----------------------+     O
>             |   Call Stack Entry   |     N
>             +----------------------+
>
> Fig 2: Memprof Raw Format
>
>
> * Memprof raw profile header
>
> The header consists of a unique identifier (magic), a version number, the
> total size of the profile, as well as an offset into the profile for each
> of the following sections: memory mapping layout, memprof profile entries
> and call stack information. We do not intend to maintain backwards
> compatibility for the raw binary format.
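>
> As a rough illustration (not the actual runtime struct; the names below
> are hypothetical), the header can be viewed as six 8-byte fields, which
> matches the 48-byte header size reported in the "Raw Profile
> Characteristics" section below:
>
> #include <cstdint>
>
> // Hypothetical layout: 6 x 8-byte fields = 48 bytes.
> struct MemprofRawHeader {
>   uint64_t Magic;       // Unique identifier for memprof raw profiles.
>   uint64_t Version;     // Raw format version; no backwards compatibility.
>   uint64_t TotalSize;   // Total size of the profile in bytes.
>   uint64_t MapOffset;   // Byte offset of the memory mapping section.
>   uint64_t MIBOffset;   // Byte offset of the MIB records section.
>   uint64_t StackOffset; // Byte offset of the call stack section.
> };
> static_assert(sizeof(MemprofRawHeader) == 48, "expected 48-byte header");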
>
> * Memory mapping layout
>
> The process memory mappings for the executable segment during profiling
> are stored in this section. This allows symbolization during post
> processing for binaries which are built with position independent code.
> For now, all read-only executable mappings are recorded; in the future,
> mappings for heap data can also potentially be stored. For each mapping,
> we record the following (a sketch of one possible entry layout is shown
> after this list):
>
>   - Start (virtual address)
>   - End (virtual address)
>   - Offset (into the file it was mmap-ed from)
>   - BuildId (linker-generated hash, -Wl,--build-id)
>
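> For illustration only, here is one possible entry layout consistent with
> the fields above and the 64-byte entry size reported later; the 32-byte
> build id field width and all names are assumptions:
>
> #include <cstdint>
>
> // Hypothetical 64-byte map entry.
> struct MemprofMapEntry {
>   uint64_t Start;       // Start virtual address of the mapping.
>   uint64_t End;         // End virtual address of the mapping.
>   uint64_t Offset;      // Offset into the file it was mmap-ed from.
>   uint8_t BuildId[32];  // Linker-generated build id (-Wl,--build-id).
>   uint64_t BuildIdSize; // Number of valid bytes in BuildId.
> };
> static_assert(sizeof(MemprofMapEntry) == 64, "expected 64-byte map entry");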
>
> * MIB Records
>
> The profile information we collect is currently defined in [2]. The
> section begins with an 8-byte entry containing the number of profile
> entries in this section. Each entry is uniquely identified by the dynamic
> calling context of the allocation site. For each entry, various metrics
> such as access counts, allocation sizes and object lifetimes are computed
> via profiling. This section may contain multiple entries identified by the
> same callstack id. Subsequent processing to convert and merge multiple raw
> profiles will deduplicate any such entries.
>
> * Call stack information
>
> Each memprof profile entry is uniquely identified by its dynamic calling
> context. In this section we record the identifier and its corresponding
> call stack trace. We use the identifier provided by the sanitizer stack
> depot and serialize the trace for each one without deduplication. The
> section begins with an 8-byte entry containing the number of call stack
> entries. Each call stack entry contains an 8-byte field which denotes how
> many frames are recorded in the entry. Each frame is identified by an
> 8-byte program counter which holds the virtual address of the call
> instruction minus 1. Further deduplication is possible, though we do not
> do so at this time.
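>
> A sketch of how a reader might walk this section under the layout just
> described (an 8-byte count, then per entry an 8-byte stack id, an 8-byte
> frame count and one 8-byte PC per frame); this is an illustration, not
> the actual reader:
>
> #include <cstdint>
> #include <map>
> #include <vector>
>
> std::map<uint64_t, std::vector<uint64_t>>
> readCallStacks(const uint64_t *Ptr) {
>   std::map<uint64_t, std::vector<uint64_t>> Stacks;
>   uint64_t NumEntries = *Ptr++;
>   for (uint64_t I = 0; I < NumEntries; ++I) {
>     uint64_t StackId = *Ptr++;   // Sanitizer stack depot identifier.
>     uint64_t NumFrames = *Ptr++; // Frames recorded for this entry.
>     std::vector<uint64_t> PCs(Ptr, Ptr + NumFrames);
>     Ptr += NumFrames;            // PCs hold call address - 1.
>     Stacks.emplace(StackId, std::move(PCs));
>   }
>   return Stacks;
> }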
>
> * Raw Profile Characteristics
>
> To understand the characteristics of raw binary profiles generated by
> Memprof, we experimented with a clang bootstrap build. We ran 500
> invocations of Memprof-instrumented clang on clang source code. Each
> invocation produced a raw binary profile and we present some aggregate
> information about them below:
>
> +---------------------------------+--------+---------+---------+
> |                                 |    Min |  Median |     Max |
> +---------------------------------+--------+---------+---------+
> | Unique allocation contexts      |    940 |   10661 |   35355 |
> +---------------------------------+--------+---------+---------+
> | MIB records size (bytes)        | 101528 | 1151396 | 3818348 |
> +---------------------------------+--------+---------+---------+
> | Call stack section size (bytes) |  31080 |  419048 | 1439144 |
> +---------------------------------+--------+---------+---------+
> | Total Size (bytes)              | 133336 | 1571680 | 5258220 |
> +---------------------------------+--------+---------+---------+
>
> The size of the header is 48 bytes and the number of read-only executable
> memory maps is usually small (12 in this case), with each map entry
> consuming 64 bytes. We find that the raw profiles are highly compressible:
> on average, files are compressed by 81%. The largest raw profile in the
> dataset is ~5M in size; it is compressed to 975K using zip on default
> settings. In contrast, for the same clang build, the instrumented PGO raw
> profile is ~21M in size (zip compresses it by 73%). Note that the Memprof
> profile size is proportional to the number of allocation contexts
> encountered during profiling.
>
> Since these are profiles from individual invocations, they must be merged
> before use. This is performed implicitly by llvm-profdata during the
> conversion to the indexed profile format. MIBs are merged based on their
> call stack, as sketched below.
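>
> As an illustration of the merge step (the actual llvm-profdata logic may
> differ), two entries with the same call stack could be combined roughly as
> follows; the fields mirror a subset of the schema tags proposed later, and
> merged averages are reweighted by allocation count:
>
> #include <algorithm>
> #include <cstdint>
>
> struct MIB {
>   uint64_t AllocCount;
>   uint64_t MinSize, MaxSize, AveSize;
> };
>
> // Merge B into A, assuming both describe the same allocation context.
> void mergeMIB(MIB &A, const MIB &B) {
>   A.MinSize = std::min(A.MinSize, B.MinSize);
>   A.MaxSize = std::max(A.MaxSize, B.MaxSize);
>   A.AveSize = (A.AveSize * A.AllocCount + B.AveSize * B.AllocCount) /
>               (A.AllocCount + B.AllocCount);
>   A.AllocCount += B.AllocCount;
> }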
>
> Memprof extensions for the indexed profile format
>
> --------------------------------------------------
>
> +----------------+
> |     MAGIC      |
> +----------------+
> |    VERSION     |
> +----------------+
> |    HASHTYPE    |
> +--+----------+--+
> | HASHTAB OFFSET |-------+
> +--+----------+--+       |
> +----------------+       |
> |                |       |
> |    PROFILE     |       |
> |    SUMMARY     |       |
> |      DATA      |       |
> +----------------+       |
> +----------------+  <----+
> |                |
> |     OnDisk     |
> |    Chained     |
> |   HashTable    |
> |                |
> +----------------+
>
> Fig 3: Existing PGO indexed profile format
>
> During the offline processing step (using llvm-profdata), allocation
> contexts are pruned and merged. The end result is a collection of unique
> allocation contexts with access, size and lifetime properties. Contexts are
> uniquely identified based on the call stack and are stored using a prefix
> deduplication scheme described in Section “Symbolized Memprof call stack
> section”.
>
> To fit into the PGO profile format, we need to index the profile by
> function name. The only functions that own Memprof profile data are the
> direct callers of the allocator primitive functions. Thus the profile data
> mapping in the IR must account for potentially missing frames. The
> implications for matching the profile data with the IR are touched upon in
> the section “Profile Data matching in IR” and will be further detailed in
> an upcoming RFC.
>
> The Memprof profile data for a particular function can then be just an
> array of MIB entries. One allocation site in the function can have
> multiple MIB entries, each corresponding to one allocation context.
>
> The changes to the existing PGO indexed format are summarized as follows
> (see the sketch after this list):
>
>   - Augment the profile record data structure to include an optional MIB
>     array after the value profile data [3].
>   - Add one additional section before the existing OnDiskChainedHashTable
>     to store the allocation stacks (referenced by the MIBs). This section
>     is placed after the profile summary data section [4].
>   - Bump the version number.
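>
> A hedged sketch of the augmented in-memory record; the existing
> InstrProfRecord members are elided and the names below are illustrative:
>
> #include <cstdint>
> #include <vector>
>
> struct MemprofInfoBlock {
>   uint64_t StackID; // Key into the new allocation stack section.
>   // ... remaining schema fields (AllocCount, AveSize, ...) ...
> };
>
> struct InstrProfRecordExt {
>   // ... existing members (counters, value profile data) ...
>   std::vector<MemprofInfoBlock> MIBs; // Optional; empty without Memprof.
> };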
>
>
>
> Memprof portable entry format
>
> -----------------------------
>
> The raw memprof profile entry format is subject to change across versions.
> However, the indexed profile entry must be backwards compatible to ensure
> that the PGO profile as a whole remains backwards compatible. We propose a
> schema-based format: a per-function description of each Memprof profile
> entry, where each field is identified by a tag. We propose the following
> schema:
>
> struct MIBMeta {
>
>   enum MIBFieldTag {
>     // The tag ids remain unchanged across versions.
>     Unused = 0,
>     StackID = 1,
>     AllocCount = 2,
>     AveSize = 3,
>     MinSize = 4,
>     MaxSize = 5,
>     AveAccessCount = 6,
>     MinAccessCount = 7,
>     MaxAccessCount = 8,
>     AveLifetime = 9,
>     MinLifetime = 10,
>     MaxLifetime = 11,
>     NumMigration = 12,
>     NumLifetimeOverlaps = 13,
>     NumSameAllocCPU = 14,
>     NumSameDeallocCPU = 15
>   };
>
>   enum MIBFieldType {
>     // Renamed from Unused to avoid clashing with MIBFieldTag::Unused.
>     UnusedType = 0,
>     UINT8 = 1,
>     UINT16 = 2,
>     UINT32 = 3, // Varint encoded
>     UINT64 = 4  // Varint encoded
>   };
> };
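>
> The RFC does not pin down the exact varint flavor for the UINT32/UINT64
> fields; a standard unsigned LEB128-style encoding, sketched below, would
> be one natural choice, since small counts then take a single byte:
>
> #include <cstdint>
> #include <vector>
>
> // Unsigned LEB128-style varint: 7 bits per byte, high bit set on all
> // but the last byte.
> void encodeVarint(uint64_t Value, std::vector<uint8_t> &Out) {
>   do {
>     uint8_t Byte = Value & 0x7f;
>     Value >>= 7;
>     if (Value)
>       Byte |= 0x80; // More bytes follow.
>     Out.push_back(Byte);
>   } while (Value);
> }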
>
>
> // Mapping of tags to their descriptive names.
> const char *MIBFieldName[] = {
>     "",
>     "StackID",
>     "AllocCount",
>     ...
>     "NumSameDeallocCPU"
> };
>
> // Mapping of tags to their types. (Named MIBFieldTypes to avoid
> // colliding with the MIBFieldType enum above.)
> uint8_t MIBFieldTypes[] = {
>     0, // unused
>     MIBMeta::MIBFieldType::UINT64,
>     ...
> };
>
> To keep the field tag, field name, and field type declarations in sync, a
> shared .inc file will be used. This file will be shared between compiler-rt
> and the llvm/lib/ProfileData libraries, since dependencies from compiler-rt
> on the rest of LLVM are discouraged for isolation.
>
> // MIBDef.inc file
> // Define MIBEntryDef(tag, name, type) before inclusion.
> MIBEntryDef(StackID = 1, "StackID", MIBMeta::MIBFieldType::UINT64)
> MIBEntryDef(AllocCount = 2, "AllocCount", MIBMeta::MIBFieldType::UINT32)
> ...
> MIBEntryDef(NumSameDeallocCPU = 15, "NumSameDeallocCPU",
>             MIBMeta::MIBFieldType::UINT8)
>
>
> enum MIBFieldTag {
>   StartTag = 0,
> #define MIBEntryDef(tag, name, type) tag,
> #include "MIBDef.inc"
> #undef MIBEntryDef
> };
>
> const char *MIBFieldName[] = {
>   "", // StartTag placeholder.
> #define MIBEntryDef(tag, name, type) name,
> #include "MIBDef.inc"
> #undef MIBEntryDef
> };
>
> uint8_t MIBFieldTypes[] = {
>   0, // StartTag placeholder.
> #define MIBEntryDef(tag, name, type) type,
> #include "MIBDef.inc"
> #undef MIBEntryDef
> };
>
> The InstrProfRecord for each function will hold the schema and an array of
> Memprof info blocks, one for each unique allocation context.
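>
> To illustrate how the schema keeps the entry format backwards compatible,
> field decoding can be driven entirely by the (tag, type) pairs stored in
> the profile, so a reader can skip tags it does not understand by size. A
> sketch, reusing the illustrative MIBMeta definitions above:
>
> #include <cstdint>
>
> // Decode one field of the given type; returns the advanced pointer.
> const uint8_t *readField(const uint8_t *Ptr, uint8_t Type,
>                          uint64_t &Value) {
>   switch (Type) {
>   case MIBMeta::MIBFieldType::UINT8:
>     Value = *Ptr;
>     return Ptr + 1;
>   case MIBMeta::MIBFieldType::UINT16:
>     Value = Ptr[0] | (uint64_t(Ptr[1]) << 8);
>     return Ptr + 2;
>   default: // UINT32/UINT64 are varint encoded; see the earlier sketch.
>     Value = 0;
>     for (unsigned Shift = 0;; Shift += 7) {
>       Value |= uint64_t(*Ptr & 0x7f) << Shift;
>       if (!(*Ptr++ & 0x80))
>         break;
>     }
>     return Ptr;
>   }
> }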
>
> Symbolized Memprof call stack section
>
> -------------------------------------
>
> This section holds the symbolized version of the call stack section in the
> raw profile. This is necessary to enable the compiler to map recorded
> runtime PC addresses to source locations during recompilation. For space
> efficiency, this section is split into three subsections: (1) a stack
> entry table, (2) a file path table, and (3) a string table.
>
> Fig 4 shows the relationship between the three tables.
>
>      STACK ENTRY             FILE PATH          STRING
>         TABLE                  TABLE             TABLE
>       ┌────────┐             ┌──┬──┬──┐         ┌─────────┐
>       │        │             │  │  │  │         │10 abc   │
>       │        │             ├──┼──┼──┤         ├─────────┤
>       │        │             │  │  │  │◄┐       │11 def   │
>       │        │             ├──┼──┼──┤ │       ├─────────┤
>       ├──┬──┬──┤             │  │  │  │ │       │12 ghi   │
>   ┌───┤03│LN│DI│             ├──┼──┼──┤ │       ├─────────┤
>   │   ├──┴──┴──┤        ┌───►│03│13│01├─┘    ┌─►│13 XY.h  │
>   │   └────────┘        │    └──┴─┬┴──┘      │  └─────────┘
>   │                     │         │          │
>   └─────────────────────┘         └──────────┘
>
> Fig 4. The Stack Entry, File Path and String Table.
>
> LN = Line Number, DI = Discriminator
>
> * Stack Entry Table: Each uniquely identified symbolized call stack
> consists of a variable number of callsites. In the indexed format, each
> callsite is represented as file:line:discriminator (as shown in Fig 4).
> Each call stack location is a 64-bit field split into three subfields: the
> file table index, the line number, and the discriminator value (one
> possible packing is illustrated below). The file table index points to a
> leaf node in the prefix encoding scheme described below.
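>
> For example, the 64-bit location could be packed as below; the subfield
> widths are not fixed by this RFC, so the 32/24/8 split here is purely
> illustrative:
>
> #include <cstdint>
>
> // Pack file table index (32 bits), line (24 bits) and discriminator
> // (8 bits) into one 64-bit call stack location.
> uint64_t packLocation(uint32_t FileTableIndex, uint32_t Line,
>                       uint32_t Discriminator) {
>   return (uint64_t(FileTableIndex) << 32) |
>          (uint64_t(Line & 0xffffff) << 8) | (Discriminator & 0xff);
> }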
>
> * File path table using prefix encoding: Full path filenames share many
> common substrings due to the directory structure, so they can be
> compressed. One simple scheme is a reverse prefix tree representation: the
> name string of a directory at each level (not including prefixes) is
> represented by a node, and each node is linked to its parent node. To
> summarize, the file path table is represented as an array of nodes
> organized as a forest of reversed tree structures.
>
> For instance, the strings
>
>   “abc/def/ghi/XY.c”
>   “abc/def/ghi/XY.h”
>   “abc/def/jkl/UV.c”
>
> are represented as
>
>                +-----------+<------+
>                |    abc    |       |
>   +----->+---->+-----------+       |
>   |      |     |    def    +-------+
>   |      |     +-----------+
>   |      +-----+    ghi    |<-------+----+
>   |            +-----------+        |    |
>   +-+--------->|    jkl    |        |    |
>     |          +-----------+        |    |
>     |          |   XY.c    +--------+    |
>     |          +-----------+             |
>     +----------+   UV.c    |             |
>                +-----------+             |
>                |   XY.h    +-------------+
>                +-----------+
>
> Fig 5. Prefix tree representation of paths in the file path table
>
> Each parent link implies a path separator ‘/’. Furthermore, we represent
> each file or directory string as an integer offset into the string table
> (see Fig 4). Thus each node holds an offset into the string table and a
> pointer to its parent (interior) directory node, as sketched below.
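>
> A sketch of one possible node layout and of path reconstruction by
> walking parent links; the names and field widths are hypothetical:
>
> #include <cstdint>
> #include <string>
> #include <vector>
>
> struct FilePathNode {
>   uint32_t StringOffset; // Offset of this component in the string table.
>   uint32_t Parent;       // Index of the parent node; NoParent for roots.
>   static constexpr uint32_t NoParent = ~0u;
> };
>
> // Rebuild a full path by walking parent links from a leaf (file) node.
> std::string buildPath(const std::vector<FilePathNode> &Nodes,
>                       const char *StringTable, uint32_t Leaf) {
>   std::string Path(StringTable + Nodes[Leaf].StringOffset);
>   // Each parent link implies a path separator '/'.
>   for (uint32_t I = Nodes[Leaf].Parent; I != FilePathNode::NoParent;
>        I = Nodes[I].Parent)
>     Path = std::string(StringTable + Nodes[I].StringOffset) + "/" + Path;
>   return Path;
> }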
>
> * String table: To remove redundancies across prefix tree nodes in the
> file path encoding, we use a string table which maps each string to a
> unique id. The id can simply be the string's implicit offset into the
> table.
>
> Thus this representation exploits the redundancy in source file path
> prefixes, since there are typically a large number of source files in a
> small number of deeply nested directories.
>
> Symbolizing the raw PC addresses in post-processing using llvm-profdata
> requires the binary to also be provided as input. As an alternative, we
> will also experiment with incrementally generating the symbolized call
> stack section as part of the raw profile dump, at the cost of increased
> profiling overhead.
>
>
> Profile Data matching in IR
>
> —--------------------------
>
> When matching the profile data, we may have already early-inlined the
> direct caller of the allocation function into its caller(s). This means we
> need to take some additional steps to identify the matching MIBs. For
> example, consider the following partial call graph:
>
>
>
>     ┌────────────────────┐              ┌────────────────────┐
>   A │ Call B; debug loc1 │            C │ Call B; debug loc2 │
>     └───────┬────────────┘              └───────────┬────────┘
>             │                                       │
>             │                                       │
>             │     ┌─────────────────────────┐       │
>             └─────► Call malloc; debug loc3 ◄──────┘
>                 B └─────────────────────────┘
>
> There will be two MIB entries, one for each context (A->B and C->B). As
> noted earlier, the MIB profile entries will be owned by the function
> calling the allocation function. Therefore, we will keep both MIB entries
> associated with function B in the profile.
>
> If early inlining (i.e. before profile matching) inlines B into A but not
> into C, it will look like the following when we try to match the profile:
>
>
>                                      ┌────────────────────┐
>     ┌────────────────────────┐     C │ Call B; debug loc2 │
>   A │ Call malloc; debug loc3│       └───────────┬────────┘
>     │ ; inlined at           │                   │
>     │ ; debug loc1           │    ┌──────────────┴───────────┐
>     └────────────────────────┘  B │ Call malloc; debug loc3  │
>                                   └──────────────────────────┘
>
> Because the MIB corresponding to the A->B context is associated with
> function B in the profile, we do not find it by looking at function A’s
> profile when we see function A’s malloc call during matching. To address
> this, we need to keep a correspondence from debug locations to the
> associated profile information. The details of this design will be shared
> in a separate RFC in the future.
>
> [1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
>
> [2] https://git.io/JzdRa
>
> [3] https://git.io/JzdRR
>
> [4] https://git.io/JzdRN
>
>