[llvm-dev] RFC: A binary serialization format for MemProf
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Wed Sep 29 17:46:31 PDT 2021
FYI you can also view the RFC here:
https://groups.google.com/g/llvm-dev/c/h1DvHguLpxU which displays the
diagrams better (without extra-wide spacing).
David
On Wed, Sep 29, 2021 at 3:17 PM Snehasish Kumar <snehasishk at google.com>
wrote:
> This RFC contains the following:
>
> * Proposal to introduce a new raw binary serialization format for heap
> allocation profiles
>
> * Proposal to extend the PGO indexed format to hold heap allocation
> profiles
> We look forward to your feedback on the proposals.
> Authors: snehasishk at google.com, davidxl at google.com, tejohnson at google.com
>
> Introduction
>
> ------------
>
> The design of a sanitizer-based heap profiler (MemProf) was shared with
> llvm-dev in June 2020 [1]. Since then, Teresa (@tejohnson) has added a
> sanitizer-based heap profiler which can be enabled with -fmemory-profile.
> Today it writes out the data in a text format which can be inspected by
> users. We have used this to drive analyses of heap behavior at Google.
> This RFC shares the details of a binary serialization format for heap
> profiling data which can then be reused by the compiler to guide
> optimizations, similar to traditional PGO.
>
> Similar to the existing instrumentation-based PGO, the binary heap profile
> data for PGHO has two forms. One is the raw format written by the profiler
> runtime, and the other is the indexed format used by the compiler. The
> indexed profile is generated offline from the raw profile data by
> llvm-profdata. This allows a single binary profile file to hold both the
> PGO and Memprof profiling data. Fig 1 below shows the binary format
> generation and use.
>
>
> ┌──────────────────┐  Raw Profile  ┌───────────────┐   Indexed Profile (v8)
> │ Compiler Runtime ├─────────────►│ llvm-profdata ├───► with Memprof data ───► -fprofile-use
> └──────────────────┘               └───────────────┘
>
> Fig 1: Memprof binary profile lifecycle
>
> Raw Binary Format
>
> -----------------
>
> The raw binary format contains four sections:
>
> 1. Memprof raw profile header
>
> 2. Memory mapping layout
>
> 3. Memory Info Block (MIB) records
>
> 4. Call stack information
>
>             +----------------------+
>             |        Magic         |
>             +----------------------+  H
>             |       Version        |  E
>             +----------------------+  A
>             |      Total Size      |  D
>             +----------------------+  E
>   +---------|      Map Offset      |  R
>   |         +----------------------+
>   |  +------|      MIB Offset      |
>   |  |      +----------------------+
>   |  |      |  Call Stack Offset   |---------+
>   |  |      +----------------------+         |
>   +-------->|      Number of       |  M      |
>      |      |     Map Entries      |  A      |
>      |      +----------------------+  P      |
>      |      |                      |         |
>      |      |      Map Entry       |  S      |
>      |      +----------------------+  E      |
>      |      |         ...          |  C      |
>      |      +----------------------+  T      |
>      |      |                      |  I      |
>      |      |      Map Entry       |  O      |
>      |      +----------------------+  N      |
>      |                                       |
>      |      +----------------------+         |
>      +----->|      Number of       |  M      |
>             |     MIB Entries      |  I      |
>             +----------------------+  B      |
>             |                      |         |
>             |      MIB Entry       |  S      |
>             +----------------------+  E      |
>             |         ...          |  C      |
>             +----------------------+  T      |
>             |                      |  I      |
>             |      MIB Entry       |  O      |
>             +----------------------+  N      |
>                                              |
>             +----------------------+         |
>             |      Number of       |<--------+
>             |  Call Stack Entries  |  S  S
>             +----------------------+  T  E
>             |   Call Stack Entry   |  A  C
>             +----------------------+  C  T
>             |         ...          |  K  I
>             +----------------------+     O
>             |   Call Stack Entry   |     N
>             +----------------------+
>
> Fig 2: Memprof Raw Format
>
>
> * Memprof raw profile header
>
> The header consists of a unique identifier (magic), a version number, the
> total size of the profile, as well as an offset into the profile for each
> of the following sections: memory mapping layout, memprof profile entries
> and call stack information. We do not intend to maintain backwards
> compatibility for the raw binary format.
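>
> As a rough illustration (not the actual runtime struct; the names below
> are hypothetical), the header can be viewed as six 8-byte fields, which
> matches the 48-byte header size reported in the "Raw Profile
> Characteristics" section below:
>
> #include <cstdint>
>
> // Hypothetical layout: 6 x 8-byte fields = 48 bytes.
> struct MemprofRawHeader {
>   uint64_t Magic;       // Unique identifier for memprof raw profiles.
>   uint64_t Version;     // Raw format version; no backwards compatibility.
>   uint64_t TotalSize;   // Total size of the profile in bytes.
>   uint64_t MapOffset;   // Byte offset of the memory mapping section.
>   uint64_t MIBOffset;   // Byte offset of the MIB records section.
>   uint64_t StackOffset; // Byte offset of the call stack section.
> };
> static_assert(sizeof(MemprofRawHeader) == 48, "expected 48-byte header");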
>
> * Memory mapping layout
>
> The process memory mappings for the executable segment during profiling
> are stored in this section. This allows symbolization during post
> processing for binaries which are built with position independent code.
> For now, all read-only executable mappings are recorded; in the future,
> mappings for heap data can also potentially be stored. For each mapping,
> we record the following (a sketch of one possible entry layout is shown
> after this list):
>
>   - Start (virtual address)
>   - End (virtual address)
>   - Offset (into the file it was mmap-ed from)
>   - BuildId (linker-generated hash, -Wl,--build-id)
>
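> For illustration only, here is one possible entry layout consistent with
> the fields above and the 64-byte entry size reported later; the 32-byte
> build id field width and all names are assumptions:
>
> #include <cstdint>
>
> // Hypothetical 64-byte map entry.
> struct MemprofMapEntry {
>   uint64_t Start;       // Start virtual address of the mapping.
>   uint64_t End;         // End virtual address of the mapping.
>   uint64_t Offset;      // Offset into the file it was mmap-ed from.
>   uint8_t BuildId[32];  // Linker-generated build id (-Wl,--build-id).
>   uint64_t BuildIdSize; // Number of valid bytes in BuildId.
> };
> static_assert(sizeof(MemprofMapEntry) == 64, "expected 64-byte map entry");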
>
> * MIB Records
>
> The profile information we collect is currently defined in [2]. The
> section begins with an 8-byte entry containing the number of profile
> entries in this section. Each entry is uniquely identified by the dynamic
> calling context of the allocation site. For each entry, various metrics
> such as access counts, allocation sizes and object lifetimes are computed
> via profiling. This section may contain multiple entries identified by the
> same callstack id. Subsequent processing to convert and merge multiple raw
> profiles will deduplicate any such entries.
>
> * Call stack information
>
> Each memprof profile entry is uniquely identified by its dynamic calling
> context. In this section we record the identifier and its corresponding
> call stack trace. We use the identifier provided by the sanitizer stack
> depot and serialize the trace for each one without deduplication. The
> section begins with an 8-byte entry containing the number of call stack
> entries. Each call stack entry contains an 8-byte field which denotes how
> many frames are recorded in the entry. Each frame is identified by an
> 8-byte program counter which holds the virtual address of the call
> instruction minus 1. Further deduplication is possible, though we do not
> do so at this time.
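>
> A sketch of how a reader might walk this section under the layout just
> described (an 8-byte count, then per entry an 8-byte stack id, an 8-byte
> frame count and one 8-byte PC per frame); this is an illustration, not
> the actual reader:
>
> #include <cstdint>
> #include <map>
> #include <vector>
>
> std::map<uint64_t, std::vector<uint64_t>>
> readCallStacks(const uint64_t *Ptr) {
>   std::map<uint64_t, std::vector<uint64_t>> Stacks;
>   uint64_t NumEntries = *Ptr++;
>   for (uint64_t I = 0; I < NumEntries; ++I) {
>     uint64_t StackId = *Ptr++;   // Sanitizer stack depot identifier.
>     uint64_t NumFrames = *Ptr++; // Frames recorded for this entry.
>     std::vector<uint64_t> PCs(Ptr, Ptr + NumFrames);
>     Ptr += NumFrames;            // PCs hold call address - 1.
>     Stacks.emplace(StackId, std::move(PCs));
>   }
>   return Stacks;
> }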
>
> * Raw Profile Characteristics
>
> To understand the characteristics of raw binary profiles generated by
> Memprof, we experimented with a clang bootstrap build. We ran 500
> invocations of Memprof-instrumented clang on clang source code. Each
> invocation produced a raw binary profile and we present some aggregate
> information about them below:
>
> +---------------------------------+--------+---------+---------+
> |                                 |    Min |  Median |     Max |
> +---------------------------------+--------+---------+---------+
> | Unique allocation contexts      |    940 |   10661 |   35355 |
> +---------------------------------+--------+---------+---------+
> | MIB records size (bytes)        | 101528 | 1151396 | 3818348 |
> +---------------------------------+--------+---------+---------+
> | Call stack section size (bytes) |  31080 |  419048 | 1439144 |
> +---------------------------------+--------+---------+---------+
> | Total Size (bytes)              | 133336 | 1571680 | 5258220 |
> +---------------------------------+--------+---------+---------+
>
> The size of the header is 48 bytes and the number of read-only executable
> memory maps is usually small (12 in this case), with each map entry
> consuming 64 bytes. We find that the raw profiles are highly compressible:
> on average, files are compressed by 81%. The largest raw profile in the
> dataset is ~5M in size; it is compressed to 975K using zip on default
> settings. In contrast, for the same clang build, the instrumented PGO raw
> profile is ~21M in size (zip compresses it by 73%). Note that the Memprof
> profile size is proportional to the number of allocation contexts
> encountered during profiling.
>
> Since these are profiles from individual invocations, they must be merged
> before use. This is performed implicitly by llvm-profdata during the
> conversion to the indexed profile format. MIBs are merged based on their
> call stack, as sketched below.
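>
> As an illustration of the merge step (the actual llvm-profdata logic may
> differ), two entries with the same call stack could be combined roughly as
> follows; the fields mirror a subset of the schema tags proposed later, and
> merged averages are reweighted by allocation count:
>
> #include <algorithm>
> #include <cstdint>
>
> struct MIB {
>   uint64_t AllocCount;
>   uint64_t MinSize, MaxSize, AveSize;
> };
>
> // Merge B into A, assuming both describe the same allocation context.
> void mergeMIB(MIB &A, const MIB &B) {
>   A.MinSize = std::min(A.MinSize, B.MinSize);
>   A.MaxSize = std::max(A.MaxSize, B.MaxSize);
>   A.AveSize = (A.AveSize * A.AllocCount + B.AveSize * B.AllocCount) /
>               (A.AllocCount + B.AllocCount);
>   A.AllocCount += B.AllocCount;
> }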
>
> Memprof extensions for the indexed profile format
>
> --------------------------------------------------
>
> +----------------+
> |     MAGIC      |
> +----------------+
> |    VERSION     |
> +----------------+
> |    HASHTYPE    |
> +--+----------+--+
> | HASHTAB OFFSET |-------+
> +--+----------+--+       |
> +----------------+       |
> |                |       |
> |    PROFILE     |       |
> |    SUMMARY     |       |
> |      DATA      |       |
> +----------------+       |
> +----------------+  <----+
> |                |
> |     OnDisk     |
> |    Chained     |
> |   HashTable    |
> |                |
> +----------------+
>
> Fig 3: Existing PGO indexed profile format
>
> During the offline processing step (using llvm-profdata), allocation
> contexts are pruned and merged. The end result is a collection of unique
> allocation contexts with access, size and lifetime properties. Contexts are
> uniquely identified based on the call stack and are stored using a prefix
> deduplication scheme described in Section “Symbolized Memprof call stack
> section”.
>
> To fit into the PGO profile format, we need to index the profile by
> function name. The only functions that own Memprof profile data are the
> direct callers of the allocator primitive functions. Thus the profile data
> mapping in the IR must account for potentially missing frames. The
> implications for matching the profile data with the IR are touched upon in
> the section “Profile Data matching in IR” and will be further detailed in
> an upcoming RFC.
>
> The Memprof profile data for a particular function can then be just an
> array of MIB entries. One allocation site in the function can have
> multiple MIB entries, each corresponding to one allocation context.
>
> The changes to the existing PGO indexed format are summarized as follows
> (see the sketch after this list):
>
>   - Augment the profile record data structure to include an optional MIB
>     array after the value profile data [3].
>   - Add one additional section before the existing OnDiskChainedHashTable
>     to store the allocation stacks (referenced by the MIBs). This section
>     is placed after the profile summary data section [4].
>   - Bump the version number.
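>
> A hedged sketch of the augmented in-memory record; the existing
> InstrProfRecord members are elided and the names below are illustrative:
>
> #include <cstdint>
> #include <vector>
>
> struct MemprofInfoBlock {
>   uint64_t StackID; // Key into the new allocation stack section.
>   // ... remaining schema fields (AllocCount, AveSize, ...) ...
> };
>
> struct InstrProfRecordExt {
>   // ... existing members (counters, value profile data) ...
>   std::vector<MemprofInfoBlock> MIBs; // Optional; empty without Memprof.
> };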
>
>
>
> Memprof portable entry format
>
> -----------------------------
>
> The raw memprof profile entry format is subject to change across versions.
> However, the indexed profile entry must be backwards compatible to ensure
> that the PGO profile as a whole remains backwards compatible. We propose a
> schema-based format: a per-function description of each Memprof profile
> entry, where each field is identified by a tag. We propose the following
> schema:
>
> struct MIBMeta {
>
>   enum MIBFieldTag {
>     // The tag ids remain unchanged across versions.
>     Unused = 0,
>     StackID = 1,
>     AllocCount = 2,
>     AveSize = 3,
>     MinSize = 4,
>     MaxSize = 5,
>     AveAccessCount = 6,
>     MinAccessCount = 7,
>     MaxAccessCount = 8,
>     AveLifetime = 9,
>     MinLifetime = 10,
>     MaxLifetime = 11,
>     NumMigration = 12,
>     NumLifetimeOverlaps = 13,
>     NumSameAllocCPU = 14,
>     NumSameDeallocCPU = 15
>   };
>
>   enum MIBFieldType {
>     // Renamed from Unused to avoid clashing with MIBFieldTag::Unused.
>     UnusedType = 0,
>     UINT8 = 1,
>     UINT16 = 2,
>     UINT32 = 3, // Varint encoded
>     UINT64 = 4  // Varint encoded
>   };
> };
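>
> The RFC does not pin down the exact varint flavor for the UINT32/UINT64
> fields; a standard unsigned LEB128-style encoding, sketched below, would
> be one natural choice, since small counts then take a single byte:
>
> #include <cstdint>
> #include <vector>
>
> // Unsigned LEB128-style varint: 7 bits per byte, high bit set on all
> // but the last byte.
> void encodeVarint(uint64_t Value, std::vector<uint8_t> &Out) {
>   do {
>     uint8_t Byte = Value & 0x7f;
>     Value >>= 7;
>     if (Value)
>       Byte |= 0x80; // More bytes follow.
>     Out.push_back(Byte);
>   } while (Value);
> }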
>
>
> // Mapping of tags to their descriptive names.
> const char *MIBFieldName[] = {
>     "",
>     "StackID",
>     "AllocCount",
>     ...
>     "NumSameDeallocCPU"
> };
>
> // Mapping of tags to their types. (Named MIBFieldTypes to avoid
> // colliding with the MIBFieldType enum above.)
> uint8_t MIBFieldTypes[] = {
>     0, // unused
>     MIBMeta::MIBFieldType::UINT64,
>     ...
> };
>
> To keep the field tag, field name, and field type declarations in sync, a
> shared .inc file will be used. This file will be shared between compiler-rt
> and the llvm/lib/ProfileData libraries, since dependencies from compiler-rt
> on the rest of LLVM are discouraged for isolation.
>
> // MIBDef.inc file
> // Define MIBEntryDef(tag, name, type) before inclusion.
> MIBEntryDef(StackID = 1, "StackID", MIBMeta::MIBFieldType::UINT64)
> MIBEntryDef(AllocCount = 2, "AllocCount", MIBMeta::MIBFieldType::UINT32)
> ...
> MIBEntryDef(NumSameDeallocCPU = 15, "NumSameDeallocCPU",
>             MIBMeta::MIBFieldType::UINT8)
>
>
> enum MIBFieldTag {
>   StartTag = 0,
> #define MIBEntryDef(tag, name, type) tag,
> #include "MIBDef.inc"
> #undef MIBEntryDef
> };
>
> const char *MIBFieldName[] = {
>   "", // StartTag placeholder.
> #define MIBEntryDef(tag, name, type) name,
> #include "MIBDef.inc"
> #undef MIBEntryDef
> };
>
> uint8_t MIBFieldTypes[] = {
>   0, // StartTag placeholder.
> #define MIBEntryDef(tag, name, type) type,
> #include "MIBDef.inc"
> #undef MIBEntryDef
> };
>
> The InstrProfRecord for each function will hold the schema and an array of
> Memprof info blocks, one for each unique allocation context.
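>
> To illustrate how the schema keeps the entry format backwards compatible,
> field decoding can be driven entirely by the (tag, type) pairs stored in
> the profile, so a reader can skip tags it does not understand by size. A
> sketch, reusing the illustrative MIBMeta definitions above:
>
> #include <cstdint>
>
> // Decode one field of the given type; returns the advanced pointer.
> const uint8_t *readField(const uint8_t *Ptr, uint8_t Type,
>                          uint64_t &Value) {
>   switch (Type) {
>   case MIBMeta::MIBFieldType::UINT8:
>     Value = *Ptr;
>     return Ptr + 1;
>   case MIBMeta::MIBFieldType::UINT16:
>     Value = Ptr[0] | (uint64_t(Ptr[1]) << 8);
>     return Ptr + 2;
>   default: // UINT32/UINT64 are varint encoded; see the earlier sketch.
>     Value = 0;
>     for (unsigned Shift = 0;; Shift += 7) {
>       Value |= uint64_t(*Ptr & 0x7f) << Shift;
>       if (!(*Ptr++ & 0x80))
>         break;
>     }
>     return Ptr;
>   }
> }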
>
> Symbolized Memprof call stack section
>
> -------------------------------------
>
> This section holds the symbolized version of the call stack section in the
> raw profile. This is necessary to enable the compiler to map recorded
> runtime PC addresses to source locations during recompilation. For space
> efficiency, this section is split into three subsections: (1) a stack
> entry table, (2) a file path table, and (3) a string table.
>
> Fig 4 shows the relationship between the three tables.
>
>      STACK ENTRY             FILE PATH          STRING
>         TABLE                  TABLE             TABLE
>       ┌────────┐             ┌──┬──┬──┐         ┌─────────┐
>       │        │             │  │  │  │         │10 abc   │
>       │        │             ├──┼──┼──┤         ├─────────┤
>       │        │             │  │  │  │◄┐       │11 def   │
>       │        │             ├──┼──┼──┤ │       ├─────────┤
>       ├──┬──┬──┤             │  │  │  │ │       │12 ghi   │
>   ┌───┤03│LN│DI│             ├──┼──┼──┤ │       ├─────────┤
>   │   ├──┴──┴──┤        ┌───►│03│13│01├─┘    ┌─►│13 XY.h  │
>   │   └────────┘        │    └──┴─┬┴──┘      │  └─────────┘
>   │                     │         │          │
>   └─────────────────────┘         └──────────┘
>
> Fig 4. The Stack Entry, File Path and String Table.
>
> LN = Line Number, DI = Discriminator
>
> * Stack Entry Table: Each uniquely identified symbolized call stack
> consists of a variable number of callsites. In the indexed format, each
> callsite is represented as file:line:discriminator (as shown in Fig 4).
> Each call stack location is a 64-bit field split into three subfields: the
> file table index, the line number, and the discriminator value (one
> possible packing is illustrated below). The file table index points to a
> leaf node in the prefix encoding scheme described below.
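>
> For example, the 64-bit location could be packed as below; the subfield
> widths are not fixed by this RFC, so the 32/24/8 split here is purely
> illustrative:
>
> #include <cstdint>
>
> // Pack file table index (32 bits), line (24 bits) and discriminator
> // (8 bits) into one 64-bit call stack location.
> uint64_t packLocation(uint32_t FileTableIndex, uint32_t Line,
>                       uint32_t Discriminator) {
>   return (uint64_t(FileTableIndex) << 32) |
>          (uint64_t(Line & 0xffffff) << 8) | (Discriminator & 0xff);
> }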
>
> * File path table using prefix encoding: Full path filenames share many
> common substrings due to the directory structure, so they can be
> compressed. One simple scheme is a reverse prefix tree representation: the
> name string of a directory at each level (not including prefixes) is
> represented by a node, and each node is linked to its parent node. To
> summarize, the file path table is represented as an array of nodes
> organized as a forest of reversed tree structures.
>
> For instance, the strings
>
>   “abc/def/ghi/XY.c”
>   “abc/def/ghi/XY.h”
>   “abc/def/jkl/UV.c”
>
> are represented as
>
>                +-----------+<------+
>                |    abc    |       |
>   +----->+---->+-----------+       |
>   |      |     |    def    +-------+
>   |      |     +-----------+
>   |      +-----+    ghi    |<-------+----+
>   |            +-----------+        |    |
>   +-+--------->|    jkl    |        |    |
>     |          +-----------+        |    |
>     |          |   XY.c    +--------+    |
>     |          +-----------+             |
>     +----------+   UV.c    |             |
>                +-----------+             |
>                |   XY.h    +-------------+
>                +-----------+
>
> Fig 5. Prefix tree representation of paths in the file path table
>
> Each parent link implies a path separator ‘/’. Furthermore, we represent
> each file or directory string as an integer offset into the string table
> (see Fig 4). Thus each node holds an offset into the string table and a
> pointer to its parent (interior) directory node, as sketched below.
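>
> A sketch of one possible node layout and of path reconstruction by
> walking parent links; the names and field widths are hypothetical:
>
> #include <cstdint>
> #include <string>
> #include <vector>
>
> struct FilePathNode {
>   uint32_t StringOffset; // Offset of this component in the string table.
>   uint32_t Parent;       // Index of the parent node; NoParent for roots.
>   static constexpr uint32_t NoParent = ~0u;
> };
>
> // Rebuild a full path by walking parent links from a leaf (file) node.
> std::string buildPath(const std::vector<FilePathNode> &Nodes,
>                       const char *StringTable, uint32_t Leaf) {
>   std::string Path(StringTable + Nodes[Leaf].StringOffset);
>   // Each parent link implies a path separator '/'.
>   for (uint32_t I = Nodes[Leaf].Parent; I != FilePathNode::NoParent;
>        I = Nodes[I].Parent)
>     Path = std::string(StringTable + Nodes[I].StringOffset) + "/" + Path;
>   return Path;
> }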
>
> * String table: To remove redundancies across prefix tree nodes in the
> file path encoding, we use a string table which maps each string to a
> unique id. The id can simply be the string's implicit offset into the
> table.
>
> Thus this representation exploits the redundancy in source file path
> prefixes, since there are typically a large number of source files in a
> small number of deeply nested directories.
>
> Symbolizing the raw PC addresses in post-processing using llvm-profdata
> requires the binary to also be provided as input. As an alternative, we
> will also experiment with incrementally generating the symbolized call
> stack section as part of the raw profile dump, at the cost of increased
> profiling overhead.
>
>
> Profile Data matching in IR
>
> —--------------------------
>
> When matching the profile data, we may have already early-inlined the
> direct caller of the allocation function into its caller(s). This means we
> need to take some additional steps to identify the matching MIBs. For
> example, consider the following partial call graph:
>
>
>
>     ┌────────────────────┐              ┌────────────────────┐
>   A │ Call B; debug loc1 │            C │ Call B; debug loc2 │
>     └───────┬────────────┘              └───────────┬────────┘
>             │                                       │
>             │                                       │
>             │     ┌─────────────────────────┐       │
>             └─────► Call malloc; debug loc3 ◄──────┘
>                 B └─────────────────────────┘
>
> There will be two MIB entries, one for each context (A->B and C->B). As
> noted earlier, the MIB profile entries will be owned by the function
> calling the allocation function. Therefore, we will keep both MIB entries
> associated with function B in the profile.
>
> If early inlining (i.e. before profile matching) inlines B into A but not
> into C, it will look like the following when we try to match the profile:
>
>
>                                      ┌────────────────────┐
>     ┌────────────────────────┐     C │ Call B; debug loc2 │
>   A │ Call malloc; debug loc3│       └───────────┬────────┘
>     │ ; inlined at           │                   │
>     │ ; debug loc1           │    ┌──────────────┴───────────┐
>     └────────────────────────┘  B │ Call malloc; debug loc3  │
>                                   └──────────────────────────┘
>
> Because the MIB corresponding to the A->B context is associated with
> function B in the profile, we do not find it by looking at function A’s
> profile when we see function A’s malloc call during matching. To address
> this, we need to keep a correspondence from debug locations to the
> associated profile information. The details of this design will be shared
> in a separate RFC in the future.
>
> [1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html
>
> [2] https://git.io/JzdRa
>
> [3] https://git.io/JzdRR
>
> [4] https://git.io/JzdRN
>
>