[llvm-dev] RFC: A binary serialization format for MemProf

Hongtao Yu via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 4 12:10:11 PDT 2021


Hello Snehasish,

It’s great to see this RFC about the profile format of the heap profiler since last the RFC. I have a couple questions about how calling contexts are stored and processed by the compiler:

  1.  How are recursive allocation contexts stored? Wondering if there’s any recursive compression performed. For example, a tree-based construction algorithm may create tree nodes recursively. Is each tree node object modeled by its unique dynamic context?
  2.  Will the contexts of a leaf function merged during compilation when the leaf function is not inlined? If so, where does the merging happen?

We have seen similar issues when working on CSSPGO (Context-sensitive PGO). I’m wondering if they could be a problem here.

Thanks,
Hongtao


From: Snehasish Kumar <snehasishk at google.com>
Date: Wednesday, September 29, 2021 at 3:17 PM
To: llvm-dev <llvm-dev at lists.llvm.org>, Vedant Kumar <vsk at apple.com>, Wenlei He <wenlei at fb.com>, andreybokhanko at gmail.com <andreybokhanko at gmail.com>, David Li <davidxl at google.com>, Teresa Johnson <tejohnson at google.com>
Subject: RFC: A binary serialization format for MemProf
This RFC contains the following:


* Proposal to introduce a new raw binary serialization format for heap allocation profiles

* Proposal to extend the PGO indexed format to hold heap allocation profiles
We look forward to your feedback on the proposals.
Authors: snehasishk at google.com<mailto:snehasishk at google.com>, davidxl at google.com<mailto:davidxl at google.com>, tejohnson at google.com<mailto:tejohnson at google.com>


Introduction

—-----------

The design of a sanitizer-based heap profiler (MemProf) was shared with llvm-dev in Jun 2020 [1]. Since then Teresa (@tejohnson) has added a sanitizer based heap profiler which can be enabled with -fmemory-profile. Today it writes out the data in a text format which can be inspected by users. We have used this to drive analyses of heap behaviour at Google. This RFC shares details on a binary serialization format for heap profiling data which can then be reused by the compiler to guide optimizations similar to traditional PGO.


Similar to the existing instrumentation based PGO, the binary heap profile data for PGHO has two forms. One is the raw data format that is used by the profiler runtime, and the other is the indexed profile format used by the compiler. The profile data with the indexed profile data format will be generated by llvm-profdata from the raw profile data offline. This allows a single binary profile file to hold the PGO and Memprof profiling data. Fig 1 below shows the binary format generation and use.


  ┌──────────────────┐ Raw Profile ┌───────────────┐     Indexed Profile (v8)

  │ Compiler Runtime ├─────────────► llvm-profdata ├───► with Memprof data ───► -fprofile-use

  └──────────────────┘             └───────────────┘


Fig 1: Memprof binary profile lifecycle


Raw Binary Format

—----------------

The raw binary format contains 4 sections


1. Memprof raw profile header

2. Memory mapping layout

3. Memory Info Block (MIB) records

4. Call stack information


          +----------------------+

          |  Magic               |

          +----------------------+   H

          |  Version             |   E

          +----------------------+   A

          |  Total Size          |   D

          +----------------------+   E

+---------|  Map Offset          |   R

|         +----------------------+

|   +-----|  MIB Offset          |

|   |     +----------------------+

|   |     |  Call Stack Offset   |---------+

|   |     +----------------------+         |

+-------->|  Number of           |   M     |

    |     |  Map Entries         |   A     |

    |     +----------------------+   P     |

    |     |                      |         |

    |     |  Map Entry           |   S     |

    |     +----------------------+   E     |

    |     |    ...               |   C     |

    |     +----------------------+   T     |

    |     |                      |   I     |

    |     |  Map Entry           |   O     |

    |     +----------------------+   N     |

    |                                      |

    |                                      |

    |     +----------------------+         |

    +---->|   Number of          |   M     |

          |   MIB Entries        |   I     |

          +----------------------+   B     |

          |                      |         |

          |   MIB Entry          |   S     |

          +----------------------+   E     |

          |    ...               |   C     |

          +----------------------+   T     |

          |                      |   I     |

          |   MIB Entry          |   O     |

          +----------------------+   N     |

                                           |

          +----------------------+         |

          |   Number of          |<--------+

          |   Call Stack Entries |  S  S

          +----------------------+  T  E

          |   Call Stack Entry   |  A  C

          +----------------------+  C  T

          |   ...                |  K  I

          +----------------------+     O

          |   Call Stack Entry   |     N

          +----------------------+


Fig 2: Memprof Raw Format


* Memprof raw profile header

The header consists of a unique identifier, version number, total size of the profile as well as offset into the profile for each of the following sections - memory mapping layout, memprof profile entries and call stack information. We do not intend to maintain backwards compatibility for the raw binary format.


* Memory mapping layout

The process memory mappings for the executable segment during profiling are stored in this section. This allows symbolization during post processing for binaries which are built with position independent code. For now all read only, executable  mappings are recorded, however in the future, mappings for heap data can also potentially be stored. For each mapping, we record
·         Start (virtual address)
·         End (virtual address)
·         Offset (from the file it was mmap-ed from)
·         buildId (linker generated hash -Wl,build-id


*  MIB Records

The profile information we collect is currently defined in [2]. The section begins with a 8 byte entry containing the number of profile entries in this section. Each entry is uniquely identified by the dynamic calling context of the allocation site. For each entry, various metrics such as access counts, allocation sizes, object lifetimes etc are computed via profiling. This section may contain multiple entries identified by the same callstack id. Subsequent processing to convert and merge multiple raw profiles will deduplicate any such entries.


* Call stack information
Each memprof profile entry is uniquely identified by its dynamic calling context. In this section we record the identifier and its corresponding call stack trace. We use the sanitizer stack depot provided identifier and serialize the trace for each without deduplication. The section begins with a 8b entry containing the number of call stack entries. Each call stack entry contains a 8b field which denotes how many contexts are recorded in this entry. Each frame is identified by an 8b program counter address which holds the call instruction virtual address - 1. Further deduplication is possible though we do not do so at this time.


* Raw Profile Characteristics

To understand the characteristics of raw binary profiles generated by Memprof, we experimented with a clang bootstrap build. We ran 500 invocations of Memprof-instrumented clang on clang source code. Each invocation produced a raw binary profile and we present some aggregate information about them below:


+---------------------------------+--------+---------+---------+

|                                 | Min    | Median  | Max     |

+---------------------------------+--------+---------+---------+

| Unique allocation contexts      | 940    | 10661   | 35355   |

+---------------------------------+--------+---------+---------+

| MIB records size (bytes)        | 101528 | 1151396 | 3818348 |

+---------------------------------+--------+---------+---------+

| Call stack section size (bytes) | 31080  | 419048  | 1439144 |

+---------------------------------+--------+---------+---------+

| Total Size (bytes)              | 133336 | 1571680 | 5258220 |

+---------------------------------+--------+---------+---------+


The size of the header is 48 bytes and the number of read only executable memory maps is usually small (12 in this case) with each map entry consuming 64 bytes. We find that the raw profiles are highly compressible, on average files are compressed by 81%. The largest raw profile in the dataset is ~5M in size. It is compressed to 975K using zip on default settings. In contrast for the same clang build, the instrumented PGO raw profile is ~21M in size (zip compressed 73%). Note that the Memprof profile size is proportional to the number of allocation contexts during profiling.


Since these are profiles from individual invocations, they must be merged before use. This is performed implicitly during the conversion to indexed profile format by llvm-profdata. MIBs are merged based on their call stack.


Memprof extensions for the indexed profile format

—------------------------------------------------


+----------------+

|   MAGIC        |

+----------------+

|   VERSION      |

+----------------+

|   HASHTYPE     |

+--+----------+--+

|HASHTAB OFFSET  |-------+

+--+----------+--+       |

+----------------+       |

|                |       |

|  PROFILE       |       |

|  SUMMARY       |       |

|  DATA          |       |

+----------------+       |

+----------------+ <----+

|                |

|   OnDisk       |

|   Chained      |

|   HashTable    |

|                |

+----------------+


Fig 3: Existing PGO indexed profile format


During the offline processing step (using llvm-profdata), allocation contexts are pruned and merged. The end result is a collection of unique allocation contexts with access, size and lifetime properties. Contexts are uniquely identified based on the call stack and are stored using a prefix deduplication scheme described in Section “Symbolized Memprof call stack section”.


To fit into the PGO profile format, we need to index the profile using the function name. The only functions that own Memprof profile data are those direct callers of the allocator primitive functions. Thus the profile data mapping in the IR must account for potentially missing frames. Implications on matching of the profile data with the IR is touched upon in Section “Profile Data matching in IR” and will be further detailed in an upcoming RFC.


The Memprof profile data for a particular function can then be just an array of MIB entries. One allocation site in the function can have multiple MIB entries each one of them corresponding to one allocation context.


The change to the existing PGO indexed format is summarized as:
·         Augment the profile record data structure to include an optional MIB array after the value profile data [3].
·         Add one additional section before the existing OnDiskChainedHashtable to store the allocation stacks (referenced by the MIBs). This section is after the profile summary data section [4].
·         Bump the version number.


Memprof portable entry format

—----------------------------

The raw memprof profile entry format is subject to change with version. However, the indexed profile entry must be backwards compatible to ensure that the PGO profile as a whole is backwards compatible. We propose a schema based format - per function description of a Memprof profile entry. Each field is identified by a tag. We propose the following schema:


struct MIBMeta {

       enum MIBFieldTag {

          // The tag ids remain unchanged.

          Unused = 0,

          StackID = 1,

          AllocCount = 2,

          AveSize = 3,

          MinSize = 4,

          MaxSize = 5,

          AveAccessCount = 6,

          MinAccessCount = 7,

          MaxAccessCount = 8,

          AveLifetime = 9,

          MinLifetime = 10,

          MaxLifetime = 11,

          NumMigration = 12,

          NumLifetimeOverlaps = 13,

          NumSameAllocCPU = 14,

          NumSameDeallocCPU = 15

       };


       enum MIBFieldType {

         Unused = 0,

         UINT8 = 1,

         UINT16 = 2,

         UINT32 = 3,      // Varint encoded

         UINT64 = 4,      // Varint encoded

      };

};


// Mapping of tags to their descriptive names

const char *MIBFieldName[] = {

        "",

        "StackID",

        "AllocCount",

         …

        "NumSameDeallocCPU"

};


// Mapping of tags to their types

uint8 MIBFieldType []  = {

         0, // unused

         MIBMeta::MIBFieldType::UINT64,

            ….

};


To make the field tag, field name, and field type declarations always in sync, a shared .inc file will be used. This file will be shared between compiler-rt and llvm/lib/ProfileData libraries. Dependencies across the compiler-rt project are not recommended for isolation.


// MIBDef.inc file

// define  MIBEntryDef(tag, name, type) before inclusion

MIBEntryDef(StackID = 1, "StackID",

            MIBMeta::MIBFieldType::UINT64)

MIBEntryDef(AllocCount = 2, "AllocCount",

            MIBMeta::MIBFieldType::UINT32)

...

MIBEntryDef(NumSameDeallocCPU=13,"NumSameDeallocCPU",

            MIBMeta::MIBFieldType:UINT8)


enum MIBFieldTag {

   StartTag = 0,

   #define MIBEntryDef(tag, name, type) tag

   #include "MIBDef.inc"

   #undef MIBEntryDef

};


const char *MIBFieldName {

    #define MIBEntryDef(tag, name, type) name

    "",

    #include "MIBEntryDef.inc"

    #undef MIBEntryDef

};


uint8 MIBFieldType {

   #define MIBEntryDef(tag, name, type) type

   0, // not used

   #include "MIBEntryDef.inc"

   #undef MIBDefEntry

};


The InstrProfRecord for each function will hold the schema and an array of Memprof info blocks, one for each unique allocation context.


Symbolized Memprof call stack section

—------------------------------------


This section holds the symbolized version of the call stack section in the raw profile. This is necessary to enable the compiler to map recorded runtime PC addresses to source locations during recompilation. For space efficiency, this section is split into three subsections: 1. stack entry table 2. file path table 3. string table.


Fig 4 shows the relationship between the three tables.


     STACK ENTRY        FILE PATH         STRING

        TABLE             TABLE           TABLE

     ┌────────┐        ┌──┬──┬──┐       ┌─────────┐

     │        │        │  │  │  │       │10 abc   │

     │        │        ├──┼──┼──┤       ├─────────┤

     │        │        │  │  │  │◄┐     │11 def   │

     │        │        ├──┼──┼──┤ │     ├─────────┤

     ├──┬──┬──┤        │  │  │  │ │     │12 ghi   │

 ┌───┤03│LN│DI│      ├──┼──┼──┤ │     ├─────────┤

 │   ├──┴──┴──┤   ┌───►│03│13│01├─┘  ┌─►│13 XY.h  │

 │   └────────┘   │    └──┴─┬┴──┘    │  └─────────┘

 │                │         │        │

 └────────────────┘         └────────┘


Fig 4. The Stack Entry, File Path and String Table.

       LN = Line Number, DI = Discriminator


* Stack Entry Table: Each uniquely identified symbolized call stack consists of a variable number of callsites. In the indexed format, each callsite needs to be represented as file:line:discriminator (as shown in Fig 4). The call stack location is a 64-bit field, but it is split into three subfields: file table index, line number, and the discriminator value. The file table index is a pointer to a leaf node in the prefix encoded scheme described below.


* File path table using prefix encoding: Full path filenames have lots of common substrings due to the directory structure so they can be compressed. One simple scheme is to use the reverse prefix tree representation. In this representation, the name string of a directory at each level (not including prefixes) is represented by a node, and it is linked to its parent node. To summarize, the file path table is represented as an array of nodes organized as a forest of reversed tree structures.


For instance, the strings
“abc/def/ghi/XY.c”,
“abc/def/ghi/XY.h”,

“abc/def/jkl/UV.c”

are represented as


             +-----------+<------+

             |  abc      |       |

+----->+---->+-----------+       |

|      |     |  def      +-------+

|      |     +-----------+

|      +-----+  ghi      |<-------+----+

|            +-----------+        |    |

+-+--------->|  jkl      |        |    |

  |          +-----------+        |    |

  |          |  XY.c     +--------+    |

  |          +-----------+             |

  +----------+  UV.c     |             |

             +-----------+             |

             |  XY.h     +-------------+

             +-----------+


Fig 5. Prefix tree representation of paths in the file path table


Each parent link implies a path separator ‘/’. Furthermore we represent each file or directory string as an integer offset into the string table (see Fig 4). Thus each node holds an offset into the string table and a pointer to the parent (interior) directory node.


* String table: To remove redundancies across prefix tree nodes in the file path encoding, we use a string table which stores the mapping of string to a unique id. The id can be simplified as the implicit offset into the table.


Thus this representation exploits redundancy in the source file path prefix where there are a large number of source files in a small number of deeply nested directories.


Symbolizing the raw PC addresses in post-processing using llvm-profdata requires the binary to also be provided as input. As an alternative, we will also experiment with incrementally generating the symbolized call stack section as part of the raw profile dump at the cost of increased profiling overhead.


Profile Data matching in IR

—--------------------------


When matching the profile data, we may have already early inlined the direct caller of the allocation into its caller(s). This means we need to take some additional steps to identify the matching MIBs. For example, consider the following partial call graph:



   ┌────────────────────┐           ┌────────────────────┐

 A │ Call B; debug loc1 │         C │ Call B; debug loc2 │

   └───────┬────────────┘           └───────────┬────────┘

           │                                    │

           │                                    │

           │    ┌─────────────────────────┐     │

           └────► Call malloc; debug loc3 ◄─────┘

              B └─────────────────────────┘


There will be 2 MIB entries, one for each context (A->B and C->B). As noted earlier, the MIB profile entries will be owned by the function calling the allocation function. Therefore, we will keep both MIB entries associated with function B in the profile.


If early inlining (i.e. before profile matching) inlines B into A but not into C it will look like the following when we try to match the profile:


                                          ┌────────────────────┐

    ┌────────────────────────┐          C │ Call B; debug loc2 │

  A │Call malloc; debug loc3 │            └───────────┬────────┘

    │           ; inlined at │                        │

    │           ; debug loc1 │         ┌──────────────┴───────────┐

    └────────────────────────┘       B │ Call malloc; debug loc3  │

                                       └──────────────────────────┘


Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information. The details of the design will be shared in a separate RFC in the future.


[1] https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html<https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html>

[2] https://git.io/JzdRa<https://git.io/JzdRa>

[3] https://git.io/JzdRR<https://git.io/JzdRR>

[4] https://git.io/JzdRN<https://git.io/JzdRN>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211004/f5d3cf51/attachment-0001.html>


More information about the llvm-dev mailing list