[llvm-dev] RFC: IR metadata format for MemProf

Wed Oct 27 13:01:51 PDT 2021

This RFC describes the IR metadata format for the sanitizer-based heap
profiler (MemProf) data when it is fed back into a subsequent compile for
profile guided heap optimization. Additional background can be found in the
following RFCs:

   -

   RFC: Sanitizer-based Heap Profiler [1]
   -

   RFC: A binary serialization format for MemProf [2]

We look forward to your feedback.
Authors: tejohnson at google.com, snehasishk at google.com, davidxl at google.com
Requirements

The profile format will need to be reasonably compact within IR, but also
facilitate efficiently identifying callsites within the stack contexts for
use in interprocedural (and cross module) optimization for context
disambiguation. These optimizations will require transformations to
disambiguate the context at a particular allocation, which will require
call graph based analysis and optimization.
Input Profile Format

The profile will be in an index binary format, which is detailed in [2]. In
the index format, the profiles for each allocation site will be located
with the profile data for the function containing the allocation call site.
Each allocation may have multiple profile entries (known as a MIB, or
Memory Info Block) uniquely identified by stack context. The entries in the
stack context will be symbolized and include file, line and discriminator.
Metadata Format

Similar to branch weights and value profile data from regular PGO, the PGHO
profile will be annotated as metadata onto relevant instructions. A natural
instruction to attach the profile metadata is on the allocation callsite,
so these allocation calls can be identified and handled by the subsequent
heap optimization pass. As an example, this profile data can be used to
enable automatic application of hot and cold hints to allocations, for use
by a runtime allocator such as tcmalloc, where support for such allocation
hints was recently added [3].

However, in order to identify ancestor callsites within an allocation’s
call stack context that require modification for disambiguating the context
at the allocation site, e.g. via cloning, we will also want to attach
metadata to these callsites. This is particularly important for contexts
that cross module boundaries, so that we can identify them in ThinLTO
summaries for cross module coordination of context transformations.

To identify and correlate entries in a context, we will use a unique
identifier for each stack entry. Specifically, we will use the 64-bit value
from the stack entry table in the indexed profile format which is formed
from the index into the file path table along with the line and
discriminator. Another option would be to represent the stack entries using
existing debug metadata. However, for stack entries in another module we
would need to synthesize additional debug location metadata in the module
containing MIB profile data that references that stack context entry.

Assume the following working example. For simplicity, all are shown as
being in the same module, however, these function definitions could
theoretically be located in multiple different modules.

x.cc

  1 main() {

  2    foo();  // stack entry id: 123

  3 }

  4

  5 foo() {

  6    baz();  // stack entry id: 234

  7 }

  8

  9 baz() {

 10    if (x)

 11       bar();  // stack entry id: 345

 12    else

 13       bar();  // stack entry id: 456

 14 }

 15

 16 bar() {

 17    malloc(4);  // stack entry id: 567

 18 }

The call to malloc has 2 possible calling contexts:

   1.

   main -> foo (x.cc:2) -> baz (x.cc:6) -> bar (x.cc:11) -> malloc (x.cc:17)
   2.

   main -> foo (x.cc:2) -> baz (x.cc:6) -> bar (x.cc:13) -> malloc (x.cc:17)

where the stack entry id for each callsite, taken from the profile’s stack
entry table contents, is shown in the code comments. The corresponding full
contexts in terms of stack entry ids, listed from the leaf allocation
callsite up to the root are:

   1.

   567, 345, 234, 123
   2.

   567, 456, 234, 123

Assuming both contexts execute at runtime, the allocation will end up with
2 MIBs in the profile, one for each of the above contexts.

To represent this in the IR, we propose 2 new metadata attachment types, as
described below.
Callsite metadata

The !callsite metadata is used to associate a callsite with its
corresponding references in MIB stack contexts. It contains the associated
64-bit stack entry table value for that callsite from the indexed profile,
and is initially only on non-allocation callsites. As will be described
later, after inlining it can contain multiple entry ids or be propagated
onto allocation callsites.

In the above example, for the call to foo(), which had stack entry id 123,
the IR callsite would be decorated with a !callsite metadata containing
that stack entry id:

  tail call void @_Z3foov(), !dbg !12, !callsite !14

!14 = !{i64 123}

Note that this call may be in a different module initially than the
referencing MIB metadata. In order to disambiguate the context across
modules, some form of LTO would be required. ThinLTO summary support will
be added to reflect the cross-module contexts and enable cross module
optimization of the contexts.

Also, while for MemProf the ids will be assigned uniquely using information
from the MemProf profile, other types of context sensitive profiles could
simply reuse the same id after matching with line table information, or at
least leverage the same metadata attachment to assign their own unique ids
if there is no MemProf profile.
Memprof metadata

The !memprof metadata describes the MIBs for the leaf allocation callsite
it is attached to. If there are multiple stack contexts leading to that
allocation, it will have a single !memprof metadata attachment, with a
level of indirection used to list all related MIB, as shown in the later
example.

As with the indexed profile format, we need to be able to add or modify
fields of the MIB entries while maintaining backwards compatibility with
older bitcode. Therefore, we use a schema format with the MIB profile entry
fields described by a “Memprof Schema” module level metadata, for example:

!llvm.module.flags = !{!1}

!1 = !{i32 1, !”Memprof Schema”,!”Stack”, !”AllocCount”, !”AveSize”,
!”MinSize”, !”MaxSize”, !”AveAccessCount”, !”MinAccessCount”,
!”MaxAccessCount”, !”AveLifetime”, !”MinLifetime”, !”MaxLifetime”,
!”NumMigration”, !”NumLifetimeOverlaps” }

The first (merge behavior) field is 1 (ModFlagBehavior::Error), meaning
that it is an error to merge modules with different values, or in other
words, merging modules compiled with different profiles generated with
different versions of the indexed profile format.

Assume we are using the schema shown in the above module flag metadata. In
the earlier example, where allocation was reached by 2 different profile
contexts (and therefore had 2 MIBs), the !memprof metadata would be
structured similar to the following:

  %call = tail call noalias align 16 dereferenceable_or_null(4) i8*
@malloc(i64 4) #4, !dbg !261, !memprof !13

!13 = !{!14, !16}

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32
20, i32 40, i32 0, i32 0}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32
10, i32 10, i32 0, i32 0}

The first memprof metadata entry references another metadata containing the
call stack context, which contains entries that as described earlier are
unique 64-bit values used to identify stack entries in the indexed profile.
These stack entry ids enable correlation with the !callsite metadata
attached to calls, particularly across modules. Note that we don’t need to
include a stack entry id for the allocation callsite, since the metadata is
already correlated with it by attachment. Therefore, the contexts for the
two MIB shown here would contain the list of ids described for the example,
minus 567 for the allocation callsite. So “345, 234, 123” and “456, 234,
123”.

This call stack metadata (e.g. !15 and !17 in the above example), should
identify the stack entries in order from the leaf allocation’s caller to
the root caller. There are several options for organizing this metadata.
One simple possibility is to list all stack entry ids exhaustively in each
call stack. For example:

!15 = !{i64 456, i64 234, i64 123}

!17 = !{i64 345, i64 234, i64 123}

However, as can seen by the above example, this results in a lot of
duplication. Instead, we will organize the call stack metadata as a chain
of metadata, one stack entry id per metadata, with a second operand
pointing to the next (caller) stack entry in the chain. This allows for
deduplication in the above case, which then looks like:

!15 = !{i64 456, !18}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

While in this toy example the chaining results in overall more metadata
instructions and operands, in reality the contexts are much longer, and as
we will show later, the opportunities to deduplicate stack entries is large
in practice.

We may not be able to completely deduplicate all instances of a particular
stack entry id. If we additionally had a call stack with entries 567, 234,
678, 123, although 234 is the same id shown in the earlier call stacks, it
has a different caller (678) so we cannot share the same metadata node used
for 234 in the above metadata. We will instead need to essentially
duplicate it, like:

!20 = !{i64 567, !21}

!21 = !{i64 234, !22}

!22 = !{i64 678, !19}

The last entry above can and does point to the same (root stack entry)
metadata !19 used for entry 123 by the earlier call stacks.

A measurement of the effectiveness of this deduplication strategy for 3
large internal applications showed we could reduce the number of stack
entry ids by 75-88%. This was measured across the entire profile, so the
deduplication effectiveness will be smaller within each IR module, where
there are only a subset of call stacks with which to deduplicate.
Additionally, the deduplication strategy requires additional Metadata nodes
and operands for the caller Metadata pointers. Considering these effects
leads to estimates of 12-60% size overhead reductions, which has a smaller
lower bound but wider distribution. However, the measurements suggest this
strategy will lead to a net reduction in overhead.
Inlining

When calls are inlined after annotation, the relevant !callsite metadata
can be merged, with the ordering of the entries metadata implying the
inlined callee->caller relationships, and if relevant added onto the
inlined allocation callsite (which will then have both !callsite and
!memprof metadata.

For example, assuming we have the following IR snippets for the original
example (for simplicity, all in the same Module):

define dso_local i32 @main() local_unnamed_addr #0 !dbg !80 { …

  tail call void @_Z3foov(), !dbg !121, !callsite !1

  …

define dso_local void @_Z3foov() local_unnamed_addr #0 !dbg !81 { …

  tail call void @_Z3bazv(), !dbg !111, !callsite !2

  …

define dso_local void @_Z3bazv() local_unnamed_addr #0 !dbg !82 { …

  ; if (x)

  tail call void @_Z3barv(), !dbg !11, !callsite !3

  …

  ; else

  tail call void @_Z3barv(), !dbg !12, !callsite !4

  …

define dso_local void @_Z3barv() local_unnamed_addr #0 !dbg !258 { …

  %call = tail call noalias align 16 dereferenceable_or_null(4) i8*
@malloc(i64 4) #4, !dbg !261, !memprof !13

  …

!1 = !{i64 123}

!2 = !{i64 234}

!3 = !{i64 345}

!4 = !{i64 456}

!13 = !{!14, !16}

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32
20, i32 40, i32 0, i32 0}

!15 = !{i64 456, !18}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32
10, i32 10, i32 0, i32 0}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

If we inlined bar() into both calls to baz(), and then into foo() and from
there into main() - i.e. all the way up to the root of the call stack - the
resulting allocation calls which are eventually inlined into main() would
look like:

define dso_local i32 @main() local_unnamed_addr #0 !dbg !8 { …

  ; if (x)

  %call = tail call noalias align 16 dereferenceable_or_null(4) i8*
@malloc(i64 4) #4, !dbg !26, !memprof !16, !callsite !17

  …

  ; else

  %call = tail call noalias align 16 dereferenceable_or_null(4) i8*
@malloc(i64 4) #4, !dbg !27, !memprof !14, !callsite !15

…

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32
20, i32 40, i32 0, i32 0}

!15 = !{i64 456, !18}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32
10, i32 10, i32 0, i32 0}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

In the above case since we inlined all the way up to root main, the merged
!callsite metadata on each inlined allocation callsite is the same as the
list in the “Stack” metadata on the associated !memprof metadata, so we can
reuse them (!17 and !15) on the inlined allocation !callsite metadatas.

Note that the original allocation had multiple memprof metadata
corresponding to different stack contexts. When we inline, we should prune
those from the inlined allocation call that do not start with the stack
subsequence described in the concatenated stack entry profile metadata. In
the above example, the inlined callsites have contexts implied by their new
!callsite metadata that match exactly one of the original MIBs, and we
prune the other. If we had not inlined all the way up into the root main,
we might have multiple MIB entries whose stack contexts include the inlined
stack entry sequence in the !callsite metadata and they should all be kept
on the inlined allocation callsite.

As another example, if we only inlined bar into baz the inlined allocations
in baz would look like:

define dso_local void @_Z3bazv() local_unnamed_addr #0 !dbg !82 { …

  ; if (x)

  %call = tail call noalias align 16 dereferenceable_or_null(4) i8*
@malloc(i64 4) #4, !dbg !26, !memprof !16, !callsite !3

  …

  ; else

  %call = tail call noalias align 16 dereferenceable_or_null(4) i8*
@malloc(i64 4) #4, !dbg !27, !memprof !14, !callsite !4

  …

!3 = !{i64 345}

!4 = !{i64 456}

!13 = !{!14, !16}

!14 = !{!15, i32 2, i32 4, i32 4, i32 4, i64 10, i64 5, i64 15, i32 30, i32
20, i32 40, i32 0, i32 0}

!15 = !{i64 456, !18}

!16 = !{!17, i32 1, i32 4, i32 4, i32 4, i64 5, i64 1, i64 10, i32 10, i32
10, i32 10, i32 0, i32 0}

!17 = !{i64 345, !18}

!18 = !{i64 234, !19}

!19 = !{i64 123}

In this case the !callsite metadata on the inlined allocations is the same
as that on the original calls to bar(), i.e. a single stack entry id each,
since we haven’t inlined across multiple calls with !callsite metadata. But
we can still prune the MIB with stack contexts that don’t start with that
of the !callsite metadata on the inlined call.

And alternatively if we only inlined foo into main, main’s call of bar
would now have !callsite metadata that has a child pointer to the !callsite
metadata of the now inlined call to foo() (!1):

define dso_local i32 @main() local_unnamed_addr #0 !dbg !8 {

…

  tail call void @_Z3barv(), !dbg !11, !callsite !2

…

}

!1 = !{i64 123}

!2 = !{i64 234, !1}

Keeping the !callsite metadata from the inlined callsites and concatenating
them in an upward call stack order will be important when we later
determine what type of cloning or other context disambiguating
optimizations are required.

[1] RFC: Sanitizer-based Heap Profiler (
https://lists.llvm.org/pipermail/llvm-dev/2020-June/142744.html)

[2] RFC: A binary serialization format for MemProf (
https://lists.llvm.org/pipermail/llvm-dev/2021-September/153007.html)
[3] Implement interfaces for providing access frequency hints to TCMalloc (
https://github.com/google/tcmalloc/commit/ab87cf382dc56784f783f3aaa43d6d0465d5f385
)

-- 
Teresa Johnson |  Software Engineer |  tejohnson at google.com |
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20211027/651d930c/attachment-0001.html>