[llvm-dev] RFC: A binary serialization format for MemProf

Snehasish Kumar via llvm-dev llvm-dev at lists.llvm.org
Mon Oct 4 18:42:52 PDT 2021


Hi Hongtao,

Consider the following example with two contexts -

foo // This function is hot
   bar
       malloc()
baz // This function is cold
   bar
       malloc()

The profile loader will annotate the call to malloc() in the IR with
two contexts and their characteristics. Since one context is hot and
the other is cold, their characteristics differ (as David noted) and
we will not merge the contexts during profile processing. Now there
are a few ideas on how the allocator can determine whether this is a
hot or cold allocation at runtime --

1. Static deduplication via cloning - we can clone bar and rewrite the
call to malloc with a special call that indicates it is cold. The
second example above would then look like --
baz
   bar_cold
       malloc_cold()
While this involves code duplication, potentially increasing
icache/itlb footprint, for sufficiently cold contexts we can tune the
threshold so that the benefit outweighs the cloning cost.

2. Parameterization - we can parameterize bar to carry additional
information indicating that the current context is cold. The code would
then look like this --
baz
  bar_parameterized (/*is_cold_context=*/ true)
    if (is_cold_context) malloc_cold()
    else malloc()
This leads to code bloat on hot paths. It can also require a large
amount of parameterization when cold contexts interleave, increasing
register pressure along hot paths. An optimized approach may be able to
pack the information using a compact encoding.

3. Runtime calling context identification - As you suggested, the
allocator can identify the heap object using the calling context. An
implementation might look like this --
baz
  bar
     malloc()
        id = get_context()
        if (is_context_cold(id)) malloc_cold()
        else ...
I believe the overhead of this approach is fairly high since context
identification happens at each dynamic call. E.g., Sumner et al.
measured the overhead to be ~2% overall for medium-sized programs in
"Precise Calling Context Encoding". We anticipate that runtime
identification of calling contexts on large workloads would be
prohibitively expensive.

Note that these are just a few ideas and we are currently leaning
towards (1). Happy to hear about any motivating data you may have for
these approaches, though an in-depth discussion of this should
probably be reserved for an RFC which Teresa will share soon.

On Mon, Oct 4, 2021 at 5:52 PM Hongtao Yu <hoy at fb.com> wrote:
>
> Hi Snehasish, Teresa and David,
>
> Thanks for the information. I have another question about the optimized (pass2) build. Does the runtime heap allocator identify a heap object using calling contexts too? Would some sort of virtual unwinding plus processing of debug inline contexts be needed?
>
> Thanks,
> Hongtao
>
> ________________________________
> From: Snehasish Kumar <snehasishk at google.com>
> Sent: Monday, October 4, 2021 5:37 PM
> To: Hongtao Yu <hoy at fb.com>
> Cc: Teresa Johnson <tejohnson at google.com>; Andrey Bokhanko <andreybokhanko at gmail.com>; llvm-dev <llvm-dev at lists.llvm.org>; Vedant Kumar <vsk at apple.com>; Wenlei He <wenlei at fb.com>; David Li <davidxl at google.com>
> Subject: Re: RFC: A binary serialization format for MemProf
>
> Hi Hongtao,
>
> > How are recursive allocation contexts stored? Wondering if there’s any recursive compression performed. For example, a tree-based construction algorithm may create tree nodes recursively. Is each tree node object modeled by its unique dynamic context?
> There is no special handling of recursive calling contexts; we store the entire unique dynamic calling context as the identifier.
>
> > Will the contexts of a leaf function merged during compilation when the leaf function is not inlined? If so, where does the merging happen?
> During compilation, each allocation site may be annotated with one or more heap allocation info blocks, each identified by a unique dynamic calling context. We will not merge heap profile information across unique contexts, since one of our immediate goals is to distinguish between hot and cold allocation contexts. The mechanisms to distinguish the allocation contexts involve cloning or parameterization, and Teresa will present the details in an upcoming RFC.
>
>
>
> On Mon, Oct 4, 2021 at 8:53 AM Than McIntosh <thanm at google.com> wrote:
>
>
> >>I don't think the gc compiler even involves llvm as it is written in Go.
>
> Correct.
>
> >>I'm not personally very familiar with Go compiler toolchains and their roadmaps, but Than can probably comment.
>
> I don't see any reason why something similar to what Teresa and Snehasish are proposing couldn't be implemented for the Go gc-based toolchain (with a significant amount of effort) -- from my reading it looks fairly language independent.
>
> True, as previously pointed out, the gc-based Go toolchain currently doesn't support ASAN and lacks any sort of PGO/FDO capability, but this is not written in stone.  FDO support, along with improving the compiler back end to exploit profile data (via inlining, basic block layout, etc) is something that could be added if need be. Go's priorities have simply been different from those of C/C++.
>
> >IMHO, there is an intrinsic value of data formats being unified among different toolchains -- as very well demonstrated by DWARF
>
> Comparison with DWARF seems a bit odd here. I agree that unified formats can be useful, but I would point out that there is a great deal of administrative overhead associated with standards like DWARF (committee meetings, heavyweight processes for reaching consensus on new features, release cycles measured in years, etc).
>
> Go (for example) uses its own object file format, as opposed to using an existing standard format (e.g. ELF or PE/COFF).  The ability to modify and evolve the object file format is a huge enabler when it comes to rolling out new features.  It was a key element in the last two big Go projects I've worked on; had we been stuck with an existing object file format, the work would have been much more difficult.
>
> Than
>
> On Mon, Oct 4, 2021 at 10:55 AM Teresa Johnson <tejohnson at google.com> wrote:
>
> +Than McIntosh again to comment on the gc question below.
>
> On Mon, Oct 4, 2021 at 2:38 AM Andrey Bokhanko <andreybokhanko at gmail.com> wrote:
>
> Thanks Teresa and others for the clarification!
>
> On Fri, Oct 1, 2021 at 8:32 PM Teresa Johnson <tejohnson at google.com> wrote:
>
> I was going to respond similarly, and add a note that it isn't clear that gollvm (LLVM-based Go compiler) supports either PGO or the sanitizers, so that may be more difficult than Rust which does. As Snehasish notes, we are focused on C/C++, but this will all be done in the LLVM IR level and should be language independent in theory.
>
>
> Let me note that I specifically meant gc (Google's standard Go compiler), not gollvm. IMHO, there is an intrinsic value of data formats being unified among different toolchains -- as very well demonstrated by DWARF.
>
> (Yes, I'm aware that gc doesn't support even long-established instruction profiling. One of the reasons is the apparent lack of implemented optimizations that can directly benefit from profiling. In the case of memory profiling, the use case is clear. Also, given that BOLT helps Go a lot (up to +20% speed-up on our internal tests), I expect the same for memory profiling, which would warrant extending gc's capabilities to use the MemProf format.)
>
>
> I don't think the gc compiler even involves llvm as it is written in Go. So that's definitely outside the scope of our work. I'm not personally very familiar with Go compiler toolchains and their roadmaps, but Than can probably comment.
>
> Teresa
>
>
> Yours,
> Andrey
>
>
> Teresa
>
> On Fri, Oct 1, 2021 at 10:25 AM Snehasish Kumar <snehasishk at google.com> wrote:
>
> Hi Andrey,
>
> The serialization format is language independent, though our focus is C/C++. Note that our instrumentation is based on the LLVM sanitizer infrastructure and should work for Rust, which supports building with sanitizers [1]. We have not considered using the data profile for non-C/C++ code.
>
> Regards,
> Snehasish
>
> [1] https://doc.rust-lang.org/beta/unstable-book/compiler-flags/sanitizer.html
>
> On Fri, Oct 1, 2021 at 9:14 AM Andrey Bokhanko <andreybokhanko at gmail.com> wrote:
>
> Hi Snehasish, David and Teresa,
>
> I'm really glad to see the steady progress in this area!
>
> It looks like the format is pretty much language independent
> (correct?) -- so it can be applied not only to C/C++, but other
> languages (Rust) and even toolchains (Go) as well? If you have already
> considered using data profile for non-C/C++, may I kindly ask you to
> share your thoughts on this?
>
> Yours,
> Andrey
> ===
> Advanced Software Technology Lab
> Huawei
>
> On Thu, Sep 30, 2021 at 1:17 AM Snehasish Kumar <snehasishk at google.com> wrote:
> >
>
>
>
> --
> Teresa Johnson |  Software Engineer |  tejohnson at google.com |
>
>
