[llvm-dev] RFC: A binary serialization format for MemProf

Thu Oct 7 12:59:28 PDT 2021

Thanks for the reply and clarification. Having a single combined IR instrumentation and PGHO instrumentation sounds good.

I’m also wondering if you have any data you could share that shows the overall benefit of memprof-driven optimization since the last RFC, perhaps from an early prototype on a small or synthetic workload? I ask because, even though this all looks promising, non-trivial complexity is being added in several places: runtime support, the binary format, and later the profile loader and the optimization itself.

Thanks,
Wenlei

From: Snehasish Kumar <snehasishk at google.com>
Date: Thursday, October 7, 2021 at 12:06 PM
To: Xinliang David Li <davidxl at google.com>
Cc: Wenlei He <wenlei at fb.com>, llvm-dev <llvm-dev at lists.llvm.org>, Vedant Kumar <vsk at apple.com>, andreybokhanko at gmail.com <andreybokhanko at gmail.com>, Teresa Johnson <tejohnson at google.com>, Hongtao Yu <hoy at fb.com>
Subject: Re: RFC: A binary serialization format for MemProf

Hi Wenlei,

Thanks for taking a look! Added responses inline.

On Thu, Oct 7, 2021 at 9:29 AM Xinliang David Li <davidxl at google.com> wrote:
>
> Just a quick note -- the IRPGO profile is not deterministic for multi-threaded programs due to contention (there is of course an atomic update mode, but it can be slow). Asynchronous dumping is another reason that the profile is not guaranteed to be repeatable.
>
> David
>
> On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wenlei at fb.com> wrote:
>>
>> Thanks for sharing the progress and details on the binary format. Overall this looks like a clean design that fits current PGO profile format with extensions.
>>
>>
>>
>> Some high level comments:
>>
>>
>>

Our focus is to have a single combined IR instrumentation and PGHO
instrumentation phase to keep operational costs low. For CSPGO today,
this would be the second IR instrumentation phase. We also intend to
support a separate PGHO instrumentation phase.
>> Does memprof/PGHO work together with today's IRPGO, i.e. can we have one instrumented build to collect both PGO and PGHO profiles, or will we need separate instrumentation builds for each? In the latter case, CSPGO + PGHO would need three iterations of training and build, which would be a significant operational cost.

Yes, the context tracker is quite relevant to the IR matching need.
Teresa will share the detailed design soon and we can evaluate the
benefit of reusing the existing logic for CSSPGO. I think this is
orthogonal to this RFC (serialization format) so we can defer to the
next one for a detailed discussion.
>> I think some of the problems memprof faces when storing calling context and mapping context to IR are very similar to those in CSSPGO. I'm wondering if it makes sense to promote some existing infrastructure to be more general, beyond just serving CSSPGO. One example is the IR mapping you mentioned (quoted below). In CSSPGO we have the exact same need, and it's handled by `SampleContextTracker`, which queries a context trie using an instruction/DILocation.
>>
>>
>>
>>           >  Because the MIB corresponding to the A->B context is associated with function B in the profile, we do not find it by looking at function A’s profile when we see function A’s malloc call during matching. To address this we need to keep a correspondence from debug locations to the associated profile information.
>>
>>
>>
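
To make the quoted matching problem concrete, here is a minimal sketch of a location-keyed index. It is not the MemProf or `SampleContextTracker` implementation; the profile layout, the `build_location_index` helper, and the file names and line numbers are all hypothetical and chosen only for illustration. The idea is that records stored under one function (B) can still be found from the debug location of the allocation call that appears in another function (A).

```python
from collections import defaultdict

# Hypothetical profile layout: owner function -> list of (call stack, MIB).
# Each stack is leaf-first; frames are (file, line, discriminator) tuples.
profile = {
    "B": [
        ([("a.cc", 12, 0), ("b.cc", 40, 0)], {"alloc_count": 3}),
    ],
}

def build_location_index(profile):
    """Index every record by the debug location of its leaf frame."""
    index = defaultdict(list)
    for owner, records in profile.items():
        for stack, mib in records:
            leaf = stack[0]
            index[leaf].append((owner, stack, mib))
    return index

index = build_location_index(profile)

# The malloc call seen in function A carries the location a.cc:12, so the
# lookup succeeds even though the record is stored under function B.
matches = index[("a.cc", 12, 0)]
```

A context trie would share common stack prefixes instead of storing each stack in full; the flat dictionary here only shows the lookup direction.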

We intend to retain as much of the calling context information as
possible until IR matching. This is where we can leverage common
solutions. We would be happy to generalize where appropriate and intend
to tackle this topic in detail in the next RFC.
>> The serialization and pruning of calling context are also examples of shared problems, and we've put in some effort to build effective solutions (e.g. the offline preinliner for the most effective pruning, which I think could be adapted to help keep the most important allocation contexts). Perhaps some of the frameworks can be merged, so that LLVM has general context-aware PGO support that can be leveraged by different kinds of PGO (IRPGO, PGHO, CSSPGO). If you think this is worth pursuing, we’d be happy to help too.
>>
>>
>>
>> More on the details:
>>
>>
>>
As David mentioned, keeping the PGHO profile deterministic is a
non-goal since IR PGO profile is non-deterministic.
>> I saw that MemInfoBlock contains alloc/dealloc cpuid. Does that make the memprof profile non-deterministic, in the sense that running memprof twice on the exact same program and input would yield bit-wise different memory profiles? I think the IR PGO profile is deterministic?
>>
>>
>>
We need to use the file path instead of the function to be able to
distinguish COMDAT functions. The line_offset based matching is more
resilient if the entire function is moved. I think it's a good idea
and we can incorporate it into the IR matching phase.
>> Why do we use `file:line:discriminator` instead of `func:line_offset:discriminator`? The latter would be more resilient to source changes. If the function name string is too long, we could perhaps leverage the MD5 encoding used by sample PGO?
>>
>>
>>
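
The difference between the two keys can be sketched as follows. This is only an illustration under stated assumptions: `func_id`, `file_key`, and `func_key` are hypothetical helpers, the line numbers are made up, and taking the first 8 bytes of the MD5 digest is one possible way to get a compact id, not necessarily the exact encoding used by sample PGO.

```python
import hashlib

def func_id(name: str) -> int:
    """Hypothetical 64-bit id: first 8 bytes of the MD5 digest, little-endian."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:8], "little")

def file_key(path: str, line: int, disc: int):
    """Key in the `file:line:discriminator` style (absolute line)."""
    return (path, line, disc)

def func_key(name: str, func_start_line: int, line: int, disc: int):
    """Key in the `func:line_offset:discriminator` style (relative line)."""
    return (func_id(name), line - func_start_line, disc)

# Before an edit: foo starts at line 10 and allocates at line 14.
before_file = file_key("a.cc", 14, 0)
before_func = func_key("foo", 10, 14, 0)

# After five unrelated lines are inserted above foo, everything shifts by 5.
after_file = file_key("a.cc", 19, 0)
after_func = func_key("foo", 15, 19, 0)

print(before_file == after_file)  # absolute-line key no longer matches
print(before_func == after_func)  # offset-based key still matches
```

Note that a key built only from the function name cannot tell apart two same-named functions defined in different files, which is the COMDAT concern raised in the reply above.
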
While we only intend to support Memprof optimizations for the main
binary, retaining all executable mappings allows future analysis tools
to symbolize shared library code.
>> Is the design of the mmap section (quoted below) trying to support memprof for multiple binaries in the same process at the same time, or is it mainly for handling multiple non-consecutive executable segments of a single binary?
>>
>>
>>
>>            > The process memory mappings for the executable segment during profiling are stored in this section. This allows symbolization during post processing for binaries which are built with position independent code. For now all read-only, executable mappings are recorded; however, in the future, mappings for heap data can also potentially be stored.
>>
>>
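
As a rough illustration of why the mappings are needed for position-independent code, the sketch below turns a runtime address into a module path and file offset. The `Mapping` fields, the `to_file_offset` helper, and the addresses are assumptions made for this example and do not describe the actual section layout; a real symbolizer would also translate the file offset through the binary's program headers.

```python
from typing import List, NamedTuple, Optional, Tuple

class Mapping(NamedTuple):
    start: int   # runtime start address of the mapped segment
    end: int     # runtime end address (exclusive)
    offset: int  # file offset the segment was mapped from
    path: str    # module backing the mapping

def to_file_offset(addr: int, mappings: List[Mapping]) -> Optional[Tuple[str, int]]:
    """Return (module path, file offset) for addr, or None if unmapped."""
    for m in mappings:
        if m.start <= addr < m.end:
            return m.path, addr - m.start + m.offset
    return None

# Hypothetical mappings recorded at profiling time.
mappings = [
    Mapping(0x555555554000, 0x555555560000, 0x0, "/bin/app"),
    Mapping(0x7FFFF7A00000, 0x7FFFF7B00000, 0x1000, "/lib/libfoo.so"),
]

main_hit = to_file_offset(0x555555555234, mappings)  # address in the main binary
lib_hit = to_file_offset(0x7FFFF7A00010, mappings)   # address in a shared library
miss = to_file_offset(0x1000, mappings)              # not in any recorded mapping
```

Keeping the shared-library mapping costs one extra entry, and it is what lets a later tool resolve `lib_hit` at all.
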
Yes, we do intend to support Memprof profile section merging via
`llvm-profdata merge`. The schema overhead per function is low now, so
we opted for function granularity. We can revisit if the overheads are
high or if the IR metadata scheme intends to keep it at module
granularity (in which case we don't need the extra fidelity).
>> Do we need each function record to have its own schema? Do we expect different functions to use different versions/schemas? This is very flexible, but I'm wondering what the use case is. If the schema is for compatibility across versions, perhaps a file-level schema would be enough?
>>
>>
>>
>>             > The InstrProfRecord for each function will hold the schema and an array of Memprof info blocks, one for each unique allocation context.
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Wenlei
>>