[llvm-dev] RFC: A binary serialization format for MemProf

Fri Oct 8 16:39:08 PDT 2021

On Fri, Oct 8, 2021 at 9:42 AM Snehasish Kumar <snehasishk at google.com>
wrote:

> Hi Wenlei,
>
> We are still working on an end to end prototype and do not have any
> data to share at this time. Our work is motivated by manual tuning of
> a large internal workload which leverages tcmalloc support for hotness
> based memory pooling (to be open sourced soon). With Memprof,
> preliminary analyses indicate we can automatically cover all manually
> identified cases and identify more opportunities. In the future we aim
> to support more hot-cold memory splitting optimizations such as the
> schemes described in "Software-Defined Far Memory in Warehouse-Scale
> Computers" ASPLOS 2019. We look forward to sharing more data and a
> prototype once the Memprof IR annotation and optimization consumer RFC
> has been vetted.
>

Yes, the initial goal targets coarse-grained smart placement for
locality and memory savings. Longer term, it will also handle fine-grained
placement strategies, which will require more extensive interaction with the
memory allocator layer.

David

>
> Regards,
> Snehasish
>
>
>
>
> On Thu, Oct 7, 2021 at 1:00 PM Wenlei He <wenlei at fb.com> wrote:
> >
> > Thanks for the reply and clarification. Having a single combined IR
> instrumentation and PGHO instrumentation sounds good.
> >
> >
> >
> > I’m also wondering if you have any data you could share that shows the
> overall benefit of memprof-driven optimization since the last RFC, perhaps with
> some early prototype and on a small/synthetic workload? Asking because even
> though this all looks promising, from runtime support to binary format to the
> later profile loader and optimization, there’s non-trivial complexity being
> added to a few places.
> >
> >
> >
> > Thanks,
> >
> > Wenlei
> >
> >
> >
> > From: Snehasish Kumar <snehasishk at google.com>
> > Date: Thursday, October 7, 2021 at 12:06 PM
> > To: Xinliang David Li <davidxl at google.com>
> > Cc: Wenlei He <wenlei at fb.com>, llvm-dev <llvm-dev at lists.llvm.org>,
> Vedant Kumar <vsk at apple.com>, andreybokhanko at gmail.com <
> andreybokhanko at gmail.com>, Teresa Johnson <tejohnson at google.com>, Hongtao
> Yu <hoy at fb.com>
> > Subject: Re: RFC: A binary serialization format for MemProf
> >
> > Hi Wenlei,
> >
> > Thanks for taking a look! Added responses inline.
> >
> > On Thu, Oct 7, 2021 at 9:29 AM Xinliang David Li <davidxl at google.com>
> wrote:
> > >
> > > Just a quick note -- the IRPGO profile is not deterministic with
> multi-threaded programs due to contention (there is of course atomic
> update mode, but it can be slow). Asynchronous dumping is another reason
> that the profile is not guaranteed to be repeatable.
> > >
> > > David
> > >
> > > On Thu, Oct 7, 2021 at 9:18 AM Wenlei He <wenlei at fb.com> wrote:
> > >>
> > >> Thanks for sharing the progress and details on the binary format.
> Overall this looks like a clean design that fits the current PGO profile format
> with extensions.
> > >>
> > >>
> > >>
> > >> Some high level comments:
> > >>
> > >>
> > >>
> >
> > Our focus is to have a single combined IR instrumentation and PGHO
> > instrumentation phase to keep operational costs low. For CSPGO today,
> > this would be the second IR instrumentation phase. We also intend to
> > support a separate PGHO instrumentation phase.
> > >> Does memprof/PGHO work together with today's IRPGO, i.e. can we
> have one instrumented build to collect both PGO and PGHO profiles, or will
> we need separate PGO instrumentation builds for each, in which case CSPGO
> + PGHO would need three iterations of training and build, which would be a
> significant operational cost?
> >
> > Yes, the context tracker is quite relevant to the IR matching need.
> > Teresa will share the detailed design soon and we can evaluate the
> > benefit of reusing the existing logic for CSSPGO. I think this is
> > orthogonal to this RFC (serialization format) so we can defer to the
> > next one for a detailed discussion.
> > >> I think some of the problems memprof faced when dealing with storing
> calling context and mapping context to IR are very similar to those in CSSPGO. I'm
> wondering if it makes sense to promote some existing infrastructure to be
> more general beyond just serving CSSPGO. One example is the IR mapping you
> mentioned (quoted below). In CSSPGO, we have the exact same need, and it's
> handled by `SampleContextTracker` which queries a context trie using an
> instruction/DILocation.
> > >>
> > >>
> > >>
> > >>           >  Because the MIB corresponding to the A->B context is
> associated with function B in the profile, we do not find it by looking at
> function A’s profile when we see function A’s malloc call during matching.
> To address this we need to keep a correspondence from debug locations to
> the associated profile information.
> > >>
> > >>
> > >>
> >
> > We intend to retain as much of the calling context information as
> > possible until IR matching. This is where we can leverage common
> > solutions. We would be happy to generalize where appropriate and intend
> > to tackle this topic in detail in the next RFC.
> > >> The serialization of calling context and the pruning of calling context are
> also examples of shared problems, and we've put in some effort to have
> effective solutions (e.g. offline preinliner for most effective pruning,
> which I think could be adapted to help keep the most important allocation
> contexts). Perhaps some of the frameworks can be merged, so LLVM has general
> context-aware PGO support that can be leveraged by different kinds of PGO
> (IRPGO, PGHO, CSSPGO). If you think this is worth pursuing, we’d be happy
> to help too.
> > >>
> > >>
> > >>
> > >> More on the details:
> > >>
> > >>
> > >>
> > As David mentioned, keeping the PGHO profile deterministic is a
> > non-goal since IR PGO profile is non-deterministic.
> > >> I saw that MemInfoBlock contains alloc/dealloc cpuid, does that make
> memprof profile non-deterministic in the sense that running memprof twice
> on the exact same program and input would yield bit-wise different memory
> profile? I think IR PGO profile is deterministic?
> > >>
> > >>
> > >>
> > We need to use the file path instead of the function to be able to
> > distinguish COMDAT functions. The line_offset based matching is more
> > resilient if the entire function is moved; I think it's a good idea
> > and we can incorporate it into the IR matching phase.
> > >> Why do we use `file:line:discriminator` instead of
> `func:line_offset:discriminator`? The latter would be more resilient to
> source changes. If the function name string is too long, we could perhaps
> leverage the MD5 encoding used by sample PGO?
> > >>
> > >>
> > >>
> > While we only intend to support Memprof optimizations for the main
> > binary, retaining all executable mappings allows future analysis tools
> > to symbolize shared library code.
> > >> Is the design of mmap section (quoted below) trying to support
> memprof for multiple binaries in the same process at the same time, or
> mainly for handling multiple non-consecutive executable segments for a
> single binary?
> > >>
> > >>
> > >>
> > >>            > The process memory mappings for the executable segment
> during profiling are stored in this section. This allows symbolization
> during post processing for binaries which are built with position
> independent code. For now all read only, executable mappings are recorded,
> however in the future, mappings for heap data can also potentially be
> stored.
> > >>
> > >>
> > Yes, we do intend to support Memprof profile section merging via
> > `llvm-profdata merge`. The schema overhead per function is low now, so
> > we opted for function granularity. We can revisit if the overheads are
> > high or if the IR metadata scheme intends to keep it at module
> > granularity (in which case we don't need the extra fidelity).
> > >> Do we need each function record to have its own schema; do we expect
> different functions to use different versions/schemas? This is very
> flexible, but wondering what’s the use case. If the schema is for
> compatibility across versions, perhaps a file-level schema would be enough?
> > >>
> > >>
> > >>
> > >>             > The InstrProfRecord for each function will hold the
> schema and an array of Memprof info blocks, one for each unique allocation
> context.
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> Thanks,
> > >>
> > >> Wenlei
> > >>
>
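
To illustrate the `file:line:discriminator` versus
`func:line_offset:discriminator` question raised in the thread, here is a
minimal sketch. It is not taken from the MemProf implementation: the names
`FileKey`, `FuncKey`, `name_hash` and `func_key` are hypothetical, and
truncating an MD5 digest to 64 bits is only an assumption about how a long
function name might be shortened.

```python
import hashlib
from typing import NamedTuple


class FileKey(NamedTuple):
    """Location keyed by absolute source line (file:line:discriminator)."""
    file: str
    line: int
    discriminator: int


class FuncKey(NamedTuple):
    """Location keyed relative to the enclosing function's first line."""
    func_hash: int      # hypothetical: 64 bits taken from MD5 of the name
    line_offset: int    # line - first line of the enclosing function
    discriminator: int


def name_hash(func_name: str) -> int:
    # Assumption: use the first 8 bytes of the MD5 digest as a compact id.
    digest = hashlib.md5(func_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little")


def func_key(func_name: str, func_start_line: int, line: int,
             discriminator: int = 0) -> FuncKey:
    return FuncKey(name_hash(func_name), line - func_start_line, discriminator)


# Profiled build: foo() starts at line 10 and allocates at line 14.
old_file_key = FileKey("a.cc", 14, 0)
old_func_key = func_key("foo", 10, 14)

# Later build: unrelated code was added above foo(), shifting it down 15 lines.
new_file_key = FileKey("a.cc", 29, 0)
new_func_key = func_key("foo", 25, 29)

print(old_file_key == new_file_key)  # absolute-line key no longer matches
print(old_func_key == new_func_key)  # offset-based key still matches
```

The sketch ignores the COMDAT concern mentioned in the reply: two functions
with the same name in different files would hash to the same `func_hash`
here, which is the case the file path is meant to distinguish.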