[llvm-dev] [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Wei Mi via llvm-dev llvm-dev at lists.llvm.org
Fri Aug 7 21:55:58 PDT 2020


On Fri, Aug 7, 2020 at 6:18 PM Wenlei He <wenlei at fb.com> wrote:

> Thanks for the feedback and questions, Wei. See my replies inline.
>
>
>
> *From: *Wei Mi <wmi at google.com>
> *Date: *Friday, August 7, 2020 at 5:32 PM
> *To: *Wenlei He <wenlei at fb.com>
> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, Xinliang David
> Li <davidxl at google.com>, Hongtao Yu <hoy at fb.com>
> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
> Pseudo-Instrumentation
>
>
>
> Thanks for the proposal and the performance improvement over existing
> AutoFDO is impressive.
>
>
>
> On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wenlei at fb.com> wrote:
>
> Hi All,
>
> Our team at Facebook is building a new context-sensitive Sample PGO as an
> alternative to the existing AutoFDO. We’d like to share our motivation,
> propose a new design, and reveal preliminary results on benchmarks. We will
> refer to the proposed design as CSSPGO in this RFC.
>
>
>
> The new CSSPGO leverages simultaneous LBR and stack sampling to construct
> a full context-sensitive profile. It doesn’t rely on previous inlining like
> today’s AutoFDO to get context-sensitive profile, and it also doesn’t need
> a separate post-inline context-sensitive profile like CSPGO. In addition,
> we introduced pseudo-instrumentation for more accurate mapping from binary
> samples back to IR, similar to instrumentation PGO, but without the
> measurable runtime overhead that is usually associated with
> instrumentation.
>
>
>
>
>
> We have a functioning implementation for the new CSSPGO now. Initial
> results on SPEC2006 show a ~2% geomean performance win on top of AutoFDO
> (with MonoLTO and NewPM) and a ~4% .text size reduction at the same time.
>
>
>
>
>
> *Motivation*
>
> AutoFDO is a big success as it lowers the entry barrier for PGO
> significantly while still delivering substantial performance boost.
> However, there’s still a gap between AutoFDO and its instrumentation
> counterpart. From several failed internal attempts to improve AutoFDO, we
> realized that the bottleneck of AutoFDO lies in its profile quality. With
> the current level of profile quality, it’s difficult to reap more
> performance win because good heuristics are often limited by inferior
> profile. That prompted a systemic effort to investigate and improve AutoFDO
> framework. Specifically, we’re trying to handle the two biggest sources of
> profile quality issues:
>
>
>
>    1. AutoFDO relies on a limited context-sensitive profile collected
>    based on previous inlining. Therefore it can only replay or prune the
>    previous inlining. With the main CGSCC inliner, post-inline counts are not
>    accurate due to scaling of context-less profile, which affects the
>    effectiveness of later passes such as profile-guided code layout.
>    2. Dwarf line and discriminator info aren’t always well-maintained
>    throughout the compilation, thus using them as anchors to map binary
>    samples back to the IR can sometimes be inaccurate, which leads to inferior
>    profile quality and limits PGO performance.
>
> Acknowledged on both issues. We also found that the current AFDO profile
> doesn't keep edge information, which leads to a suboptimal profile in some
> cases. Since the profile format needs to be redesigned for component 1, I
> am wondering whether it is possible to extend the format so that
> it can incorporate edge information as well.
>
>
>
> [wenlei] Yes, we have implemented an “add-on” that could encode edges in
> addition to probes/blocks in the .pseudo_probe section, and we also have a way
> to represent edges in the new profile. But that’s not critical for the
> framework and initial evaluation, which is why it’s not mentioned in this
> RFC. We did that mostly to enable experiments with offline count inference
> algorithms. We will share more details on that later. Curious, what is the
> issue you saw due to the lack of edge info?
>

There are critical edges in the CFG. The compiler cannot infer all the edge
counts from basic block counts when critical edges are involved, so the
probabilities of some branches are imprecise.
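
To make the critical-edge point concrete, here is a minimal sketch (the CFG
shape, block names, and helper names are all made up for illustration). Two
different edge-count assignments produce exactly the same block counts, so
block counts alone cannot tell a 50/50 branch from a 90/10 branch:

```python
from collections import defaultdict

def block_counts(edge_counts, entry, entry_count):
    """Derive block counts as the sum of incoming edge counts."""
    counts = defaultdict(int)
    counts[entry] = entry_count
    for (_, dst), n in edge_counts.items():
        counts[dst] += n
    return dict(counts)

def make_edges(x):
    # entry -> A, entry -> B; A and B each branch to both C and D.
    # A->C, A->D, B->C and B->D are all critical edges: the source has two
    # successors and the destination has two predecessors.
    return {
        ("entry", "A"): 50, ("entry", "B"): 50,
        ("A", "C"): x, ("A", "D"): 50 - x,
        ("B", "C"): 50 - x, ("B", "D"): x,
    }

def prob_a_to_c(edges):
    """Branch probability of A -> C implied by the edge counts."""
    return edges[("A", "C")] / (edges[("A", "C")] + edges[("A", "D")])

even = make_edges(25)
skewed = make_edges(45)

# Identical block counts, different branch probabilities at A.
print(block_counts(even, "entry", 100) == block_counts(skewed, "entry", 100))
print(prob_a_to_c(even), prob_a_to_c(skewed))
```
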


>
> About pseudo probes: it doesn't seem to be mentioned in this proposal, but
> do they still provide the ability to solve the source drift issue you
> mentioned before? If so, how is that achieved?
>
>
>
> [wenlei] Pseudo-probe handles source drift reasonably well, and has good
> resilience against source changes. It can tolerate any source changes that
> don’t alter the CFG, so the issue we ran into with the line-based approach,
> where deleting a comment leads to a big regression, isn’t going to happen
> with pseudo-probe. For changes that do alter the CFG, we could also employ
> fuzzy CFG matching in the future. The bottom line is that using probes and
> the CFG as the profile carrier inherently provides richer info, so it’s
> easier for PGO to see through the source changes and still make sense of a
> stale profile. (We didn’t expand on the source drift issue in the initial
> RFC, but I just mentioned that part in my reply to David, as a secondary
> motivation for pseudo-probe.)
>
>
>
I see.


>
>
>
>
> To lift the above limitations, we’d like to propose an alternative design
> that consists of two components: 1) Context-sensitive sample PGO, 2) Sample
> to IR mapping using pseudo probes. The goal is to further improve sample
> PGO performance while maintaining usability and keeping training runtime
> overhead at zero. In addition, we hope the CSSPGO framework can also open
> up opportunities for new optimizations with more stringent requirements on
> profile quality.
>
>
>
> I like both ideas. Can those two components be orthogonal? For the
> first component, I hope the existing debug-information-based AutoFDO can
> benefit from it as well, with some extension to the current profile
> format.
>
>
>
> [wenlei] Thanks. Yes, they’re orthogonal. But we need both for peak
> performance, and we want to focus tuning effort on the combination. Also
> see my reply to David’s questions.
>
>
>

That is great!


>
>
>
>
> *Context-sensitive Sample PGO*
>
> The effectiveness of BOLT, Propeller and CSPGO all demonstrated the
> importance of context-sensitive profile for PGO. However there are two
> limitations with the existing approaches.
>
>    1. The current solutions focus on leveraging a context-sensitive
>    profile to attain an accurate post-inline profile that helps achieve a
>    better code layout, but do not use the context-sensitive profile to drive
>    better inlining.
>    2. The current solutions involve multiple training processes and
>    profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for
>    BOLT and Propeller), which incurs higher operational cost and complicates
>    the build and release workflow.
>
> We propose a full context-sensitive sample profiling infrastructure that
> utilizes both LBR and call stack samples at the same time to synthesize a
> profile with full context sensitivity. The key advantage is that rather
> than relying on previous inlining or a separate profile, the profile
> collected with the new approach will have full calling contexts recovered
> from both inlined and not inlined call sites. To achieve an accurate
> post-inline profile, a separate profile is no longer needed. Instead, the
> post-inline profile can be directly derived from adjusting the input
> profile based on all inline decisions. The richer context-sensitive profile
> also enables better inline decisions. The infrastructure has two key
> components listed below.
>
>
>
> *Synthesizing context-sensitive LBR with a virtual unwinder*
>
> To make sample PGO’s input profile context aware, we need to know the call
> stack of each LBR fall through path. That is done by sampling LBR and call
> stack simultaneously. With that, each sample will contain a call stack in
> addition to LBR entries. We use level 2 PEBS to control sampling skid so
> that the leaf frame from stack sample aligns with leaf frame from LBR. The
> raw call stack sample describes the calling context for the leaf LBR entry.
> In addition, by unwinding “call” and “return” (including implicit ones from
> inlinee) from LBR entries backwards on top of raw stack samples, we can
> recover the calling context for each of the LBR entries from the sample,
> thus synthesizing context-sensitive LBR profile.
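
As a rough illustration of the backward unwinding described above, here is a
minimal sketch. It is not the actual unwinder: it works on function names
instead of addresses, ignores inlinees, and assumes every LBR entry is
already classified as a call, return, or plain jump. The names `Branch` and
`unwind_contexts` are hypothetical.

```python
from typing import List, NamedTuple, Tuple

class Branch(NamedTuple):
    kind: str  # "call", "return", or "jump"
    src: str   # function containing the branch source
    dst: str   # function containing the branch target

def unwind_contexts(stack: List[str],
                    lbr_newest_first: List[Branch]) -> List[Tuple[str, ...]]:
    """Return the calling context in effect right after each LBR branch."""
    ctx = list(stack)  # outermost caller first, sampled leaf frame last
    contexts = []
    for br in lbr_newest_first:
        # Code that ran after `br` (up to the next newer branch) ran in `ctx`.
        contexts.append(tuple(ctx))
        if br.kind == "call":
            # Before the call executed, the callee frame did not exist yet.
            if ctx and ctx[-1] == br.dst:
                ctx.pop()
        elif br.kind == "return":
            # Before the return executed, the callee frame was still live.
            ctx.append(br.src)
    return contexts

sample_stack = ["main", "foo"]
sample_lbr = [
    Branch("call", "main", "foo"),    # newest: main calls foo
    Branch("return", "bar", "main"),  # bar returned to main
    Branch("call", "main", "bar"),    # oldest: main calls bar
]
print(unwind_contexts(sample_stack, sample_lbr))
```

Walking from the newest entry to the oldest, the raw stack sample gives the
context of the leaf entry, and each call or return is undone in turn to
recover the context of the older entries.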
>
>
>
>
>
> What if the stack unwinding is not intact? For example, tail call
> optimization may currently cause unwinding issues in perf. Frame pointers
> or call frame information may not be properly maintained.
>
>
>
> [wenlei] That’s a good question. Currently, we have frame pointer
> optimization (FPO) and tail call optimization disabled for experiments. FPO
> is disabled for our production builds as well, so it’s not a problem for
> us. For tail calls, we’ll need to evaluate the cost-benefit and see what we
> can do. We know there’s a heuristic to recover a single missing frame due to
> a tail call, which we haven’t implemented yet; beyond that, perhaps we can
> revisit leveraging dwarf unwinding, or live with either an imperfect profile
> or tail calls disabled. We also implemented a special case for samples that
> land in the prolog or epilog, where the frame chain isn’t ready. However,
> even with both FPO and tail calls disabled, we still see truncated stack
> samples, which we’re investigating. But the perf results are with profiles
> containing truncated/imperfect stack samples, so it looks like a small
> portion of imperfect profile doesn’t impact the effectiveness of CSSPGO too
> much.
>
>
>

Thanks.

> We can then generate context-sensitive sample PGO profile using the
> context-sensitive LBR profile. In the new profile, a function’s profile
> becomes a collection of profiles, each representing a profile for a given
> calling context.
>
>
>
> Will the profile size be significantly larger?
>
>
>
> [wenlei] Currently the text CS profile is 1-10x larger. But there are ways
> to bring it down and we’re working on them: 1) trim cold contexts, 2)
> leverage compression from the extended binary format (should be effective
> for context strings that have duplicated long C++ mangled names), 3)
> consider a fixed-length integer context representation, e.g. a rolling hash.
> Also see my replies to David’s question on this.
>

Trimming cold contexts could be very effective.
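
A minimal sketch of what such trimming could look like, assuming a profile is
just a map from calling context to total sample count and that cold contexts
are folded into a context-less entry for the leaf function. The threshold,
the data layout, and the name `trim_cold_contexts` are assumptions for
illustration, not the proposed implementation:

```python
from collections import defaultdict

def trim_cold_contexts(profiles, threshold):
    """Fold context profiles colder than `threshold` into a context-less entry."""
    trimmed = defaultdict(int)
    for context, samples in profiles.items():
        if samples >= threshold:
            trimmed[context] += samples
        else:
            # Drop the calling context and keep only the leaf function.
            trimmed[context[-1:]] += samples
    return dict(trimmed)

profiles = {
    ("main", "bar", "foo"): 279,
    ("main", "zoo", "foo"): 1011,
    ("main", "qux", "foo"): 3,
    ("main", "baz", "foo"): 2,
}
print(trim_cold_contexts(profiles, threshold=10))
```

The hot contexts keep their full context strings, while the two cold ones
collapse into a single entry, so total sample counts are preserved.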

>
>
> *Context-sensitive FDO/PGO framework in LLVM*
>
> In order to leverage context-sensitive profile for inlining, and to
> maintain accurate post-inline counts, we introduced SampleContextTracker which
> is a layer sitting in between input profile and the profile used to
> annotate CFG for optimizations. We also introduced the notion of base
> profile which is the merged profile for function’s profiles from any
> outstanding (not inlined) context, and context profile which is a
> function's profile for a given calling context. The framework includes four
> simple APIs for updating and querying profiles:
>
>
>
> Query API:
>
>    - getBaseSamplesFor: Query base profile by function name.
>    - getContextSamplesFor: Query context profile by calling context and
>    function name.
>
> Update API:
>
>    - MarkContextSamplesInlined: When a function is inlined for a given
>    calling context, we need to mark the context profile for that context as
>    inlined. This is to make sure we don't include inlined context profile when
>    synthesizing the base profile.
>    - PromoteMergeContextSamplesTree: When a function is not inlined for a
>    given calling context, we need to promote the context profile tree to be
>    top-level context. This preserves the child context under that function so
>    later inline decisions for calls originating from that not inlined function
>    will still be driven by an accurate context profile.
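
To illustrate how the four operations above fit together, here is a toy
model. It is not LLVM's SampleContextTracker: a context is reduced to a tuple
of function names (call-site offsets are dropped), a profile is a single
sample count, and all class and method names are hypothetical.

```python
from typing import Dict, Set, Tuple

Context = Tuple[str, ...]  # caller chain, outermost first, ending in the function

class ToyContextTracker:
    def __init__(self, profiles: Dict[Context, int]):
        self.profiles = dict(profiles)
        self.inlined: Set[Context] = set()

    def context_samples(self, context: Context) -> int:
        """Query the profile for one specific calling context."""
        return self.profiles.get(context, 0)

    def base_samples(self, func: str) -> int:
        """Merge every outstanding (not inlined) context that ends in `func`."""
        return sum(n for ctx, n in self.profiles.items()
                   if ctx[-1] == func and ctx not in self.inlined)

    def mark_inlined(self, context: Context) -> None:
        """Exclude an inlined context from future base profiles."""
        self.inlined.add(context)

    def promote(self, context: Context) -> None:
        """Re-root the subtree at `context` so its function becomes top-level."""
        cut = len(context) - 1
        subtree = [c for c in self.profiles if c[:len(context)] == context]
        for ctx in subtree:
            samples = self.profiles.pop(ctx)
            new_ctx = ctx[cut:]
            self.profiles[new_ctx] = self.profiles.get(new_ctx, 0) + samples

tracker = ToyContextTracker({
    ("main", "bar", "foo"): 279,
    ("main", "zoo", "foo"): 1011,
    ("main", "zoo", "foo", "leaf"): 40,
})
print(tracker.base_samples("foo"))        # both contexts are outstanding
tracker.mark_inlined(("main", "bar", "foo"))
print(tracker.base_samples("foo"))        # the inlined context is excluded
tracker.promote(("main", "zoo", "foo"))   # foo was not inlined into zoo
print(tracker.context_samples(("foo", "leaf")))
```

After the promotion, the child context under foo is still available, which is
what lets later inline decisions inside the not-inlined foo use an accurate
context profile.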
>
> These APIs are used by SampleProfileLoader’s inlining, for better inline
> decisions and better post-inline counts. For optimal results, the new
> infrastructure needs to work with a top-down FDO inliner. We added top-down
> FDO inlining under a switch, and the switch was recently turned on by
> default upstream. There are a few other improvements for the FDO inliner
> that we plan to upstream soon.
>
>
>
>
>
> *Pseudo-instrumentation for sample to IR mapping*
>
> Being able to profile production binaries is a key advantage of AutoFDO
> over Instrumentation PGO, but it also comes with a big challenge. While
> using line number and discriminator as anchor for profile mapping incurs
> zero run time overhead for AutoFDO, it’s not as accurate as instrumented
> probes. This is because the instrumented probes are part of the IR, rather
> than metadata attached to the IR like !dbg. That has two implications: 1)
> it’s easier to maintain IR than metadata for optimization passes; 2) probe
> blocks some CFG transformations that can mess up profile correlation.
>
>
>
> With the proposed pseudo instrumentation, we can achieve most of the
> benefit of instrumentation PGO with little runtime overhead. We instrument
> each basic block with a pseudo probe associated with the block Id. Unlike
> in PGO instrumentation where a counter is implemented as a persisting
> operation such as atomic read/write or runtime helper call, a pseudo probe
> is implemented as a dedicated intrinsic call with IntrInaccessibleMemOnly flag.
> The intrinsic comes with most of the semantics of a PGO counter but is
> much less optimization-intrusive.
>
>
>
> The pseudo probe intrinsic calls are on the IR throughout the optimization
> and code generation pipeline and are materialized as a piece of binary data
> stored in a separate .pseudo_probe data section. The section is then used
> to map binary samples back to blocks of CFG during profile generation.
> There are also no real machine instructions generated for a pseudo probe
> and the .pseudo_probe section won’t be loaded into memory at runtime,
> therefore they should incur very little runtime overhead. In fact, we see
> no measurable performance impact from pseudo-instrumentation itself on
> SPEC2006 or big internal workloads.
>
>
>
> *Pseudo-instrumentation integration and Pass Ordering*
>
> One implication from pseudo-probe instrumentation is that the profile is
> now sensitive to CFG changes. We now detect stale profiles for sample PGO
> via a CFG checksum, instead of just using them. However, the potential downside
> is that CFG may change between different versions of the compiler even if
> the source code is unchanged. To solve that problem, we perform the pseudo
> instrumentation very early in the pre-LTO pipeline, before any CFG
> transformation. This ensures that the CFG instrumented and annotated is
> stable. We added SampleProfileProber that performs the pseudo
> instrumentation and runs independent of profile annotation.
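
A minimal sketch of the stale-profile check, assuming the CFG is available as
a map from block Id to its ordered successor list and that the checksum
recorded at profiling time is simply compared with one recomputed at compile
time. The hashing scheme and function names here are assumptions for
illustration; the RFC does not specify how the checksum is computed.

```python
import hashlib

def cfg_checksum(cfg):
    """Checksum a CFG given as {block_id: [successor block ids]}."""
    parts = []
    for block in sorted(cfg):
        succs = ",".join(str(s) for s in cfg[block])
        parts.append(f"{block}->{succs}")
    return hashlib.sha256(";".join(parts).encode()).hexdigest()

def profile_is_stale(profile_checksum, current_cfg):
    """A profile is stale if the CFG it was collected on no longer matches."""
    return profile_checksum != cfg_checksum(current_cfg)

profiled_cfg = {1: [2, 3], 2: [4], 3: [4], 4: []}
recorded = cfg_checksum(profiled_cfg)

# Editing a comment leaves the CFG, and hence the checksum, unchanged.
print(profile_is_stale(recorded, {1: [2, 3], 2: [4], 3: [4], 4: []}))

# Adding a block on one path changes the CFG, so the profile is flagged.
print(profile_is_stale(recorded, {1: [2, 3], 2: [4], 3: [5], 5: [4], 4: []}))
```
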
>
>
>
> A new switch -fpseudo-probe-for-profiling is added to enable sample PGO
> with pseudo instrumentation, similar to -fdebug-info-for-profiling for
> AutoFDO. Input profile is still provided through the same switch used by
> today’s AutoFDO, namely -fprofile-sample-use, and the profile loader will
> automatically decide how to load and annotate profile depending on whether
> input profile is dwarf-based or pseudo-probe based.
>
>
>
>
>
> *New profile format and profile generation*
>
> We extend current profile format in order to be able to represent a full
> context-sensitive profile and also encode pseudo-probe info. This is done
> without drastically diverging from today’s AutoFDO profile format so that
> existing tools and libraries can be reused with minor changes (e.g.
> llvm-profdata, profiler reader and writer).
>
>
>
> For a context-sensitive profile, we extend the profile format by changing
> the function profile header line to include calling context in addition to
> function name. With today’s AutoFDO, we have a single profile header for
> each function to represent its accumulative profile. E.g. This is the
> profile header for foo, with 1290 total samples, and 74 header samples.
>
>
>
> foo:1290:74
>
>
>
> For CSSPGO, we will have multiple profile headers for a single function’s
> profile, each representing profile for a specific calling context as shown
> below. CSSPGO profile header is bracketed to differentiate from today’s
> AutoFDO.
>
>
>
> [main:12 @ bar:3 @ foo]:279:54
>
> [main:19 @ zoo:7 @ foo]:1011:20
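
For illustration, the two header styles above can be told apart and parsed
with a few lines of code. This is only a sketch based on the examples in this
mail, not the actual reader; the function name `parse_header` is hypothetical
and it assumes function names contain neither "@" nor "]".

```python
def parse_header(line):
    """Parse a profile header into (context frames, total samples, head samples)."""
    line = line.strip()
    if line.startswith("["):
        # Bracketed CSSPGO header: [caller:site @ caller:site @ func]:total:head
        context, _, rest = line[1:].partition("]")
        frames = [frame.strip() for frame in context.split("@")]
        _, total, head = rest.split(":")
    else:
        # Today's AutoFDO header: func:total:head
        name, total, head = line.rsplit(":", 2)
        frames = [name]
    return frames, int(total), int(head)

print(parse_header("foo:1290:74"))
print(parse_header("[main:12 @ bar:3 @ foo]:279:54"))
print(parse_header("[main:19 @ zoo:7 @ foo]:1011:20"))
```

The last frame of a bracketed context is the function the profile belongs to,
and the earlier frames are the call sites leading to it.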
>
>
>
> With calling context encoded in the function header, we no longer need a
> nested function profile for inlinees. Instead, a context profile will be
> represented uniformly using context strings in the function profile header,
> regardless of whether the calls in the context are inlined or not. The flat
> structure makes sure that context profile is easily indexable. The change
> is mostly transparent to tools like llvm-profdata. Support for binary
> profile format has not been added yet, but should be easy to do.
>
>
>
> For pseudo-probe, we repurposed the line to count map of AutoFDO profile
> to be block Id to count map. This only changes the interpretation of
> profile content rather than the representation, hence all reader/writer
> helpers can be reused.
>
>
>
> There's a new profile generation tool, llvm-profgen, which implements the
> virtual unwinder for context-sensitive profiling and uses the
> .pseudo_probe section to map the binary profile to a pre-opt CFG profile.
> Since profile generation is a critical piece of the workflow, we’d like to
> propose including the tool as part of LLVM, alongside llvm-profdata.
>
>
>
>
>
> *Preliminary Results*
>
> To quantitatively assess profile quality improvement brought by
> pseudo-instrumentation, we introduce a profile quality metric. We measure
> the metric by first annotating an optimized binary with the MIR block
> execution counts derived from a profile. The binary is then sampled and the
> counts are compared against the annotation. The weighted relative delta is
> used as an indicator for profile quality (lower is better).
>
>
>
> Table below shows the profile quality metric for SPEC2006. We can see from
> the numbers that the profile quality of pseudo-instrumentation sample PGO
> is much better than AutoFDO and close to instrumentation PGO.
>
>
>
> Profile quality metric | Baseline AutoFDO | Instrumentation PGO | Sample PGO w/ Pseudo Instrumentation
> SPEC2006               | 24.58%           | 15.70%              | 16.21%
>
>
>
> We also measured performance and code size on SPEC2006 with CSSPGO. The
> measurement was done with MonoLTO and new pass manager, with tuning for FDO
> inliner to accommodate context-sensitive profile, and used training dataset
> for both pass1 and pass2. The result shows ~2% performance win on top of
> today’s AutoFDO, with ~4% .text reduction, see table below.
>
>
>
> SPEC2006, Geomean Delta % | Performance | Code Size
> AutoFDO over LTO          | 6.80%       | 11.17%
> CSSPGO over LTO           | 8.70%       | 6.66%
> CSSPGO over AutoFDO       | 2.04%       | 4.06%
>
>
>
> While the SPEC2006 benchmark suite is different from large workloads, we
> think the results demonstrated the potential of CSSPGO and served its
> purpose for proof of concept. We plan to continue tuning and start to
> evaluate larger internal workloads soon, and we’d like to upstream our
> work. Feedback is welcome!
>
>
>
>
>
> Thanks,
>
> Wenlei & Hongtao
>
>
>
>
>
>