[llvm-dev] [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation
Xinliang David Li via llvm-dev
llvm-dev at lists.llvm.org
Sat Aug 8 13:39:20 PDT 2020
On Sat, Aug 8, 2020 at 12:06 PM Wenlei He via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Also see my replies inline.
>
>
>
> *From: *Hongtao Yu <hoy at fb.com>
> *Date: *Saturday, August 8, 2020 at 11:25 AM
> *To: *Xinliang David Li <davidxl at google.com>
> *Cc: *Wenlei He <wenlei at fb.com>, "llvm-dev at lists.llvm.org" <
> llvm-dev at lists.llvm.org>, Wei Mi <wmi at google.com>
> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
> Pseudo-Instrumentation
>
>
>
> Replied inline.
>
>
>
> *From: *Xinliang David Li <davidxl at google.com>
> *Date: *Saturday, August 8, 2020 at 10:55 AM
> *To: *Hongtao Yu <hoy at fb.com>
> *Cc: *Wenlei He <wenlei at fb.com>, "llvm-dev at lists.llvm.org" <
> llvm-dev at lists.llvm.org>, Wei Mi <wmi at google.com>
> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
> Pseudo-Instrumentation
>
>
>
>
>
>
>
> On Fri, Aug 7, 2020 at 11:28 PM Hongtao Yu <hoy at fb.com> wrote:
>
> A few add-ons.
>
>
>
> *From: *Wenlei He <wenlei at fb.com>
> *Date: *Friday, August 7, 2020 at 10:34 PM
> *To: *Xinliang David Li <davidxl at google.com>
> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, Wei Mi <
> wmi at google.com>, Hongtao Yu <hoy at fb.com>
> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
> Pseudo-Instrumentation
>
>
>
> See my answers inline.
>
>
>
> *From: *Xinliang David Li <davidxl at google.com>
> *Date: *Friday, August 7, 2020 at 7:57 PM
> *To: *Wenlei He <wenlei at fb.com>
> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, Wei Mi <
> wmi at google.com>, Hongtao Yu <hoy at fb.com>
> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
> Pseudo-Instrumentation
>
>
>
>
>
>
>
> On Fri, Aug 7, 2020 at 4:44 PM Wenlei He <wenlei at fb.com> wrote:
>
> Thanks for the thoughtful questions, David. See my answers inline.
>
>
>
> Thanks,
>
> Wenlei
>
>
>
> *From: *Xinliang David Li <davidxl at google.com>
> *Date: *Friday, August 7, 2020 at 1:24 PM
> *To: *Wenlei He <wenlei at fb.com>
> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, Wei Mi <
> wmi at google.com>, Hongtao Yu <hoy at fb.com>
> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
> Pseudo-Instrumentation
>
>
>
> Wenlei, thanks for the interesting proposal! Please see my replies inline
> below.
>
>
>
> On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wenlei at fb.com> wrote:
>
> Hi All,
>
> Our team at Facebook is building a new context-sensitive Sample PGO as an
> alternative to the existing AutoFDO. We’d like to share our motivation,
> propose a new design, and reveal preliminary results on benchmarks. We will
> refer to the proposed design as CSSPGO in this RFC.
>
>
>
> The new CSSPGO leverages simultaneous LBR and stack sampling to construct
> a full context-sensitive profile.
>
>
>
>
>
> Can you share more details on this? LBR only has 32 entries, so it won't
> give you full call context, so stack unwinding is needed. What is the
> overhead you see in production environment?
>
>
>
> [wenlei] We are not worried about overhead in production environment as
> the sampling rate there is extremely low. We did measure locally, however:
> with stack sampling and level 2 PEBS on top of regular LBR sampling, the
> overhead still isn’t very noticeable, but I don’t have numbers at hand.
>
>
>
>
>
>
>
> I assume this is with no-omit-frame-pointer option right?
>
>
>
> [wenlei] Right, and tail call is off too for our experiments, but we may
> keep it on for production usage later. See my reply to Wei’s question on
> this.
>
>
>
>
>
>
>
> It doesn’t rely on previous inlining like today’s AutoFDO to get
> context-sensitive profile, and it also doesn’t need a separate post-inline
> context-sensitive profile like CSPGO.
>
>
>
> What is the sample profile data size impact with the full context
> information?
>
>
>
> [wenlei] Text CS profile is normally around 1x-10x of regular profile
> size, with all live context included. We plan to trim cold context, which
> we expect to bring the size down in a meaningful way. Another source of
> size increase is the context string, which could contain duplicated mangled
> names (can be very long for C++ templated code), but should be very
> compressible with the built-in compression support from extended binary
> profile. We will move to extended binary format, and leverage the
> compression support if needed. We can also consider more efficient
> fixed-length integer context representation (similar to rolling hash).
>
>
>
>
>
> What is the average and max number of live contexts you have seen?
> Statically it grows exponentially as the depth of the context increases.
>
>
>
> [wenlei] I guess you meant the ratio of number of live contexts to number
> of functions? I haven’t looked, but I’d expect profile size ratio to be a
> good proxy for that.
>
>
>
> In addition, we introduced pseudo-instrumentation for more accurate
> mapping from binary samples back to IR, similar to instrumentation PGO, but
> without any measurable runtime overhead that is usually associated with
> instrumentation.
>
>
>
>
>
> Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal? I
> hope that it is the latter :)
>
>
>
> [wenlei] They’re orthogonal. Context-sensitive SPGO can work without
> pseudo-probe and still use dwarf. Our plan is to keep context-sensitive
> SPGO working w/ and w/o pseudo-probe functionality-wise, but we only look
> at perf and tune with the two combined.
>
>
>
>
>
> great.
>
>
>
>
>
> We have a functioning implementation for the new CSSPGO now. Initial
> results on SPEC2006 show ~2% geomean performance win on top of AutoFDO
> (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.
>
>
>
>
>
> *Motivation*
>
> AutoFDO is a big success as it lowers the entry barrier for PGO
> significantly while still delivering substantial performance boost.
> However, there’s still a gap between AutoFDO and its instrumentation
> counterpart. From several failed internal attempts to improve AutoFDO, we
> realized that the bottleneck of AutoFDO lies in its profile quality. With
> the current level of profile quality, it’s difficult to reap more
> performance win because good heuristics are often limited by inferior
> profile. That prompted a systemic effort to investigate and improve AutoFDO
> framework. Specifically, we’re trying to handle the two biggest sources of
> profile quality issues:
>
>
>
> 1. AutoFDO relies on a limited context-sensitive profile collected
> based on previous inlining. Therefore it can only replay or prune the
> previous inlining. With the main CGSCC inliner, post-inline counts are not
> accurate due to scaling of context-less profile, which affects the
> effectiveness of later passes such as profile-guided code layout.
>
>
>
> I acknowledge the limitation here.
>
>
>
>
> 2. Dwarf line and discriminator info aren’t always well-maintained
> throughout the compilation, thus using them as anchors to map binary
> samples back to the IR can sometimes be inaccurate, which leads to inferior
> profile quality and limits PGO performance.
>
>
>
> I think we need more quantification of the impact of using debug
> information for matching purposes: how much performance is left on the
> table due to this, and are the issues fixable or not?
>
>
>
> [wenlei] The first table in the result section is comparing pseudo-probe
> with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a
> quantitative assessment of the effectiveness of pseudo-probe. It’s hard to
> assess performance benefit though, because PGO performance is a function of
> profile quality and heuristic. Currently heuristics are tuned to cope with
> the profile quality we have, so they may not do justice to the profile quality
> improvements that pseudo-probe brings us.
>
>
>
> One example is how FDO inliner evaluates call site. It checks callee’s
> total sample count instead of callee’s entry count. This is less than
> ideal, but we couldn’t fix it due to profile quality issues – we can’t
> reliably get inlinee’s entry count with dwarf based approach, see
> discussion in https://reviews.llvm.org/D60086.
> That problem is solved with pseudo-probe, but until we change the inliner,
> we won’t see perf win from that particular profile quality improvement.
> There are other similar cases too, and that’s why we used profile quality
> metric instead of performance to assess pseudo-probe.
>
>
>
> Can you change the inliner to use entry count when probe based profile is
> used?
>
>
>
> [wenlei] Yes, we already made that change, and that’s one of the “few
> other improvements for the FDO inliner” I mentioned in the RFC. This is
> one example of the coupling between heuristic and profile quality.
>
>
>
> [hongtao] Yes, we strive to get to the peak performance with the FDO
> inliner tuned up for the combination of CSSPGO and pseudo probe. We haven't
> tuned for pseudo probe individually despite initially promising results
> over AutoFDO on quite a few SPEC2006 benchmarks.
>
>
>
> It is probably also interesting to see some performance number for large
> server workload :) Topdown inlining can potentially bloat up code a lot
> leading to worse performance for programs with large instruction working
> set -- but this is of course tunable.
>
>
>
> [hongtao] Exactly. We haven’t tried large workloads yet, but that is
> definitely one of our ultimate goals. We did refine the inliner with more
> size controls, but there’s going to be a lot more tuning. Upstreaming
> everything we have is our first step. We hope to see potential
> co-development/coordination in the future.
>
>
>
>
>
> [wenlei] Yes, as Hongtao pointed out the ultimate goal is definitely to
> improve performance of large server workloads. We wanted to start
> upstreaming the changes while working on evaluating perf on larger
> workloads. I think there’s benefit in upstreaming this work now, as it
> makes it possible for others to evaluate early, and also avoid us having to
> keep a large chunk of changes as private patches. What do you think?
>
Sounds good to me. We will discuss how to organize the changes in a way
that is most maintainable in patch reviews.
>
>
> You’re right that we cannot let the top-down inliner run unbounded. The
> current FDO inliner is bounded by the previous SCC inlining as it only does
> replay, so it’s very simple. Top-down inlining with CS profile can go far
> beyond replay, so we needed call site prioritized BFS top-down inlining
> with a growth or size cap, which is already implemented internally. Again,
> this is among the “improvements for the FDO inliner” I mentioned earlier.
> 😊 There’s lots of tuning to be done, and we will likely have to
> constrain the FDO inliner initially, and gradually let it take over more
> inlining for PGO as it matures. But I think the perf and size numbers from
> SPEC are a very good sign.
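
A priority-driven top-down inliner with a size cap, as described in the reply above, could look roughly like the sketch below. This is only an illustration under stated assumptions: the function name, the shape of the inputs (call sites keyed by calling context, a per-function size table), and the greedy "hottest call site first" policy are all hypothetical and not taken from the internal implementation.

```python
import heapq


def top_down_inline(root, calls_by_context, size, size_cap):
    """Greedy top-down inlining sketch (hypothetical, not the real inliner).

    calls_by_context maps a context tuple such as ("main", "zoo") to a list
    of (callee, context_sensitive_count) call sites found in that context.
    size maps a function name to its estimated size.
    """
    inlined = []          # contexts inlined so far, in decision order
    total = size[root]    # running size of the root function
    heap = []             # max-heap on count, emulated with negated counts
    seq = 0               # tie-breaker so equal counts pop in insertion order

    def push_calls(context):
        nonlocal seq
        for callee, count in calls_by_context.get(context, []):
            heapq.heappush(heap, (-count, seq, context + (callee,)))
            seq += 1

    push_calls((root,))
    while heap:
        _, _, context = heapq.heappop(heap)
        callee = context[-1]
        if total + size[callee] > size_cap:
            continue  # over the cap: leave this call site not inlined
        total += size[callee]
        inlined.append(context)
        # Inlining exposes the callee's own call sites in this context.
        push_calls(context)
    return inlined, total
```

With a hot path main -> zoo -> foo and a cold call to bar, the hot chain is inlined first and the cold call site is skipped once the cap would be exceeded.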
>
>
>
> I also wanted to point out that even though we haven’t gotten to the point
> where we have perf numbers for large workloads yet (we simply haven’t tried,
> as we’re still working on refining the infrastructure), we do see quite a
> few cases in large workloads where top-down inlining with CS profile and
> its specialization would help derive better inline decisions.
>
>
>
> Eventually, with all pieces in place, we expect top-down inlining with CS
> profile to save code size, and hence help reduce the working set. This is
> because top-down inlining with CS profile is more selective.
>
>
>
sounds good.
>
>
>
>
>
>
> Some of the issues may be fixable with dwarf info maintenance, but the
> engineering cost to find and fix all issues is non-trivial. We think
> maintaining anchor as IR is a more sustainable alternative than maintaining
> anchor as metadata.
>
>
>
>
>
>
> To lift the above limitations, we’d like to propose an alternative design
> that consists of two components: 1) Context-sensitive sample PGO, 2) Sample
> to IR mapping using pseudo probes. The goal is to further improve sample
> PGO performance while maintaining usability and keeping training runtime
> overhead at zero. In addition, we hope the CSSPGO framework can also open
> up opportunities for new optimizations with more stringent requirements on
> profile quality.
>
>
>
>
>
>
>
> CSSPGO is a very attractive optimization by itself. Can you provide more
> motivation for the pseudo probes?
>
>
>
> [wenlei] One way to look at the combination of pseudo-probe and
> context-sensitive sample PGO is that, the former brings sample PGO closer
> to instrumentation PGO, and the latter to sample PGO is like the two-stage
> CSPGO, or even post-link optimizer to instrumentation PGO. These are two
> orthogonal problems that need separate solutions.
>
>
>
>
>
> There are also differences though:
>
>
>
> 1) CSPGO has lots of flow sensitivity and PLO has even more flow
> sensitivity while CSSPGO does not. CSSPGO has the advantage to guide
> inliner as well
>
>
>
> [wenlei] Fair point. Though I’m wondering how much perf win does flow
> sensitivity bring to PGO? Curious if you have data for this. My guess is
> context sensitivity is much more visible than flow sensitivity for PGO’s
> effectiveness.
>
>
>
> 2) Pseudo-probes are inserted pretty early in the pipeline, so it won't
> beat instrumentation PGO performance as the latter has early inlining to
> expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the
> other way around.
>
>
>
> [wenlei] We intentionally insert pseudo-probe early for better resilience
> to compiler version changes, knowing that context-sensitivity will be
> covered by CSSPGO. We could also insert pseudo-probe later like Instr PGO
> to cover some context-sensitivity. We chose to do pseudo instrumentation
> early because we view the combination as a package, even though they can
> be decoupled for clean design. That said, I agree that it’s easier for
> CSSPGO to work without pseudo-probe than for pseudo-probe to work without
> CSSPGO.
>
>
>
> [hongtao] By flow-sensitivity, do you mean the execution trace of blocks
> in a function?
>
>
>
>
>
> More like the path sensitive profile -- a realistic way of getting that is
> from post cfg transformation profiles. Rong is going to share a proposal
> based on the current AFDO implementation.
>
>
>
> [wenlei] How significant is flow-sensitivity compared to
> context-sensitivity? Looking forward to the proposal, and wondering if it
> can be combined with CSSPGO and pseudo-probe.
>
>
>
They are mostly orthogonal though.
David
> [hongtao] Great, looking forward to Rong’s proposal.
>
> This is missing from CSSPGO currently. Pseudo probe can be viewed as a
> cost-free instrumentation technique that correlates hardware samples to the
> IR for sample profiling. It may never achieve the precision of real
> instrumentation. It is currently combined with CSSPGO to obtain a
> context-sensitive profile. It can also be extended for flow-sensitivity
> (based on LBR) and value profiling (based on hardware register snapshot).
>
>
>
>
>
>
>
> There’re other secondary motivations for pseudo-probe as well beyond its
> profile quality benefits that I didn’t mention earlier:
>
> 1). Stale profile detection. With line numbers, it’s hard to detect and
> react to stale profile. Pseudo-probes are tied to blocks so it’s
> effectively using CFG as carrier for profile, which makes stale profile
> detection easier.
>
> 2). Resilience to source changes. We’ve seen cases where deleting a single
> line of comment caused an 8% perf regression for a large service because it
> completely messed up profile annotation for a critical path. That will not
> happen with pseudo-probe – any source change not altering CFG will be
> tolerated without perf impact.
>
>
>
> While this is true, the problem with the CFG based approach is that a local
> CFG change can make the whole function lose its profile, which can be bad
> too. The debug info based approach allows partial matching while relying on
> a propagation algorithm to compensate for the rest.
>
>
>
> [wenlei] If we want to tolerate a local CFG change and still match the
> majority of the CFG, we could employ fuzzy CFG matching and still use
> propagation to infer the unmatched parts. I think that should be easy to
> do, and still more effective than line based fuzzy/partial matching. That’s
> something we planned to implement too.
>
>
>
> [hongtao] Yes, a local CFG change may invalidate a CFG-based profile. We
> are looking into a fuzzy CFG matching approach to minimize the
> invalidation. It may be based on CFG region analysis and value-numbering
> branch compares and function calls. On the other hand, the debug-info-based
> approach may not be resilient to code refactoring changes or semantics
> changes like branch flipping. We’d like users to be notified about such
> changes so that they can keep their profiles up-to-date.
>
>
>
>
>
> would matching profile with a flipped branch lead to wrong swapping of
> taken/nontaken weights?
>
>
>
> [hongtao] Yes. If user flips the then-else blocks of an if-statement, the
> current compiler will still apply the profile from the original code which
> will lead to wrong branch weights.
>
>
>
> [wenlei] Right, that would lead to wrong weights, which is problematic for
> line-based approach as it cannot tell that flipping has happened. CFG/probe
> based approach can do better.
>
>
>
>
>
> thanks,
>
>
>
> David
>
>
>
>
>
>
>
> 3). Possibility of offline count inference. We have an experiment that
> encodes edges alongside probes (blocks), so more sophisticated offline
> count inference algorithms are possible to further improve profile quality.
> Our algorithm researchers are working on a new profile inference solution now.
>
>
>
> This is needed because critical edges cannot be split as in
> instrumentation based PGO?
>
>
>
> [wenlei] Yes, this is one of the cases we want to cover. We also have the
> option to insert nop for critical edges, but we want to avoid that, as it
> may lead to visible run time overhead.
>
>
>
>
>
>
>
>
>
> *Context-sensitive Sample PGO*
>
> The effectiveness of BOLT, Propeller and CSPGO all demonstrated the
> importance of context-sensitive profile for PGO. However there are two
> limitations with the existing approaches.
>
> 1. The current solutions focus on leveraging a context-sensitive
> profile to attain an accurate post-inline profile that helps achieve a
> better code layout, but do not use the context-sensitive profile to drive
> better inlining.
>
> 2. The current solutions involve multiple training processes and
> profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for
> BOLT and Propeller), which incurs higher operational cost and complicates
> the build and release workflow.
>
> We propose a full context-sensitive sample profiling infrastructure that
> utilizes both LBR and call stack samples at the same time to synthesize a
> profile with full context sensitivity. The key advantage is that rather
> than relying on previous inlining or a separate profile, the profile
> collected with the new approach will have full calling contexts recovered
> from both inlined and not inlined call sites. To achieve an accurate
> post-inline profile, a separate profile is no longer needed. Instead, the
> post-inline profile can be directly derived from adjusting the input
> profile based on all inline decisions. The richer context-sensitive profile
> also enables better inline decisions. The infrastructure has two key
> components listed below.
>
>
>
> *Synthesizing context-sensitive LBR with a virtual unwinder*
>
> To make sample PGO’s input profile context aware, we need to know the call
> stack of each LBR fall through path. That is done by sampling LBR and call
> stack simultaneously. With that, each sample will contain a call stack in
> addition to LBR entries. We use level 2 PEBS to control sampling skid so
> that the leaf frame from stack sample aligns with leaf frame from LBR. The
> raw call stack sample describes the calling context for the leaf LBR entry.
> In addition, by unwinding “call” and “return” (including implicit ones from
> inlinee) from LBR entries backwards on top of raw stack samples, we can
> recover the calling context for each of the LBR entries from the sample,
> thus synthesizing context-sensitive LBR profile.
>
>
>
> We can then generate context-sensitive sample PGO profile using the
> context-sensitive LBR profile. In the new profile, a function’s profile
> becomes a collection of profiles, each representing a profile for a given
> calling context.
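
The backward unwinding described above can be sketched as follows. This is a simplified model under stated assumptions, not the actual unwinder: frames are bare function names rather than addresses, each LBR entry is a hypothetical `(kind, src_func, dst_func)` tuple, inlined frames are ignored, and the function name `unwind_lbr` is made up for illustration.

```python
from collections import Counter


def unwind_lbr(stack, lbr):
    """Attribute each LBR range to a calling context (toy model).

    stack: call stack at sample time, outermost frame first,
           e.g. ["main", "bar", "foo"].
    lbr:   branch records, most recent first; each record is
           (kind, src_func, dst_func) with kind in
           {"call", "return", "jump"}.
    Returns a Counter mapping context tuple -> number of ranges.
    """
    context = list(stack)
    counts = Counter()
    # Code executed after the most recent branch ran in the sampled context.
    counts[tuple(context)] += 1
    for kind, src, dst in lbr:
        if kind == "call":
            # Undo a call: before it, the callee frame was not on the stack.
            if not context or context[-1] != dst:
                break  # stack and LBR disagree; stop unwinding
            context.pop()
        elif kind == "return":
            # Undo a return: before it, the callee (src) was the leaf frame.
            if not context or context[-1] != dst:
                break
            context.append(src)
        # A plain jump stays in the same frame, so the context is unchanged.
        counts[tuple(context)] += 1
    return counts
```

For a sample taken in foo with stack main -> bar -> foo, and an LBR showing that bar had earlier called and returned from zoo, the walk recovers the contexts (main, bar, foo), (main, bar), and (main, bar, zoo), even though zoo is not on the sampled stack.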
>
>
>
>
>
> Sounds good -- see the overhead question posted at the beginning.
>
>
>
>
>
>
>
> *Context-sensitive FDO/PGO framework in LLVM*
>
> In order to leverage context-sensitive profile for inlining, and to
> maintain accurate post-inline counts, we introduced SampleContextTracker which
> is a layer sitting in between input profile and the profile used to
> annotate CFG for optimizations. We also introduced the notion of base
> profile which is the merged profile for function’s profiles from any
> outstanding (not inlined) context, and context profile which is a
> function's profile for a given calling context. The framework includes four
> simple APIs for updating and querying profiles:
>
>
>
> Query API:
>
> · getBaseSamplesFor: Query base profile by function name.
>
> · getContextSamplesFor: Query context profile by calling context
> and function name.
>
> Update API:
>
> · MarkContextSamplesInlined: When a function is inlined for a
> given calling context, we need to mark the context profile for that context
> as inlined. This is to make sure we don't include inlined context profile
> when synthesizing the base profile.
>
> · PromoteMergeContextSamplesTree: When a function is not inlined
> for a given calling context, we need to promote the context profile tree to
> be top-level context. This preserves the child context under that function
> so later inline decisions for calls originating from that not inlined
> function will still be driven by an accurate context profile.
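
To make the roles of the four APIs concrete, here is a toy model of the tracker. It is a sketch under assumptions, not the LLVM class: a profile is reduced to a single sample count, contexts are tuples of function names without call site offsets, and the Python method names merely mirror the API names listed above.

```python
class ToyContextTracker:
    """Toy stand-in for the tracker described above (hypothetical)."""

    def __init__(self, context_profiles):
        # context tuple (outermost caller first, function last) -> samples
        self.profiles = dict(context_profiles)
        self.inlined = set()

    def get_context_samples_for(self, context):
        """Query the profile for one exact calling context."""
        return self.profiles.get(tuple(context), 0)

    def get_base_samples_for(self, func):
        """Merge all outstanding (not inlined) contexts ending in func."""
        return sum(
            count
            for ctx, count in self.profiles.items()
            if ctx[-1] == func and ctx not in self.inlined
        )

    def mark_context_samples_inlined(self, context):
        """Exclude this context from future base profiles."""
        self.inlined.add(tuple(context))

    def promote_merge_context_samples_tree(self, context):
        """The last function in context was not inlined there: re-root its
        whole context subtree so that function becomes the top level."""
        context = tuple(context)
        n = len(context)
        promoted = {}
        for ctx in list(self.profiles):
            if ctx[:n] == context:
                count = self.profiles.pop(ctx)
                new_ctx = ctx[n - 1:]
                promoted[new_ctx] = promoted.get(new_ctx, 0) + count
        for new_ctx, count in promoted.items():
            self.profiles[new_ctx] = self.profiles.get(new_ctx, 0) + count
```

Marking a context as inlined removes its samples from the base profile, while promotion keeps the child contexts (for example foo -> baz) available for later inline decisions inside the not inlined function.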
>
> These APIs are used by SampleProfileLoader’s inlining, for better inline
> decisions and better post-inline counts. For optimal results, the new
> infrastructure needs to work with a top-down FDO inliner. We added top-down
> FDO inlining under a switch, and the switch was recently turned on by
> default upstream. There’re a few other improvements for the FDO inliner
> that we plan to upstream soon.
>
>
>
> The profile data should be usable by the SCC inliner as well. In
> bottom-up inlining, as the function gets inlined further up in the call
> chain, the inline instance has fewer incoming contexts to merge.
>
>
>
> [wenlei] Yes, we intentionally introduced the SampleContextTracker
> abstraction that is decoupled from SampleProfileLoader, so it can work with
> both FDO inliner and SCC inliner. But we expect FDO inliner to take over
> more inlining for CSSPGO because the FDO inliner is no longer a replay
> inliner now. That is good, as top-down inlining helps with specialization,
> which is important for context-sensitive inlining.
>
>
>
>
>
>
>
> *Pseudo-instrumentation for sample to IR mapping*
>
> Being able to profile production binaries is a key advantage of AutoFDO
> over Instrumentation PGO, but it also comes with a big challenge. While
> using line number and discriminator as anchor for profile mapping incurs
> zero run time overhead for AutoFDO, it’s not as accurate as instrumented
> probes. This is because the instrumented probes are part of the IR, rather
> than metadata attached to the IR like !dbg. That has two implications: 1)
> it’s easier to maintain IR than metadata for optimization passes; 2) probe
> blocks some CFG transformations that can mess up profile correlation.
>
>
>
> With the proposed pseudo instrumentation, we can achieve most of the
> benefit of instrumentation PGO with little runtime overhead. We instrument
> each basic block with a pseudo probe associated with the block Id. Unlike
> in PGO instrumentation where a counter is implemented as a persisting
> operation such as atomic read/write or runtime helper call, a pseudo probe
> is implemented as a dedicated intrinsic call with IntrInaccessibleMemOnly flag.
> The intrinsic comes with most of the semantics of a PGO counter but is
> much less optimization-intrusive.
>
>
>
> The pseudo probe intrinsic calls are on the IR throughout the optimization
> and code generation pipeline and are materialized as a piece of binary data
> stored in a separate .pseudo_probe data section.
>
>
>
> How is this information maintained? Blocks can be eliminated or cloned
> in many optimization passes: jump threading, taildup, unrolling, peeling
> etc. For instance, how to handle the block that is merged into another?
> Does it lose samples because of this?
>
>
>
> [wenlei] They are just maintained as part of IR, like any other
> instructions, without special care. The key difference is they’re part of
> IR instead of metadata attached to IR. We can categorize relevant CFG
> transformations into 1) duplication, 2) merge and removal.
>
> For any duplication, tail/head dup, unrolling, probe will be duplicated
> along with other instructions, and we don’t need duplication factor that
> was used by dwarf-based approach, because counts from duplicated probes
> will be added together naturally. For merge and removal,
> IntrInaccessibleMemOnly flag will block it, similar to real probes.
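
The point about duplication can be illustrated with a small sketch: when a block is duplicated, both copies carry a probe with the same Id, so summing samples by probe Id recovers the count of the original block without a duplication factor. The address-to-probe table and the sample layout below are simplified assumptions for illustration, not the actual .pseudo_probe encoding.

```python
from collections import Counter


def block_counts(probe_at_address, samples):
    """Sum sample counts per probe Id (hypothetical data layout).

    probe_at_address: binary address -> probe (block) Id
    samples:          binary address -> sample count
    """
    counts = Counter()
    for address, n in samples.items():
        probe_id = probe_at_address.get(address)
        if probe_id is not None:  # ignore addresses that carry no probe
            counts[probe_id] += n
    return counts
```

Here block 2 could have been duplicated to two addresses, for instance by tail duplication, and its two copies would simply add up.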
>
>
>
> Pseudo-probe is a framework that is tunable. Depending on the semantics we
> put on the intrinsic, it can be as heavy as a real probe, or as light as a
> label. IntrInaccessibleMemOnly is a carefully chosen semantic, based on
> our experiments, that balances run time overhead and profile quality. It
> still blocks merging and removal, yet we didn’t see measurable overhead
> from SPEC or a large internal workload, while the profile quality
> improvement is measurable, as the 1st table in the result section shows.
>
>
>
>
>
>
>
> The section is then used to map binary samples back to blocks of CFG
> during profile generation. There are also no real machine instructions
> generated for a pseudo probe and the .pseudo_probe section won’t be loaded
> into memory at runtime, therefore they should incur very little runtime
> overhead. In fact, we see no measurable performance impact from
> pseudo-instrumentation itself on SPEC2006 or a big internal workload.
>
>
>
> How large are the probe sections?
>
>
>
> [wenlei] About 10% of binary size, another 2% if we encode CFG edges in
> addition to probes/blocks.
>
>
>
>
>
> *Pseudo-instrumentation integration and Pass Ordering*
>
> One implication from pseudo-probe instrumentation is that the profile is
> now sensitive to CFG changes. We now detect stale profiles for sample PGO
> via CFG checksum, instead of just using them. However, the potential downside
> is that CFG may change between different versions of the compiler even if
> the source code is unchanged. To solve that problem, we perform the pseudo
> instrumentation very early in the pre-LTO pipeline, before any CFG
> transformation. This ensures that the CFG instrumented and annotated is
> stable. We added SampleProfileProber that performs the pseudo
> instrumentation and runs independent of profile annotation.
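
As a rough illustration of stale profile detection via a CFG checksum, the sketch below hashes a function's block and successor structure and compares it with the checksum recorded alongside the profile. The CFG representation, the use of CRC32, and both function names are assumptions made for this example; the RFC does not say how the checksum is computed.

```python
import zlib


def cfg_checksum(cfg):
    """Checksum of a CFG given as {block_id: [successor_ids]} (toy model)."""
    text = ";".join(
        f"{block}->{','.join(str(s) for s in succs)}"
        for block, succs in sorted(cfg.items())
    )
    return zlib.crc32(text.encode("ascii"))


def is_stale(profile_checksum, cfg):
    """A profile is stale when its recorded checksum no longer matches."""
    return profile_checksum != cfg_checksum(cfg)
```

A comment-only source change leaves the CFG, and therefore the checksum, untouched, whereas retargeting an edge changes the checksum and flags the profile as stale.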
>
>
>
> A new switch -fpseudo-probe-for-profiling is added to enable sample PGO
> with pseudo instrumentation, similar to -fdebug-info-for-profiling for
> AutoFDO. Input profile is still provided through the same switch used by
> today’s AutoFDO, namely -fprofile-sample-use, and the profile loader will
> automatically decide how to load and annotate profile depending on whether
> input profile is dwarf-based or pseudo-probe based.
>
>
>
>
>
> Can you compare the source change tolerance of pseudo probe based approach
> vs debug info based approach?
>
>
>
> [wenlei] Pseudo-probe should be more resilient to source changes. See my
> reply for motivation of pseudo-probe. Pseudo-probe will be able to tolerate
> source changes as long as they don’t alter CFG. On the contrary, changes
> that delete a comment and shift line offset can cause perf churn with
> line-based approach. We've been bitten by this a few times: people made
> comment-only changes during holiday freeze only to find a surprising perf
> regression due to AutoFDO 😊. It also opens up the possibility of fuzzy CFG
> matching when there’s a CFG mutation due to source change to make it even
> more resilient.
>
>
>
>
>
> Ok. Also see my reply above. It seems to me that the line shifting problem
> should be solvable for AFDO (or make it more tolerant).
>
>
>
> [wenlei] Agreed that we can do better with line number approach too. But
> CFG as profile carrier has richer info than line, and is closer to profile
> which is inherently CFG based. So I think it should be easier with probe
> and CFG.
>
>
>
>
>
>
>
>
>
> *New profile format and profile generation*
>
> We extend current profile format in order to be able to represent a full
> context-sensitive profile and also encode pseudo-probe info. This is done
> without drastically diverging from today’s AutoFDO profile format so that
> existing tools and libraries can be reused with minor changes (e.g.
> llvm-profdata, profiler reader and writer).
>
>
>
> For a context-sensitive profile, we extend the profile format by changing
> the function profile header line to include calling context in addition to
> function name. With today’s AutoFDO, we have a single profile header for
> each function to represent its accumulative profile. E.g. This is the
> profile header for foo, with 1290 total samples, and 74 header samples.
>
>
>
> foo:1290:74
>
>
>
> For CSSPGO, we will have multiple profile headers for a single function’s
> profile, each representing profile for a specific calling context as shown
> below. CSSPGO profile header is bracketed to differentiate from today’s
> AutoFDO.
>
>
>
> [main:12 @ bar:3 @ foo]:279:54
>
> [main:19 @ zoo:7 @ foo]:1011:20
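
A reader for the two header styles shown above could be sketched like this. It assumes that a call site is a plain integer after the colon and that frames are separated by " @ ", exactly as in the examples; the function name `parse_header` and the returned tuple layout are hypothetical and not part of llvm-profdata.

```python
import re

# Either "[ctx]:total:head" (CSSPGO) or "name:total:head" (today's AutoFDO).
HEADER_RE = re.compile(
    r"^(?:\[(?P<ctx>[^\]]+)\]|(?P<name>[^:\[\]]+)):(?P<total>\d+):(?P<head>\d+)$"
)


def parse_header(line):
    """Return (frames, total_samples, head_samples).

    frames is a list of (function, call_site) pairs, outermost caller
    first; call_site is None for the last (profiled) function.
    """
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a profile header: {line!r}")
    if match.group("ctx") is None:
        frames = [(match.group("name"), None)]
    else:
        frames = []
        for part in match.group("ctx").split(" @ "):
            name, sep, site = part.partition(":")
            frames.append((name, int(site) if sep else None))
    return frames, int(match.group("total")), int(match.group("head"))
```

Note that in these examples the two context profiles for foo (279 and 1011 total samples) add up to the 1290 total samples of the context-insensitive header.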
>
>
>
>
>
> sounds good.
>
>
>
>
>
>
>
> With calling context encoded in the function header, we no longer need a
> nested function profile for inlinees. Instead, a context profile will be
> represented uniformly using context strings in the function profile header,
> regardless of whether the calls in the context are inlined or not. The flat
> structure makes sure that context profile is easily indexable. The change
> is mostly transparent to tools like llvm-profdata. Support for binary
> profile format has not been added yet, but should be easy to do.
>
>
>
>
>
>
>
> It is still useful to annotate (at least with a comment line) whether a
> profile is for a top-level function or an inline instance.
>
>
>
> [wenlei] Agreed, and that’s in our plan too - we need that for tuning
> purpose.
>
>
>
>
>
> For pseudo-probe, we repurposed the line to count map of AutoFDO profile
> to be block Id to count map. This only changes the interpretation of
> profile content rather than the representation, hence all reader/writer
> helpers can be reused.
>
>
>
> There's a new profile generation tool, llvm-profgen, which implements the
> virtual unwinder for context-sensitive profiling and uses the
> .pseudo_probe section to map binary profile to pre-opt CFG profile. Since
> profile generation is a critical piece of the workflow, we’d like to
> propose including the tool as part of LLVM, alongside llvm-profdata.
>
>
>
>
>
> *Preliminary Results*
>
> To quantitatively assess profile quality improvement brought by
> pseudo-instrumentation, we introduce a profile quality metric. We measure
> the metric by first annotating an optimized binary with the MIR block
> execution counts derived from a profile. The binary is then sampled and the
> counts are compared against the annotation. The weighted relative delta is
> used as an indicator for profile quality (lower is better).
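
The exact formula for the weighted relative delta is not spelled out here. One plausible reading, shown purely as an assumption, is to scale the annotated counts to the sampled total and weight each block's relative error by its sampled count, which reduces to the sum of absolute differences divided by the sampled total. The function name and the dictionary inputs are hypothetical.

```python
def weighted_relative_delta(annotated, sampled):
    """One possible 'weighted relative delta' (an assumed definition).

    annotated: block -> count the compiler annotated from the profile
    sampled:   block -> count measured by sampling the optimized binary
    Returns 0.0 for a perfect match; larger values mean worse quality.
    """
    total_annotated = sum(annotated.values())
    total_sampled = sum(sampled.values())
    if total_annotated == 0 or total_sampled == 0:
        raise ValueError("both profiles need at least one sample")
    # Bring both profiles to the same scale before comparing.
    scale = total_sampled / total_annotated
    blocks = annotated.keys() | sampled.keys()
    delta = sum(
        abs(annotated.get(b, 0) * scale - sampled.get(b, 0)) for b in blocks
    )
    return delta / total_sampled
```

Under this reading, a result of 0.2 would be reported as 20%.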
>
>
>
> Table below shows the profile quality metric for SPEC2006. We can see from
> the numbers that the profile quality of pseudo-instrumentation sample PGO
> is much better than AutoFDO and close to instrumentation PGO.
>
>
>
> Profile quality metric (lower is better)
>
>            Baseline AutoFDO   Instrumentation PGO   Sample PGO w/ Pseudo Instrumentation
> SPEC2006   24.58%             15.70%                16.21%
>
>
>
>
>
> Instrumentation PGO does not have context sensitivity, so I would expect
> it scores worse than CSSPGO. Do you know why it is better here?
>
>
>
> [wenlei] This is for evaluating effectiveness of pseudo-probe exclusively.
> We have all inlining turned off for this experiment, and this is without
> context-sensitive profile for Sample PGO either, so the comparison should
> be fair, and Instrumentation PGO should be the upper bound.
>
>
>
>
>
> It would be nice to see the main source of precision loss of AFDO here.
> Probably related to the missing edge information Wei mentioned.
>
>
>
> [wenlei] The edge count issue Wei mentioned isn’t handled by pseudo probe
> either, at least not for now. From our investigation, the problem here is
> more like death by a thousand cuts.
>
>
>
>
>
> thanks,
>
>
>
> David
>
>
>
>
>
>
>
> We also measured performance and code size on SPEC2006 with CSSPGO. The
> measurement was done with MonoLTO and new pass manager, with tuning for FDO
> inliner to accommodate context-sensitive profile, and used training dataset
> for both pass1 and pass2. The result shows ~2% performance win on top of
> today’s AutoFDO, with ~4% .text reduction, see table below.
>
>
>
> SPEC2006, Geomean Delta %
>
>               AutoFDO over LTO   CSSPGO over LTO   CSSPGO over AutoFDO
> Performance   6.80%              8.70%             2.04%
> Code Size     11.17%             6.66%             4.06%
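
As a sanity check on the code size figures in the table above, the "CSSPGO over AutoFDO" value can be derived from the two "over LTO" values, assuming (this is an interpretation, not stated in the RFC) that 11.17% and 6.66% are .text size increases relative to LTO and that 4.06% is a reduction relative to AutoFDO.

```python
def relative_reduction(increase_a_pct, increase_b_pct):
    """Size reduction of B relative to A, both given as % increase over
    a common baseline (assumed interpretation of the table)."""
    size_a = 1.0 + increase_a_pct / 100.0
    size_b = 1.0 + increase_b_pct / 100.0
    return (1.0 - size_b / size_a) * 100.0


# AutoFDO: +11.17% over LTO, CSSPGO: +6.66% over LTO
csspgo_vs_autofdo = relative_reduction(11.17, 6.66)
print(f"{csspgo_vs_autofdo:.2f}%")
```

Under that assumption the ratio 1.0666 / 1.1117 gives a reduction of about 4.06%, which agrees with the reported number.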
>
>
>
> While the SPEC2006 benchmark suite is different from large workloads, we
> think the results demonstrated the potential of CSSPGO and served its
> purpose for proof of concept. We plan to continue tuning and start to
> evaluate larger internal workloads soon, and we’d like to upstream our
> work. Feedback is welcome!
>
>
>
>
>
>
>
> What is the performance win with pseudo-probe alone?
>
>
>
> [wenlei] We don’t have numbers for pseudo-probe alone. As I mentioned
> earlier, profile quality improvement may not translate directly to perf win
> without heuristic changes. That’s why we evaluate pseudo-probe exclusively
> with profile quality metric. The hope is that it will open up opportunity
> for better optimizations. E.g. it could potentially help the Machine
> Function Splitting pass too. That said, pseudo-probe does bring extra win
> for CSSPGO compared to line-based CSSPGO for some benchmarks, but we
> didn’t tune CSSPGO with line-based profile, so we didn’t aggregate numbers
> as the comparison isn’t fair either.
>
>
>
>
>
> thanks,
>
>
>
> David
>
>
>
>
>
>
>
> Thanks,
>
> Wenlei & Hongtao
>
>
>
>
>