[llvm-dev] [RFC] Context-sensitive Sample PGO with Pseudo-Instrumentation

Sat Aug 8 14:01:55 PDT 2020

On Sat, Aug 8, 2020 at 1:09 PM Rahman Lavaee <rahmanl at google.com> wrote:

> Thanks for the kind words.
> For the basic block mapping, would it not be sufficient if we add IR basic
> block ids to every BB info record? Since BB info emission is done at the
> end of codegen, the final BB records are all the machine basic blocks which
> have made it into the final binary.
>

My understanding is that pseudo probes need to be inserted early, and that
they do not rely on existing inlining behavior to get context-sensitive info
for profiles.

David

>
> On Sat, Aug 8, 2020 at 10:27 AM Hongtao Yu <hoy at fb.com> wrote:
>
>> Hi Rahman,
>>
>>
>>
>> Thanks for sharing the BB-info section proposal, which is a shiny idea. I
>> think BB-info and pseudo probes deal with a similar problem in
>> different spaces, i.e., mapping hardware samples to the corresponding basic
>> blocks. In the context of pseudo probes, we focus mainly on mapping samples
>> back to source-level blocks, which are the input to the optimizer. Therefore
>> we are building a persistent probe for each block that lives through the many
>> machine-independent and machine-dependent transforms. Besides probing basic
>> blocks, a probe can be used to probe each value site of interest. So far
>> only direct/indirect call sites are supported.
>>
>>
>>
>> *From: *Rahman Lavaee <rahmanl at google.com>
>> *Date: *Saturday, August 8, 2020 at 9:44 AM
>> *To: *Wenlei He <wenlei at fb.com>
>> *Cc: *Hongtao Yu <hoy at fb.com>, Wei Mi <wmi at google.com>, Xinliang David
>> Li <davidxl at google.com>, "llvm-dev at lists.llvm.org" <
>> llvm-dev at lists.llvm.org>
>> *Subject: *Re: [llvm-dev] [RFC] Context-sensitive Sample PGO with
>> Pseudo-Instrumentation
>>
>>
>>
>> Hi Wenlei and Hongtao,
>>
>> This sounds like an interesting (and complex) project. Do you think you
>> can utilize the BB-info section (
>> https://lists.llvm.org/pipermail/llvm-dev/2020-July/143512.html) as
>> an alternative to pseudo probes?
>>
>>
>>
>>
>>
>> On Fri, Aug 7, 2020 at 10:53 PM Wenlei He via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>> See my answers inline.
>>
>>
>>
>> *From: *Xinliang David Li <davidxl at google.com>
>> *Date: *Friday, August 7, 2020 at 7:57 PM
>> *To: *Wenlei He <wenlei at fb.com>
>> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, Wei Mi <
>> wmi at google.com>, Hongtao Yu <hoy at fb.com>
>> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
>> Pseudo-Instrumentation
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Aug 7, 2020 at 4:44 PM Wenlei He <wenlei at fb.com> wrote:
>>
>> Thanks for the thoughtful questions, David. See my answers inline.
>>
>>
>>
>> Thanks,
>>
>> Wenlei
>>
>>
>>
>> *From: *Xinliang David Li <davidxl at google.com>
>> *Date: *Friday, August 7, 2020 at 1:24 PM
>> *To: *Wenlei He <wenlei at fb.com>
>> *Cc: *"llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, Wei Mi <
>> wmi at google.com>, Hongtao Yu <hoy at fb.com>
>> *Subject: *Re: [RFC] Context-sensitive Sample PGO with
>> Pseudo-Instrumentation
>>
>>
>>
>> Wenlei, Thanks for the interesting proposal! please see my replies inline
>> below.
>>
>>
>>
>> On Fri, Aug 7, 2020 at 11:28 AM Wenlei He <wenlei at fb.com> wrote:
>>
>> Hi All,
>>
>> Our team at Facebook is building a new context-sensitive Sample PGO as an
>> alternative to the existing AutoFDO. We’d like to share our motivation,
>> propose a new design, and reveal preliminary results on benchmarks. We will
>> refer to the proposed design as CSSPGO in this RFC.
>>
>>
>>
>> The new CSSPGO leverages simultaneous LBR and stack sampling to construct
>> a full context-sensitive profile.
>>
>>
>>
>>
>>
>> Can you share more details on this? LBR only has 32 entries, so it won't
>> give you full call context, so stack unwinding is needed. What is the
>> overhead you see in production environment?
>>
>>
>>
>> [wenlei] We are not worried about overhead in the production environment, as
>> the sampling rate there is extremely low. We did measure locally, however:
>> with stack sampling and level 2 PEBS on top of regular LBR sampling, the
>> overhead still isn’t very noticeable, but I don’t have numbers at hand.
>>
>>
>>
>>
>>
>>
>>
>> I assume this is with no-omit-frame-pointer option right?
>>
>>
>>
>> [wenlei] Right, and tail call is off too for our experiments, but we may
>> keep it on for production usage later. See my reply to Wei’s question on
>> this.
>>
>>
>>
>>
>>
>>
>>
>> It doesn’t rely on previous inlining like today’s AutoFDO to get
>> context-sensitive profile, and it also doesn’t need a separate post-inline
>> context-sensitive profile like CSPGO.
>>
>>
>>
>> What is the sample profile data size impact with the full context
>> information?
>>
>>
>>
>> [wenlei] Text CS profile is normally around 1x-10x of regular profile
>> size, with all live context included. We plan to trim cold context, which
>> we expect to bring the size down in a meaningful way. Another source of
>> size increase is the context string, which could contain duplicated mangled
>> names (can be very long for C++ templated code), but should be very
>> compressible with the built-in compression support from extended binary
>> profile. We will move to extended binary format, and leverage the
>> compression support if needed. We can also consider more efficient
>> fixed-length integer context representation (similar to rolling hash).
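A minimal sketch of the fixed-length context idea mentioned above, assuming a context is a list of frame strings such as "main:12". The FNV-1a frame hash, the multiplier 31, and all function names are illustrative choices; the thread does not say which encoding would actually be used.

```python
MASK64 = (1 << 64) - 1
FNV_OFFSET = 14695981039346656037  # standard 64-bit FNV-1a offset basis
FNV_PRIME = 1099511628211          # standard 64-bit FNV prime


def frame_hash(frame: str) -> int:
    """64-bit FNV-1a hash of one frame string (hypothetical choice)."""
    h = FNV_OFFSET
    for byte in frame.encode("utf-8"):
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h


def extend_context(ctx_hash: int, frame: str) -> int:
    """Fold one more callee frame into an existing context hash."""
    return (ctx_hash * 31 + frame_hash(frame)) & MASK64


def context_hash(frames) -> int:
    """Hash a whole context, outermost caller first."""
    h = 0
    for frame in frames:
        h = extend_context(h, frame)
    return h
```

Because a context hash can be extended one frame at a time, a child context can be derived from its parent's integer without storing or re-reading the full mangled-name string.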
>>
>>
>>
>>
>>
>> What is the average and max number of live contexts you have seen?
>> Statically it grows exponentially as the depth of the context increases.
>>
>>
>>
>> [wenlei] I guess you meant the ratio of number of live contexts to number
>> of functions? I haven’t looked, but I’d expect profile size ratio to be a
>> good proxy for that.
>>
>>
>>
>> In addition, we introduced pseudo-instrumentation for more accurate
>> mapping from binary samples back to IR, similar to instrumentation PGO, but
>> without any measure-able runtime overhead that is usually associated with
>> instrumentation.
>>
>>
>>
>>
>>
>> Is CSSPGO inherently dependent upon pseudo-probe or is it orthogonal? I
>> hope that it is the latter :)
>>
>>
>>
>> [wenlei] They’re orthogonal. Context-sensitive SPGO can work without
>> pseudo-probe and still use dwarf. Our plan is to keep context-sensitive
>> SPGO working w/ and w/o pseudo-probe functionality-wise, but we only look
>> at perf and tune with the two combined.
>>
>>
>>
>>
>>
>> great.
>>
>>
>>
>>
>>
>> We have a functioning implementation for the new CSSPGO now. Initial
>> results on SPEC2006 show ~2% geomean performance win on top of AutoFDO
>> (with MonoLTO and NewPM) and ~4% .text size reduction at the same time.
>>
>>
>>
>>
>>
>> *Motivation*
>>
>> AutoFDO is a big success as it lowers the entry barrier for PGO
>> significantly while still delivering substantial performance boost.
>> However, there’s still a gap between AutoFDO and its instrumentation
>> counterpart. From several failed internal attempts to improve AutoFDO, we
>> realized that the bottleneck of AutoFDO lies in its profile quality. With
>> the current level of profile quality, it’s difficult to reap more
>> performance win because good heuristics are often limited by inferior
>> profile. That prompted a systemic effort to investigate and improve AutoFDO
>> framework. Specifically, we’re trying to handle the two biggest sources of
>> profile quality issues:
>>
>>
>>
>> 1.       AutoFDO relies on a limited context-sensitive profile collected
>> based on previous inlining. Therefore it can only replay or prune the
>> previous inlining. With the main CGSCC inliner, post-inline counts are not
>> accurate due to scaling of context-less profile, which affects the
>> effectiveness of later passes such as profile-guided code layout.
>>
>>
>>
>> The limitation here is acknowledged.
>>
>>
>>
>> 2.       Dwarf line and discriminator info aren’t always well-maintained
>> throughout the compilation, thus using them as anchors to map binary
>> samples back to the IR can sometimes be inaccurate, which leads to inferior
>> profile quality and limits PGO performance.
>>
>>
>>
>> I think we need more quantification of the impact of using debug
>> information for matching purposes: how much performance is left on the
>> table due to this, and are the issues fixable or not?
>>
>>
>>
>> [wenlei] The first table in the result section is comparing pseudo-probe
>> with AutoFDO and Instr. PGO, all with inlining turned off. So that’s a
>> quantitative assessment of the effectiveness of pseudo-probe. It’s hard to
>> assess performance benefit though, because PGO performance is a function of
>> profile quality and heuristic. Currently heuristics are tuned to cope with
>> the profile quality we have, so it may not do justice for profile quality
>> improvements that pseudo-probe brings us.
>>
>>
>>
>> One example is how FDO inliner evaluates call site. It checks callee’s
>> total sample count instead of callee’s entry count. This is less than
>> ideal, but we couldn’t fix it due to profile quality issues – we can’t
>> reliably get inlinee’s entry count with dwarf based approach, see
>> discussion in https://reviews.llvm.org/D60086.
>> That problem is solved with pseudo-probe, but until we change the inliner,
>> we won’t see perf win from that particular profile quality improvement.
>> There are other similar cases too, and that’s why we used profile quality
>> metric instead of performance to assess pseudo-probe.
>>
>>
>>
>> Can you change the inliner to use entry count when probe based profile is
>> used?
>>
>>
>>
>> [wenlei] Yes, we already made that change, and that’s one of the “few
>> other improvements for the FDO inliner” I mentioned in the RFC. This is
>> one example of the coupling between heuristic and profile quality.
>>
>>
>>
>>
>>
>>
>>
>> Some of the issues may be fixable with dwarf info maintenance, but the
>> engineering cost to find and fix all issues is non-trivial. We think
>> maintaining anchor as IR is a more sustainable alternative than maintaining
>> anchor as metadata.
>>
>>
>>
>>
>>
>> To lift the above limitations, we’d like to propose an alternative design
>> that consists of two components: 1) Context-sensitive sample PGO, 2) Sample
>> to IR mapping using pseudo probes. The goal is to further improve sample
>> PGO performance while maintaining usability and keeping training runtime
>> overhead at zero. In addition, we hope the CSSPGO framework can also open
>> up opportunities for new optimizations with more stringent requirements on
>> profile quality.
>>
>>
>>
>>
>>
>>
>>
>> CSSPGO is a very attractive optimization by itself.  Can you provide more
>> motivation for the pseudo probes?
>>
>>
>>
>> [wenlei] One way to look at the combination of pseudo-probe and
>> context-sensitive sample PGO is that the former brings sample PGO closer
>> to instrumentation PGO, and the latter is to sample PGO what the two-stage
>> CSPGO, or even a post-link optimizer, is to instrumentation PGO. These are two
>> orthogonal problems that need separate solutions.
>>
>>
>>
>>
>>
>> There are also differences though:
>>
>>
>>
>> 1) CSPGO has lots of flow sensitivity and PLO has even more flow
>> sensitivity while CSSPGO does not. CSSPGO has the advantage to guide
>> inliner as well
>>
>>
>>
>> [wenlei] Fair point. Though I’m wondering how much perf win does flow
>> sensitivity bring to PGO? Curious if you have data for this. My guess is
>> context sensitivity is much more visible than flow sensitivity for PGO’s
>> effectiveness.
>>
>>
>>
>> 2) Pseudo-probes are inserted pretty early in the pipeline, so it won't
>> beat instrumentation PGO performance as the latter has early inlining to
>> expose some CS. In other words, Pseudo-probe depends on CSSPGO, but not the
>> other way around.
>>
>>
>>
>> [wenlei] We intentionally insert pseudo-probe early for better resilience
>> to compiler version changes, knowing that context-sensitivity will be
>> covered by CSSPGO. We could also insert pseudo-probe later like Instr PGO
>> to cover some context-sensitivity. We chose to do pseudo instrumentation
>> early because we view the combination as a package, even though the two can be
>> decoupled for a clean design. That said, I agree that it’s easier for CSSPGO
>> to work without pseudo-probe than for pseudo-probe to work without CSSPGO.
>>
>>
>>
>>
>>
>> There’re other secondary motivations for pseudo-probe as well beyond its
>> profile quality benefits that I didn’t mention earlier:
>>
>> 1). Stale profile detection. With line numbers, it’s hard to detect and
>> react to stale profile. Pseudo-probes are tied to blocks so it’s
>> effectively using CFG as carrier for profile, which makes stale profile
>> detection easier.
>>
>> 2). Resilience to source changes. We’ve seen cases where deleting a
>> single line of comment caused an 8% perf regression for a large service
>> because it completely messed up profile annotation for a critical path.
>> That will not happen with pseudo-probe – any source change not altering CFG
>> will be tolerated without perf impact.
>>
>>
>>
>> While this is true, the problem with the CFG-based approach is that a local
>> CFG change can make the whole function lose its profile, which can be bad too.
>> The debug-info-based approach allows partial matching while relying on a
>> propagation algorithm to compensate for the rest.
>>
>>
>>
>> [wenlei] If we want to tolerate local CFG changes and still match the
>> majority of the CFG, we could employ fuzzy CFG matching and still use
>> propagation to infer the unmatched parts. I think that should be easy to
>> do, and still more effective than line-based fuzzy/partial matching. That’s
>> something we planned to implement too.
>>
>>
>>
>> 3). Possibility of offline count inference. We have an experiment that
>> encodes edges alongside with probes (blocks), so more sophisticated offline
>> count inference algorithm is possible to further improve profile quality.
>> Our algorithm researchers are working on new profile inference solution now.
>>
>>
>>
>> This is needed because critical edges cannot be split as in
>> instrumentation-based PGO?
>>
>>
>>
>> [wenlei] Yes, this is one of the cases we want to cover. We also have the
>> option to insert nop for critical edges, but we want to avoid that, as it
>> may lead to visible run time overhead.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *Context-sensitive Sample PGO*
>>
>> The effectiveness of BOLT, Propeller and CSPGO all demonstrated the
>> importance of context-sensitive profile for PGO. However there are two
>> limitations with the existing approaches.
>>
>> 1.       The current solutions focus on leveraging a context-sensitive
>> profile to attain an accurate post-inline profile that helps achieve a
>> better code layout, but do not use the context-sensitive profile to drive
>> better inlining.
>>
>> 2.       The current solutions involve multiple training processes and
>> profiles (e.g. a post-inline profile for CSPGO, or a post-link profile for
>> BOLT and Propeller), which incurs higher operational cost and complicates
>> the build and release workflow.
>>
>> We propose a full context-sensitive sample profiling infrastructure that
>> utilizes both LBR and call stack samples at the same time to synthesize a
>> profile with a full context sensitivity. The key advantage is that rather
>> than relying on previous inlining or a separate profile, the profile
>> collected with the new approach will have full calling contexts recovered
>> from both inlined and not inlined call sites. To achieve an accurate
>> post-inline profile, a separate profile is no longer needed. Instead, the
>> post-inline profile can be directly derived from adjusting the input
>> profile based on all inline decisions. The richer context-sensitive profile
>> also enables better inline decisions. The infrastructure has two key
>> components listed below.
>>
>>
>>
>> *Synthesizing context-sensitive LBR with a virtual unwinder*
>>
>> To make sample PGO’s input profile context aware, we need to know the
>> call stack of each LBR fall through path. That is done by sampling LBR and
>> call stack simultaneously. With that, each sample will contain a call stack
>> in addition to LBR entries. We use level 2 PEBS to control sampling skid so
>> that the leaf frame from stack sample aligns with leaf frame from LBR. The
>> raw call stack sample describes the calling context for the leaf LBR entry.
>> In addition, by unwinding “call” and “return” (including implicit ones from
>> inlinee) from LBR entries backwards on top of raw stack samples, we can
>> recover the calling context for each of the LBR entries from the sample,
>> thus synthesizing context-sensitive LBR profile.
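A minimal sketch of the backward unwinding described above, assuming frames are plain function names, no inlining, and no tail calls. The record layout `(kind, src_func, dst_func)` and the function name are hypothetical and are not taken from the actual tool.

```python
def unwind_contexts(stack, lbr):
    """Recover the calling context for every LBR entry in one sample.

    stack: sampled call stack as function names, outermost caller first.
    lbr:   branch records, newest first; each is (kind, src_func, dst_func)
           with kind in {"call", "return", "jump"}.
    Returns (record, context) pairs, newest first, where context is the
    call stack that held right after that branch was taken.
    """
    context = list(stack)
    result = []
    for kind, src, dst in lbr:
        result.append(((kind, src, dst), tuple(context)))
        # Undo the branch to get the context that held before it.
        if kind == "call":
            # The call pushed dst, so before it dst was not on the stack.
            if context and context[-1] == dst:
                context.pop()
        elif kind == "return":
            # The return popped src, so before it src was the leaf frame.
            context.append(src)
        # Plain jumps stay within a frame and leave the context unchanged.
    return result


sample_stack = ["main", "bar", "foo"]
sample_lbr = [
    ("call", "bar", "foo"),    # newest: second call into foo
    ("return", "foo", "bar"),  # first foo invocation returns
    ("call", "bar", "foo"),    # first call into foo
    ("call", "main", "bar"),   # oldest
]
contexts = [ctx for _, ctx in unwind_contexts(sample_stack, sample_lbr)]
```

Here the raw stack sample only describes the newest entry; the older entries get their contexts by replaying calls and returns in reverse on top of it.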
>>
>>
>>
>> We can then generate context-sensitive sample PGO profile using the
>> context-sensitive LBR profile. In the new profile, a function’s profile
>> becomes a collection of profiles, each representing a profile for a given
>> calling context.
>>
>>
>>
>>
>>
>> Sounds good -- see the overhead question posted at the beginning.
>>
>>
>>
>>
>>
>>
>>
>> *Context-sensitive FDO/PGO framework in LLVM*
>>
>> In order to leverage context-sensitive profile for inlining, and to
>> maintain accurate post-inline counts, we introduced SampleContextTracker which
>> is a layer sitting in between input profile and the profile used to
>> annotate CFG for optimizations. We also introduced the notion of base
>> profile which is the merged profile for function’s profiles from any
>> outstanding (not inlined) context, and context profile which is a
>> function's profile for a given calling context. The framework includes four
>> simple APIs for updating and querying profiles:
>>
>>
>>
>> Query API:
>>
>> ·         getBaseSamplesFor: Query base profile by function name.
>>
>> ·         getContextSamplesFor: Query context profile by calling context
>> and function name.
>>
>> Update API:
>>
>> ·         MarkContextSamplesInlined: When a function is inlined for a
>> given calling context, we need to mark the context profile for that context
>> as inlined. This is to make sure we don't include inlined context profile
>> when synthesizing the base profile.
>>
>> ·         PromoteMergeContextSamplesTree: When a function is not inlined
>> for a given calling context, we need to promote the context profile tree to
>> be top-level context. This preserves the child context under that function
>> so later inline decisions for calls originating from that not inlined
>> function will still be driven by an accurate context profile.
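A toy model of the four operations listed above, assuming a context is a tuple of function names (call-site offsets omitted) and a profile is a single sample count. The class and its snake_case method names only mirror the API names in the RFC; this is not the LLVM implementation.

```python
from collections import defaultdict


class ContextTracker:
    def __init__(self, profiles):
        # profiles: {context tuple ending in the function name: sample count}
        self.samples = dict(profiles)
        self.inlined = set()

    def get_context_samples_for(self, context):
        """Context profile for one exact calling context."""
        return self.samples.get(tuple(context), 0)

    def get_base_samples_for(self, func):
        """Merged profile of func over all contexts not yet inlined."""
        return sum(n for ctx, n in self.samples.items()
                   if ctx[-1] == func and ctx not in self.inlined)

    def mark_context_samples_inlined(self, context):
        """Exclude an inlined context from future base profiles."""
        self.inlined.add(tuple(context))

    def promote_merge_context_samples_tree(self, context):
        """Re-root the subtree under a not-inlined call at its callee."""
        prefix = tuple(context)
        drop = len(prefix) - 1  # keep the not-inlined function itself
        promoted = defaultdict(int)
        for ctx, n in self.samples.items():
            if ctx[:len(prefix)] == prefix and ctx not in self.inlined:
                promoted[ctx[drop:]] += n
            else:
                promoted[ctx] += n
        self.samples = dict(promoted)


tracker = ContextTracker({
    ("main", "bar", "foo"): 279,
    ("main", "zoo", "foo"): 1011,
})
base_before = tracker.get_base_samples_for("foo")
tracker.mark_context_samples_inlined(("main", "zoo", "foo"))
base_after = tracker.get_base_samples_for("foo")
tracker.promote_merge_context_samples_tree(("main", "bar"))
```

In this example, inlining foo along main -> zoo removes its 1011 samples from foo's base profile, and declining to inline bar into main re-roots the remaining foo context at bar.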
>>
>> These APIs are used by SampleProfileLoader’s inlining, for better inline
>> decisions and better post-inline counts. For optimal results, the new
>> infrastructure needs to work with a top-down FDO inliner. We added top-down
>> FDO inlining under a switch, and the switch is turned on by default in
>> upstream recently. There’re a few other improvements for the FDO inliner
>> that we plan to upstream soon.
>>
>>
>>
>> The profile data should be usable by the SCC inliner as well. In
>> bottom-up inlining, as the function gets inlined further up in the call
>> chain, the inline instance has fewer incoming contexts to merge.
>>
>>
>>
>> [wenlei] Yes, we intentionally introduced the SampleContextTracker
>> abstraction that is decoupled from SampleProfileLoader, so it can work with
>> both FDO inliner and SCC inliner. But we expect FDO inliner to take over
>> more inlining for CSSPGO because the FDO inliner is no longer a replay
>> inliner now. And it’s good as top-down inline helps with specialization
>> which is important for context-sensitive inlining.
>>
>>
>>
>>
>>
>>
>>
>> *Pseudo-instrumentation for sample to IR mapping*
>>
>> Being able to profile production binaries is a key advantage of AutoFDO
>> over Instrumentation PGO, but it also comes with a big challenge. While
>> using line number and discriminator as anchor for profile mapping incurs
>> zero run time overhead for AutoFDO, it’s not as accurate as instrumented
>> probes. This is because the instrumented probes are part of the IR, rather
>> than metadata attached to the IR like !dbg. That has two implications:
>> 1) it’s easier to maintain IR than metadata for optimization passes; 2)
>> probe blocks some CFG transformations that can mess up profile correlation.
>>
>>
>>
>> With the proposed pseudo instrumentation, we can achieve most of the
>> benefit of instrumentation PGO with little runtime overhead. We instrument
>> each basic block with a pseudo probe associated with the block Id. Unlike
>> in PGO instrumentation where a counter is implemented as a persisting
>> operation such as atomic read/write or runtime helper call, a pseudo probe
>> is implemented as a dedicated intrinsic call with IntrInaccessibleMemOnly flag.
>> The intrinsic comes with most of the semantics of a PGO counter but is
>> much less optimization-intrusive.
>>
>>
>>
>> The pseudo probe intrinsic calls are on the IR throughout the
>> optimization and code generation pipeline and are materialized as a piece
>> of binary data stored in a separate .pseudo_probe data section.
>>
>>
>>
>> How is this information maintained? Blocks can be eliminated or cloned
>> in many optimization passes: jump threading, taildup, unrolling, peeling
>> etc.  For instance, how to handle the block that is merged into another?
>> Does it lose samples because of this?
>>
>>
>>
>> [wenlei] They are just maintained as part of IR, like any other
>> instructions, without special care. The key difference is they’re part of
>> IR instead of metadata attached to IR. We can categorize relevant CFG
>> transformations into 1) duplication, 2) merge and removal.
>>
>> For any duplication, tail/head dup, unrolling, probe will be duplicated
>> along with other instructions, and we don’t need duplication factor that
>> was used by dwarf-based approach, because counts from duplicated probes
>> will be added together naturally. For merge and removal,
>> IntrInaccessibleMemOnly flag will block it, similar to real probes.
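A minimal sketch of why duplicated probes need no duplication factor, assuming the probe section has already been decoded into a map from instruction address to block id. Both dictionaries and the function name are hypothetical.

```python
from collections import Counter


def block_counts(address_to_probe, address_samples):
    """Sum sample counts per probe (block) id.

    address_to_probe: {instruction address: block id} from the probe section.
    address_samples:  {instruction address: sample count} from the profiler.
    """
    counts = Counter()
    for addr, n in address_samples.items():
        probe = address_to_probe.get(addr)
        if probe is not None:
            counts[probe] += n
    return counts


# Block 3 was duplicated by unrolling, so its probe appears at two addresses.
probes = {0x40: 3, 0x58: 3, 0x70: 4}
samples = {0x40: 70, 0x58: 30, 0x70: 100, 0x90: 5}
counts = block_counts(probes, samples)
```

Both copies of block 3 carry the same id, so their samples accumulate into one count; the address 0x90 has no probe and is ignored in this sketch.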
>>
>>
>>
>> Pseudo-probe is a framework that is tunable. Depending on the semantic we
>> put on the intrinsic, it can be as heavy as a real probe, or as light as a
>> label. IntrInaccessibleMemOnly is a carefully chosen semantic based on
>> our experiments that balances run time overhead and profile quality. Even
>> though it still blocks merging and removal, we didn’t see measurable
>> overhead from SPEC or a large internal workload. But the profile quality
>> improvement is measurable, as the 1st table in the result section shows.
>>
>>
>>
>>
>>
>>
>>
>> The section is then used to map binary samples back to blocks of CFG
>> during profile generation. There are also no real machine instructions
>> generated for a pseudo probe, and the .pseudo_probe section won’t be
>> loaded into memory at runtime, therefore they should incur very little
>> runtime overhead. In fact, we see no measurable performance impact from
>> pseudo-instrumentation itself on SPEC2006 or a big internal workload.
>>
>>
>>
>> How large are the probe sections?
>>
>>
>>
>> [wenlei] About 10% of binary size, another 2% if we encode CFG edges in
>> addition to probes/blocks.
>>
>>
>>
>>
>>
>> *Pseudo-instrumentation integration and Pass Ordering*
>>
>> One implication from pseudo-probe instrumentation is that the profile is
>> now sensitive to CFG changes. We now detect stale profiles for sample PGO
>> via CFG checksum, instead of just using it. However, the potential downside
>> is that CFG may change between different versions of the compiler even if
>> the source code is unchanged. To solve that problem, we perform the pseudo
>> instrumentation very early in the pre-LTO pipeline, before any CFG
>> transformation. This ensures that the CFG instrumented and annotated is
>> stable. We added SampleProfileProber that performs the pseudo
>> instrumentation and runs independent of profile annotation.
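A minimal sketch of stale-profile detection via a CFG checksum, assuming a CFG is given as a map from block id to successor ids. The thread does not say how the checksum is computed; SHA-256 over a canonical edge list is only a stand-in, and the function names are hypothetical.

```python
import hashlib


def cfg_checksum(cfg):
    """cfg maps block id -> list of successor block ids."""
    text = ";".join(
        f"{block}->{','.join(str(s) for s in sorted(cfg[block]))}"
        for block in sorted(cfg)
    )
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def is_stale(recorded_checksum, current_cfg):
    """A profile is stale if the CFG it was collected on has changed."""
    return recorded_checksum != cfg_checksum(current_cfg)


old_cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}
recorded = cfg_checksum(old_cfg)

# A comment-only source edit leaves the CFG, and so the checksum, unchanged.
same_cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}
# Adding a branch introduces a new block and new edges.
new_cfg = {0: [1, 2], 1: [3], 2: [3, 4], 3: [], 4: [3]}
```

With this scheme `is_stale(recorded, same_cfg)` is false and `is_stale(recorded, new_cfg)` is true, which is the property that lets the loader reject a mismatched profile instead of silently applying it.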
>>
>>
>>
>> A new switch -fpseudo-probe-for-profiling is added to enable sample PGO
>> with pseudo instrumentation, similar to -fdebug-info-for-profiling for
>> AutoFDO. Input profile is still provided through the same switch used by
>> today’s AutoFDO, namely -fprofile-sample-use, and the profile loader
>> will automatically decide how to load and annotate profile depending on
>> whether input profile is dwarf-based or pseudo-probe based.
>>
>>
>>
>>
>>
>> Can you compare the source change tolerance of pseudo probe based
>> approach vs debug info based approach?
>>
>>
>>
>> [wenlei] Pseudo-probe should be more resilient to source changes. See my
>> reply for motivation of pseudo-probe. Pseudo-probe will be able to tolerate
>> source changes as long as they don’t alter CFG. On the contrary, changes
>> that delete a comment and shift line offset can cause perf churn with
>> line-based approach. We've been bitten by this a few times – people making
>> comment only change during holiday freeze only to find surprising perf
>> regression due to AutoFDO 😊. It also opens up possibility of fuzzy CFG
>> matching when there’s a CFG mutation due to source change to make it even
>> more resilient.
>>
>>
>>
>>
>>
>> Ok. Also see my reply above. It seems to me that the line shifting
>> problem should be solvable for AFDO (or make it more tolerant).
>>
>>
>>
>> [wenlei] Agreed that we can do better with line number approach too. But
>> CFG as profile carrier has richer info than line, and is closer to profile
>> which is inherently CFG based. So I think it should be easier with probe
>> and CFG.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> *New profile format and profile generation*
>>
>> We extend current profile format in order to be able to represent a full
>> context-sensitive profile and also encode pseudo-probe info. This is done
>> without drastically diverging from today’s AutoFDO profile format so that
>> existing tools and libraries can be reused with minor changes (e.g.
>> llvm-profdata, profiler reader and writer).
>>
>>
>>
>> For a context-sensitive profile, we extend the profile format by changing
>> the function profile header line to include calling context in addition to
>> function name. With today’s AutoFDO, we have a single profile header for
>> each function to represent its accumulative profile. E.g. This is the
>> profile header for foo, with 1290 total samples, and 74 header samples.
>>
>>
>>
>> foo:1290:74
>>
>>
>>
>> For CSSPGO, we will have multiple profile headers for a single function’s
>> profile, each representing profile for a specific calling context as shown
>> below. CSSPGO profile header is bracketed to differentiate from today’s
>> AutoFDO.
>>
>>
>>
>> [main:12 @ bar:3 @ foo]:279:54
>>
>> [main:19 @ zoo:7 @ foo]:1011:20
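A minimal sketch of parsing the two header forms shown above, assuming function names contain no ":" or brackets and that the number after each caller is an integer call-site identifier. The regular expression and function name are illustrative, not the llvm-profdata reader.

```python
import re

HEADER = re.compile(
    r"^(?:\[(?P<ctx>[^\]]+)\]|(?P<name>[^:\[\]]+))"
    r":(?P<total>\d+):(?P<head>\d+)$"
)


def parse_header(line):
    """Return (frames, total_samples, head_samples) for one header line.

    frames is a list of (function name, call site or None), outermost
    caller first; a context-less header yields a single frame.
    """
    m = HEADER.match(line.strip())
    if m is None:
        raise ValueError(f"not a profile header: {line!r}")
    if m.group("ctx") is None:
        frames = [(m.group("name"), None)]
    else:
        frames = []
        for part in m.group("ctx").split(" @ "):
            name, _, site = part.partition(":")
            frames.append((name, int(site) if site else None))
    return frames, int(m.group("total")), int(m.group("head"))


plain = parse_header("foo:1290:74")
contextual = parse_header("[main:12 @ bar:3 @ foo]:279:54")
```

The bracket makes the two forms unambiguous, so a single reader can accept both today's AutoFDO headers and the context-sensitive ones.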
>>
>>
>>
>>
>>
>> sounds good.
>>
>>
>>
>>
>>
>>
>>
>> With calling context encoded in the function header, we no longer need a
>> nested function profile for inlinees. Instead, a context profile will be
>> represented uniformly using context strings in the function profile header,
>> regardless of whether the calls in the context are inlined or not. The flat
>> structure makes sure that context profile is easily indexable. The change
>> is mostly transparent to tools like llvm-profdata. Support for binary
>> profile format has not been added yet, but should be easy to do.
>>
>>
>>
>>
>>
>>
>>
>> It is still useful to annotate (at least with a comment line) whether a
>> profile is for a top-level function or an inline instance.
>>
>>
>>
>> [wenlei] Agreed, and that’s in our plan too - we need that for tuning
>> purpose.
>>
>>
>>
>>
>>
>> For pseudo-probe, we repurposed the line to count map of AutoFDO profile
>> to be block Id to count map. This only changes the interpretation of
>> profile content rather than the representation, hence all reader/writer
>> helpers can be reused.
>>
>>
>>
>> There's a new profile generation tool, llvm-profgen, which implements the
>> virtual unwinder for context-sensitive profiling and uses the
>> .pseudo_probe section to map binary profile to pre-opt CFG profile.
>> Since profile generation is a critical piece of the workflow, we’d like to
>> propose to include the tool as part of LLVM, alongside llvm-profdata.
>>
>>
>>
>>
>>
>> *Preliminary Results*
>>
>> To quantitatively assess profile quality improvement brought by
>> pseudo-instrumentation, we introduce a profile quality metric. We measure
>> the metric by first annotating an optimized binary with the MIR block
>> execution counts derived from a profile. The binary is then sampled and the
>> counts are compared against the annotation. The weighted relative delta is
>> used as an indicator for profile quality (lower is better).
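The RFC does not give the formula for the metric, so the following is only one plausible reading of "weighted relative delta": each block's relative error is weighted by its sampled share, which reduces to the sum of absolute differences between normalized block shares. The function name and the normalization are assumptions.

```python
def weighted_relative_delta(annotated, sampled):
    """Compare two block-count profiles; 0.0 means identical shapes.

    annotated: {block id: count} derived from the input profile.
    sampled:   {block id: count} measured on the optimized binary.
    Counts are normalized to shares first so that the two profiles may
    have different total sample counts. The result lies in [0, 2].
    """
    total_a = sum(annotated.values())
    total_s = sum(sampled.values())
    if total_a == 0 or total_s == 0:
        raise ValueError("both profiles need at least one sample")
    delta = 0.0
    for block in set(annotated) | set(sampled):
        share_a = annotated.get(block, 0) / total_a
        share_s = sampled.get(block, 0) / total_s
        delta += abs(share_a - share_s)
    return delta


annotated = {"b0": 100, "b1": 50, "b2": 50}
sampled = {"b0": 200, "b1": 150, "b2": 50}
delta = weighted_relative_delta(annotated, sampled)
```

In this example the shares are 0.5/0.25/0.25 against 0.5/0.375/0.125, so the delta is 0.25, i.e. 25%.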
>>
>>
>>
>> Table below shows the profile quality metric for SPEC2006. We can see
>> from the numbers that the profile quality of pseudo-instrumentation sample
>> PGO is much better than AutoFDO and close to instrumentation PGO.
>>
>>
>>
>> Profile quality metric | Baseline AutoFDO | Instrumentation PGO | Sample PGO w/ Pseudo Instrumentation
>> SPEC2006               | 24.58%           | 15.70%              | 16.21%
>>
>>
>>
>>
>>
>> Instrumentation PGO does not have context sensitivity, so I would expect
>> it scores worse than CSSPGO. Do you know why it is better here?
>>
>>
>>
>> [wenlei] This is for evaluating effectiveness of pseudo-probe
>> exclusively. We have all inlining turned off for this experiment, and this
>> is without context-sensitive profile for Sample PGO either, so the
>> comparison should be fair, and Instrumentation PGO should be the upper
>> bound.
>>
>>
>>
>>
>>
>> It would be nice to see the main source of precision loss of AFDO here.
>> Probably related to the missing edge information Wei mentioned.
>>
>>
>>
>> [wenlei] The edge count issue Wei mentioned isn’t handled by pseudo probe
>> either, at least not for now. From our investigation, the problem here is
>> more like death by a thousand cuts.
>>
>>
>>
>>
>>
>> thanks,
>>
>>
>>
>> David
>>
>>
>>
>>
>>
>>
>>
>> We also measured performance and code size on SPEC2006 with CSSPGO. The
>> measurement was done with MonoLTO and new pass manager, with tuning for FDO
>> inliner to accommodate context-sensitive profile, and used training dataset
>> for both pass1 and pass2. The result shows ~2% performance win on top of
>> today’s AutoFDO, with ~4% .text reduction, see table below.
>>
>>
>>
>> SPEC2006        | Performance                                              | Code Size
>>                 | AutoFDO over LTO | CSSPGO over LTO | CSSPGO over AutoFDO | AutoFDO over LTO | CSSPGO over LTO | CSSPGO over AutoFDO
>> Geomean Delta % | 6.80%            | 8.70%           | 2.04%               | 11.17%           | 6.66%           | 4.06%
>>
>>
>>
>> While the SPEC2006 benchmark suite is different from large workloads, we
>> think the results demonstrated the potential of CSSPGO and served its
>> purpose for proof of concept. We plan to continue tuning and start to
>> evaluate larger internal workloads soon, and we’d like to upstream our
>> work. Feedback is welcome!
>>
>>
>>
>>
>>
>>
>>
>> What is the performance win with pseudo-probe alone?
>>
>>
>>
>> [wenlei] We don’t have numbers for pseudo-probe alone. As I mentioned
>> earlier, profile quality improvement may not translate directly to perf win
>> without heuristic changes. That’s why we evaluate pseudo-probe exclusively
>> with profile quality metric. The hope is that it will open up opportunity
>> for better optimizations. E.g. it could potentially help the Machine
>> Function Splitting pass too. That said, pseudo-probe does bring extra win
>> for CSSPGO compared to line-based CSSPGO for some benchmarks, but we
>> didn’t tune CSSPGO with line-based profile, so we didn’t aggregate numbers
>> as the comparison isn’t fair either.
>>
>>
>>
>>
>>
>> thanks,
>>
>>
>>
>> David
>>
>>
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Wenlei & Hongtao
>>
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>