[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Fedor Sergeev via llvm-dev
llvm-dev at lists.llvm.org
Mon Mar 4 14:05:20 PST 2019
On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
> On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi <yamauchi at google.com
> <mailto:yamauchi at google.com>> wrote:
> On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev
> <fedor.sergeev at azul.com <mailto:fedor.sergeev at azul.com>> wrote:
> On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>> Here's a sketch of the proposed approach for just one
>> pass (but imagine more)
>> On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via llvm-dev
>> <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> On 2/28/19 12:47 AM, Hiroshi Yamauchi via llvm-dev wrote:
>>> Hi all,
>>> To implement more profile-guided optimizations, we’d
>>> like to use ProfileSummaryInfo (PSI) and
>>> BlockFrequencyInfo (BFI) from more passes of various
>>> types, under the new pass manager.
>>> The following is what we came up with. Would appreciate
>>> feedback. Thanks.
>>> It’s not obvious (to me) how best to do this, given that
>>> we cannot request an outer-scope analysis result from an
>>> inner-scope pass through analysis managers, and that
>>> we might unnecessarily run some analyses unless we
>>> conditionally build pass pipelines for PGO cases.
>> Indeed, this is an intentional restriction in new pass
>> manager, which is more or less a reflection of a
>> fundamental property of outer-inner IRUnit relationship
>> and transformations/analyses run on those units. The main
>> intent for having those inner IRUnits (e.g. Loops) is to
>> run local transformations and save compile time
>> by being local to a particular small piece of IR. The Loop
>> Pass manager allows you to run a whole pipeline of
>> different transformations, still locally, amplifying the savings.
>> As soon as you run function-level analysis from within
>> the loop pipeline you essentially break this pipelining.
>> Say, as you run your loop transformation it modifies the
>> loop (and the function) and potentially invalidates the
>> analysis, so you have to rerun your analysis again and again. Hence
>> instead of saving compile time it ends up increasing it.
>> I have hit this issue somewhat recently with dependency
>> of loop passes on BranchProbabilityInfo.
>> (some loop passes, like IRCE, can use it for profitability).
>> The only solution that appears to be reasonable there is
>> to teach all the loops passes that need to be pipelined
>> to preserve BPI (or any other module/function-level
>> analyses) similar to how they preserve DominatorTree and
>> other "LoopStandard" analyses.
>> Is this implemented - do the loop passes preserve BPI?
> Nope, not implemented right now.
> One of the problems is that even the loop canonicalization passes
> run at the start of the loop pass manager don't preserve it
> (and at least LoopSimplifyCFG does change control flow).
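What "teaching a loop pass to preserve BPI" might look like under the new pass manager, sketched against the current API (untested; `MyLoopPass` is hypothetical, and it assumes the pass has actually kept BPI up to date as it transformed the CFG, which is exactly the open problem):

```cpp
#include "llvm/Analysis/BranchProbabilityInfo.h"
#include "llvm/Transforms/Scalar/LoopPassManager.h"

using namespace llvm;

PreservedAnalyses MyLoopPass::run(Loop &L, LoopAnalysisManager &AM,
                                  LoopStandardAnalysisResults &AR,
                                  LPMUpdater &U) {
  // ... transform the loop, incrementally updating BPI as the CFG
  // changes (designing this update interface is the hard part) ...

  // Report the standard loop analyses plus BPI as preserved, the same
  // way loop passes already report DominatorTree and friends.
  auto PA = getLoopPassPreservedAnalyses();
  PA.preserve<BranchProbabilityAnalysis>();
  return PA;
}
```

Without the incremental update, marking the analysis preserved would simply be wrong, so the preservation claim and the update machinery have to land together.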
>> In buildFunctionSimplificationPipeline
>> (where LoopFullUnrollPass is added as in the sketch),
>> LoopOptimizerEndEPCallbacks seems to allow arbitrary
>> loop passes to be inserted into the pipelines (via flags)?
>> I wonder how hard it'd be to teach all the relevant loop
>> passes to preserve BFI(or BPI)..
> Well, each time you restructure control flow around the loops
> you will have to update those extra analyses,
> pretty much the same way as DT is being updated.
> The trick is to design a proper update interface (and then
> implement it ;) ).
> And I have not spent enough time on this issue to get a good
> idea of what that interface would be.
> Hm, sounds non-trivial :) noting BFI depends on BPI.
> To step back, it looks like:
> want to use profiles from more passes -> need to get BFI (from loop
> passes) -> need all the loop passes to preserve BFI.
> I wonder if there's no way around this.
Indeed. I believe this is a general consensus here.
>>> It seems that for different types of passes to be able
>>> to get PSI and BFI, we’d need to ensure PSI is cached
>>> for a non-module pass, and PSI, BFI and the
>>> ModuleAnalysisManager proxy are cached for a loop pass
>>> in the pass pipelines. This may mean potentially needing
>>> to insert BFI/PSI in front of many passes. It is
>>> not obvious how to conditionally insert BFI for PGO
>>> pipelines because there isn’t always a good flag to
>>> detect PGO cases, or we tend to build pass pipelines
>>> before examining the code (or without propagating enough
>>> info down).
>>> Proposed approach
>>> - Cache PSI right after the profile summary in the IR is
>>> written in the pass pipeline. This would avoid the
>>> need to insert RequireAnalysisPass for PSI before each
>>> non-module pass that needs it. PSI can technically be
>>> invalidated, but that is unlikely; if it is, we insert
>>> another RequireAnalysisPass.
>>> - Conditionally insert RequireAnalysisPass for BFI, if
>>> PGO, right before each loop pass that needs it. This
>>> doesn't seem avoidable because BFI can be invalidated
>>> whenever the CFG changes. We detect PGO based on the
>>> command line flags and/or whether the module has the
>>> profile summary info (we may need to pass the module to
>>> more functions.)
>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a
>>> loop pass to be able to get to the ModuleAnalysisManager
>>> in one step and PSI through it.
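The three pieces of the proposed approach might be wired up roughly as follows in PassBuilder-style pipeline construction (a sketch only, untested; `ModuleAnalysisManagerLoopProxy` is the *proposed* proxy, not an existing class, and `PGOOpt` stands in for however PGO is detected):

```cpp
// 1. Cache PSI right after the profile summary is written to the IR:
MPM.addPass(PGOInstrumentationUse(ProfileFile));
MPM.addPass(RequireAnalysisPass<ProfileSummaryAnalysis, Module>());

// 2. If building a PGO pipeline, force BFI right before a loop pass
//    that needs it (BFI may have been invalidated by CFG changes):
FunctionPassManager FPM;
if (PGOOpt) // from flags and/or the module's profile summary
  FPM.addPass(RequireAnalysisPass<BlockFrequencyAnalysis, Function>());
FPM.addPass(createFunctionToLoopPassAdaptor(LoopFullUnrollPass()));

// 3. Inside the loop pass, reach PSI through the proposed proxy,
//    using only getCachedResult (never getResult for outer scopes):
//      auto &MAMProxy = AM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR);
//      auto *PSI = MAMProxy.getCachedResult<ProfileSummaryAnalysis>(M);
```

Step 1 makes step 3's `getCachedResult` reliable, and step 2 is the part that has to be repeated per loop pass because BFI, unlike PSI, is invalidated by ordinary CFG transformations.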
>>> Alternative approaches
>>> Dropping BFI and using PSI only
>>> We could consider not using BFI and relying solely on
>>> PSI and function-level profiles (as opposed to
>>> block-level), but profile precision would suffer.
>>> Computing BFI in-place
>>> We could consider computing BFI “in-place” by directly
>>> running BFI outside of the pass manager. This would
>>> let us avoid the analysis manager constraints but
>>> it would still involve running an outer-scope analysis
>>> from an inner-scope pass and potentially cause problems
>>> in terms of pass pipelining and concurrency. Moreover, a
>>> potential downside of running analyses in-place is that
>>> it won’t take advantage of cached analysis results
>>> provided by the pass manager.
>>> Adding inner-scope versions of PSI and BFI
>>> We could consider adding a function-level and loop-level
>>> PSI and loop-level BFI, which internally act like their
>>> outer-scope versions but provide inner-scope results
>>> only. This way, we could always call getResult for PSI
>>> and BFI. However, this would still involve running an
>>> outer-scope analysis from an inner-scope pass.
>>> Caching the FAM and the MAM proxies
>>> We could consider caching the FunctionAnalysisManager
>>> and the ModuleAnalysisManager proxies once early on
>>> instead of adding a new proxy. But this is not likely
>>> to work well, because the analysis cache key type
>>> includes the function or the module and some pass may
>>> add a new function for which the proxy wouldn’t be
>>> cached. We’d need to write and insert a pass in select
>>> locations to just fill the cache. Adding the new proxy
>>> would take care of these with a three-line change.
>>> Conditional BFI
>>> We could consider adding a conditional BFI analysis that
>>> is a wrapper around BFI and computes BFI only if
>>> profiles are available (either checking the module has
>>> profile summary or depending on PSI.) With this, we
>>> wouldn’t need to conditionally build pass pipelines, and
>>> it may work for the new pass manager. But a similar approach
>>> wouldn’t work for the old pass manager because we cannot
>>> conditionally depend on an analysis under it.
>> There is LazyBlockFrequencyInfo.
>> Not sure how well it fits this idea.
>> Good point. LazyBlockFrequencyInfo seems usable with the old
>> pass manager (save unnecessary BFI/BPI) and would work for
>> function passes. I think the restriction still applies: a
>> loop pass still cannot request (outer-scope) BFI, lazy or
>> not, new or old (pass manager). Another assumption is that
>> it'd be cheap and safe to unconditionally depend on PSI or
>> check the module's profile summary.
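For reference, this is roughly how a *function* pass uses LazyBlockFrequencyInfo under the legacy pass manager (a sketch, untested; `MyFunctionPass` is hypothetical). BFI/BPI are only computed if `getBFI()` is actually called, so a non-PGO compile pays nothing:

```cpp
#include "llvm/Analysis/LazyBlockFrequencyInfo.h"
#include "llvm/IR/Module.h"

using namespace llvm;

void MyFunctionPass::getAnalysisUsage(AnalysisUsage &AU) const {
  // Registers the lazy BFI/BPI dependencies without forcing computation.
  LazyBlockFrequencyInfoPass::getLazyBFIAnalysisUsage(AU);
}

bool MyFunctionPass::runOnFunction(Function &F) {
  // Only touch BFI when the module actually carries profile data.
  if (F.getParent()->getModuleFlag("ProfileSummary")) {
    BlockFrequencyInfo &BFI =
        getAnalysis<LazyBlockFrequencyInfoPass>().getBFI();
    // ... use BFI ...
  }
  return false;
}
```

The laziness lives in `getBFI()`; the analysis-usage declaration is unconditional, which is what the old pass manager requires.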
>>> Footnotes:
>>> - We cannot call AnalysisManager::getResult for an
>>> outer scope, only getCachedResult, probably because
>>> of potential pipelining or concurrency issues.
>>> - For example, potentially breaking up multiple
>>> pipelined loop passes by inserting
>>> RequireAnalysisPass<BlockFrequencyAnalysis> in front of
>>> each of them.
>>> - For example, -fprofile-instr-use and
>>> -fprofile-sample-use aren’t present in the ThinLTO post link.
>>> - For example, we could check whether the module has
>>> the profile summary metadata annotated when building
>>> pass pipelines, but we don’t always pass the module down
>>> to the place where we build pass pipelines.
>>> - By inserting RequireAnalysisPass<ProfileSummaryInfo>
>>> after the PGOInstrumentationUse and the
>>> SampleProfileLoaderPass passes (and around the
>>> PGOIndirectCallPromotion pass for the ThinLTO post link).
>>> - For example, the context-sensitive PGO.
>>> - Directly calling its constructor along with the
>>> dependent analyses results, e.g. the jump threading pass.
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>