[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager

Fedor Sergeev via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 13 16:20:32 PDT 2019



On 3/14/19 2:04 AM, Hiroshi Yamauchi wrote:
>
>
> On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com 
> <mailto:fedor.sergeev at azul.com>> wrote:
>
>>
>>     - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass
>>     to be able to get to the ModuleAnalysisManager in one step and
>>     PSI through it.
>     This is just a compile-time optimization; it saves one
>     indirection through the FunctionAnalysisManager.
>     I'm not even sure if it is worth the effort. And definitely not
>     crucial for the overall idea.
>
>
> This should probably be clarified to something like:
>
> - Add a new proxy ModuleAnalysisManagerLoopProxy so that a loop pass can 
> get to the ModuleAnalysisManager and PSI, because it cannot always reach 
> them through the (const) FunctionAnalysisManager, 
> unless ModuleAnalysisManagerFunctionProxy is already cached.
>
> Since the FunctionAnalysisManager we can get from the LoopAnalysisManager 
> is a const ref, we cannot call getResult on it, so we cannot always get 
> the ModuleAnalysisManager and PSI (see below). This actually happens in my 
> experiment.
>
> SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>   Function &F = *L.getHeader()->getParent();
>   auto &FAM =
>       LAM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
>   auto *MAMProxy =
>       FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(F); // Can be null
Oh... well...

>   if (MAMProxy) {
>     auto &MAM = MAMProxy->getManager();
>     auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>   } else {
>     // Can't get MAM and PSI.
>   }
> ...
>
> ->
>
> SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>   Function &F = *L.getHeader()->getParent();
>   auto &MAM =
>       LAM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR).getManager(); // Not null
>   auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>   ...
>
>
> AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:
>
> /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
> typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
>                                   LoopStandardAnalysisResults &>
>     ModuleAnalysisManagerLoopProxy;
It also needs to be added to PassBuilder::crossRegisterProxies...
But yes, that appears to be a required action.
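For reference, a rough sketch of what that registration could look like,
assuming the typedef above (the existing proxies are registered the same way):

  void PassBuilder::crossRegisterProxies(LoopAnalysisManager &LAM,
                                         FunctionAnalysisManager &FAM,
                                         CGSCCAnalysisManager &CGAM,
                                         ModuleAnalysisManager &MAM) {
    // ... existing proxy registrations ...
    // New: let loop passes reach the ModuleAnalysisManager directly.
    LAM.registerPass([&] { return ModuleAnalysisManagerLoopProxy(MAM); });
  }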

regards,
   Fedor.
>
>
>
>     regards,
>       Fedor.
>>
>>
>>
>>
>>     On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev
>>     <fedor.sergeev at azul.com <mailto:fedor.sergeev at azul.com>> wrote:
>>
>>
>>
>>         On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
>>>
>>>
>>>         On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi
>>>         <yamauchi at google.com <mailto:yamauchi at google.com>> wrote:
>>>
>>>
>>>
>>>             On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev
>>>             <fedor.sergeev at azul.com <mailto:fedor.sergeev at azul.com>>
>>>             wrote:
>>>
>>>
>>>
>>>                 On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>>>>                 Here's a sketch of the proposed approach for just
>>>>                 one pass (but imagine more):
>>>>
>>>>                 https://reviews.llvm.org/D58845
>>>>
>>>>                 On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via
>>>>                 llvm-dev <llvm-dev at lists.llvm.org
>>>>                 <mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>
>>>>                     On 2/28/19 12:47 AM, Hiroshi Yamauchi via
>>>>                     llvm-dev wrote:
>>>>>                     Hi all,
>>>>>
>>>>>                     To implement more profile-guided
>>>>>                     optimizations, we’d like to use
>>>>>                     ProfileSummaryInfo (PSI) and
>>>>>                     BlockFrequencyInfo (BFI) from more passes of
>>>>>                     various types, under the new pass manager.
>>>>>
>>>>>                     The following is what we came up with. Would
>>>>>                     appreciate feedback. Thanks.
>>>>>
>>>>>                     Issue
>>>>>
>>>>>                     It’s not obvious (to me) how to best do this,
>>>>>                     given that we cannot request an outer-scope
>>>>>                     analysis result from an inner-scope pass
>>>>>                     through analysis managers [1] and that we
>>>>>                     might unnecessarily run some analyses
>>>>>                     unless we conditionally build pass pipelines
>>>>>                     for PGO cases.
>>>>                     Indeed, this is an intentional restriction in
>>>>                     the new pass manager, which more or less
>>>>                     reflects a fundamental property of the
>>>>                     outer-inner IRUnit relationship and of the
>>>>                     transformations/analyses run on those units.
>>>>                     The main intent of having inner IRUnits (e.g.
>>>>                     Loops) is to run local transformations and
>>>>                     save compile time by staying local to a
>>>>                     particular small piece of IR. The loop pass
>>>>                     manager lets you run a whole pipeline of
>>>>                     different transformations, still locally,
>>>>                     amplifying the savings.
>>>>                     As soon as you run a function-level analysis
>>>>                     from within the loop pipeline, you essentially
>>>>                     break this pipelining.
>>>>                     Say, as your loop transformation runs, it
>>>>                     modifies the loop (and the function) and
>>>>                     potentially invalidates the analysis, so you
>>>>                     have to rerun the analysis again and again.
>>>>                     Hence, instead of saving compile time, it ends
>>>>                     up increasing it.
>>>>
>>>>
>>>>                 Exactly.
>>>>
>>>>
>>>>                     I hit this issue somewhat recently with the
>>>>                     dependency of loop passes on
>>>>                     BranchProbabilityInfo (some loop passes, like
>>>>                     IRCE, can use it for profitability analysis).
>>>>
>>>>                     The only solution that appears to be reasonable
>>>>                     there is to teach all the loop passes that
>>>>                     need to be pipelined
>>>>                     to preserve BPI (or any other
>>>>                     module/function-level analyses) similar to how
>>>>                     they preserve DominatorTree and
>>>>                     other "LoopStandard" analyses.
>>>>
>>>>
>>>>                 Is this implemented - do the loop passes preserve BPI?
>>>                 Nope, not implemented right now.
>>>                 One of the problems is that even the loop
>>>                 canonicalization passes run at the start of the
>>>                 loop pass manager don't preserve it (and at least
>>>                 LoopSimplifyCFG does change control flow).
>>>>
>>>>                 In buildFunctionSimplificationPipeline
>>>>                 (where LoopFullUnrollPass is added as in the
>>>>                 sketch), LateLoopOptimizationsEPCallbacks
>>>>                 and LoopOptimizerEndEPCallbacks seem to allow some
>>>>                 arbitrary loop passes to be inserted into the
>>>>                 pipelines (via flags)?
>>>>
>>>>                 I wonder how hard it'd be to teach all the relevant
>>>>                 loop passes to preserve BFI (or BPI)...
>>>                 Well, each time you restructure control flow around
>>>                 the loops you will have to update those extra analyses,
>>>                 pretty much the same way as DT is being updated
>>>                 through DomTreeUpdater.
>>>                 The trick is to design a proper update interface
>>>                 (and then implement it ;) ).
>>>                 And I have not spent enough time on this issue to
>>>                 get a good idea of what that interface would be.
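To make that concrete, here is a minimal sketch (a hypothetical loop pass,
not an existing one) of what "preserving BFI" would mean for a loop pass
under the new PM, assuming the pass really does keep BFI/BPI up to date
alongside its CFG changes:

  PreservedAnalyses SomeLoopPass::run(Loop &L, LoopAnalysisManager &AM,
                                      LoopStandardAnalysisResults &AR,
                                      LPMUpdater &U) {
    // ... transform the loop, updating BFI/BPI with every CFG change ...
    auto PA = getLoopPassPreservedAnalyses();
    // Only legal if the pass actually kept these analyses up to date.
    PA.preserve<BranchProbabilityAnalysis>();
    PA.preserve<BlockFrequencyAnalysis>();
    return PA;
  }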
>>>
>>>
>>>             Hm, sounds non-trivial :) noting BFI depends on BPI.
>>>
>>>
>>>         To step back, it looks like:
>>>
>>>         want to use profiles from more passes -> need to get BFI
>>>         (from loop passes) -> need all the loop passes to preserve BFI.
>>>
>>>         I wonder if there's no way around this.
>>         Indeed. I believe this is the general consensus here.
>>
>>         regards,
>>           Fedor.
>>
>>>
>>>
>>>
>>>                 regards,
>>>                   Fedor.
>>>
>>>>
>>>>>                     It seems that for different types of passes
>>>>>                     to be able to get PSI and BFI, we'd need to
>>>>>                     ensure PSI is cached for a non-module pass,
>>>>>                     and that PSI, BFI and the
>>>>>                     ModuleAnalysisManager proxy are cached for a
>>>>>                     loop pass, in the pass pipelines. This may
>>>>>                     mean needing to insert BFI/PSI in front of
>>>>>                     many passes [2]. It's not obvious how to
>>>>>                     conditionally insert BFI for PGO pipelines,
>>>>>                     because there isn't always a good flag to
>>>>>                     detect PGO cases [3], and we tend to build
>>>>>                     pass pipelines before examining the code (or
>>>>>                     without propagating enough info down) [4].
>>>>>
>>>>>                     Proposed approach
>>>>>
>>>>>                     - Cache PSI right after the profile summary
>>>>>                     in the IR is written in the pass pipeline
>>>>>                     [5]. This would avoid the need to insert a
>>>>>                     RequireAnalysisPass for PSI before each
>>>>>                     non-module pass that needs it. PSI can
>>>>>                     technically be invalidated, but that is
>>>>>                     unlikely; if it is, we insert another
>>>>>                     RequireAnalysisPass [6]. (See sketch (a)
>>>>>                     after this list.)
>>>>>
>>>>>                     - Conditionally insert a RequireAnalysisPass
>>>>>                     for BFI, if PGO, right before each loop pass
>>>>>                     that needs it (see sketch (b) after this
>>>>>                     list). This doesn't seem avoidable because
>>>>>                     BFI can be invalidated whenever the CFG
>>>>>                     changes. We detect PGO based on the command
>>>>>                     line flags and/or whether the module has the
>>>>>                     profile summary info (we may need to pass the
>>>>>                     module down to more functions).
>>>>>
>>>>>                     - Add a new proxy
>>>>>                     ModuleAnalysisManagerLoopProxy for a loop pass
>>>>>                     to be able to get to the ModuleAnalysisManager
>>>>>                     in one step and PSI through it.
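Rough sketches of the first two bullets against PassBuilder (the exact
pipeline positions and the PGO condition are illustrative, not the actual
patch):

  // (a) Cache PSI right after the profile summary is written to the IR,
  //     i.e. right after PGOInstrumentationUse / SampleProfileLoaderPass:
  MPM.addPass(RequireAnalysisPass<ProfileSummaryAnalysis, Module>());

  // (b) Require BFI, only for PGO builds, right before a loop pass that
  //     wants it:
  if (PGOOpt) // or: check that the module has a profile summary
    FPM.addPass(RequireAnalysisPass<BlockFrequencyAnalysis, Function>());
  FPM.addPass(createFunctionToLoopPassAdaptor(LoopFullUnrollPass(/*OptLevel=*/2)));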
>>>>>
>>>>>                     Alternative approaches
>>>>>
>>>>>                     Dropping BFI and using PSI only
>>>>>                     We could consider not using BFI and relying
>>>>>                     solely on PSI and function-level profiles (as
>>>>>                     opposed to block-level), but profile
>>>>>                     precision would suffer.
>>>>>
>>>>>                     Computing BFI in-place
>>>>>                     We could consider computing BFI “in-place” by
>>>>>                     directly running BFI outside of the pass
>>>>>                     manager [7]. This would let us sidestep the
>>>>>                     analysis manager constraints, but it would
>>>>>                     still involve running an outer-scope analysis
>>>>>                     from an inner-scope pass and potentially cause
>>>>>                     problems in terms of pass pipelining and
>>>>>                     concurrency. Moreover, a potential downside of
>>>>>                     running analyses in-place is that it won’t
>>>>>                     take advantage of cached analysis results
>>>>>                     provided by the pass manager.
>>>>>
>>>>>                     Adding inner-scope versions of PSI and BFI
>>>>>                     We could consider adding a function-level and
>>>>>                     loop-level PSI and loop-level BFI, which
>>>>>                     internally act like their outer-scope versions
>>>>>                     but provide inner-scope results only. This
>>>>>                     way, we could always call getResult for PSI
>>>>>                     and BFI. However, this would still involve
>>>>>                     running an outer-scope analysis from an
>>>>>                     inner-scope pass.
>>>>>
>>>>>                     Caching the FAM and the MAM proxies
>>>>>                     We could consider caching the
>>>>>                     FunctionAnalysisManager and the
>>>>>                     ModuleAnalysisManager proxies once early on
>>>>>                     instead of adding a new proxy. But this seems
>>>>>                     unlikely to work well, because the analysis
>>>>>                     cache key type includes the function or the
>>>>>                     module, and some pass may add a new function
>>>>>                     for which the proxy wouldn't be cached. We'd
>>>>>                     need to write and insert a pass in select
>>>>>                     locations just to fill the cache. Adding the
>>>>>                     new proxy would take care of this with a
>>>>>                     three-line change.
>>>>>
>>>>>                     Conditional BFI
>>>>>                     We could consider adding a conditional BFI
>>>>>                     analysis that is a wrapper around BFI and
>>>>>                     computes BFI only if profiles are available
>>>>>                     (either by checking that the module has a
>>>>>                     profile summary or by depending on PSI). With
>>>>>                     this, we wouldn't need to conditionally build
>>>>>                     pass pipelines, and it may work for the new
>>>>>                     pass manager. But a similar approach wouldn't
>>>>>                     work for the old pass manager because we
>>>>>                     cannot conditionally depend on an analysis
>>>>>                     under it.
>>>>                     There is LazyBlockFrequencyInfo.
>>>>                     Not sure how well it fits this idea.
>>>>
>>>>
>>>>                 Good point. LazyBlockFrequencyInfo seems usable
>>>>                 with the old pass manager (to save unnecessary
>>>>                 BFI/BPI computation) and would work for function
>>>>                 passes. I think the restriction still applies -
>>>>                 a loop pass still cannot request (outer-scope)
>>>>                 BFI, lazy or not, new or old (pass manager).
>>>>                 Another assumption is that it'd be cheap and safe
>>>>                 to unconditionally depend on PSI or check the
>>>>                 module's profile summary.
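For reference, the usual legacy-PM pattern for consuming it looks roughly
like this (a hypothetical function pass; gating on PSI's profile summary is
just one way to decide whether to ask for BFI at all):

  void SomePass::getAnalysisUsage(AnalysisUsage &AU) const {
    AU.addRequired<ProfileSummaryInfoWrapperPass>();
    LazyBlockFrequencyInfoPass::getLazyBFIAnalysisUsage(AU); // lazy BPI+BFI
  }

  bool SomePass::runOnFunction(Function &F) {
    auto &PSI = getAnalysis<ProfileSummaryInfoWrapperPass>().getPSI();
    BlockFrequencyInfo *BFI = nullptr;
    if (PSI.hasProfileSummary())
      // Only now are BPI/BFI actually computed.
      BFI = &getAnalysis<LazyBlockFrequencyInfoPass>().getBFI();
    // ... use BFI if non-null ...
    return false;
  }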
>>>>
>>>>
>>>>                     regards,
>>>>                       Fedor.
>>>>
>>>>>
>>>>>
>>>>>                     [1] We cannot call AnalysisManager::getResult
>>>>>                     for an outer scope but only getCachedResult.
>>>>>                     Probably because of potential pipelining or
>>>>>                     concurrency issues.
>>>>>                     [2] For example, potentially breaking up
>>>>>                     multiple pipelined loop passes and inserting
>>>>>                     RequireAnalysisPass<BlockFrequencyAnalysis> in
>>>>>                     front of each of them.
>>>>>                     [3] For example, -fprofile-instr-use and
>>>>>                     -fprofile-sample-use aren’t present in ThinLTO
>>>>>                     post link builds.
>>>>>                     [4] For example, we could check whether the
>>>>>                     module has the profile summary metadata
>>>>>                     annotated when building pass pipelines but we
>>>>>                     don’t always pass the module down to the place
>>>>>                     where we build pass pipelines.
>>>>>                     [5] By inserting
>>>>>                     RequireAnalysisPass<ProfileSummaryAnalysis>
>>>>>                     after the PGOInstrumentationUse and
>>>>>                     SampleProfileLoaderPass passes (and around
>>>>>                     the PGOIndirectCallPromotion pass for the
>>>>>                     ThinLTO post-link pipeline).
>>>>>                     [6] For example, the context-sensitive PGO.
>>>>>                     [7] Directly calling its constructor along
>>>>>                     with the dependent analyses' results, as
>>>>>                     e.g. the jump threading pass does.
>>>>>
>>>>
>>>
>>
>
