[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager

Mon Mar 11 10:12:18 PDT 2019

Here's a revised approach based on the discussion:

- Cache PSI right after the profile summary in the IR is written in the
pass pipeline. This would avoid the need to insert RequireAnalysisPass for
PSI before each non-module pass that needs it. PSI can be technically
invalidated but unlikely as PSI is immutable. If it does, we can insert
another RequireAnalysisPass.

- If PGO, conditionally request BFI from the passes that need it. For
(pipelined) loop passes, we need to insert a pass that computes BFI
conditionally (if PGO) in front of them and make them preserve BFI through.
This is to avoid pipeline interruptions and potential
invalidation/recomputation
of BFI between the loop passes. We detect PGO based on whether PSI has
profile summary info. (For the old pass manager, implement a similar
approach by using LazyBlockFrequencyInfo.)

- Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be able
to get to the ModuleAnalysisManager in one step and PSI through it.

On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev <fedor.sergeev at azul.com> wrote:

>
>
> On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
>
>
>
> On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi <yamauchi at google.com>
> wrote:
>
>>
>>
>> On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev <fedor.sergeev at azul.com>
>> wrote:
>>
>>>
>>>
>>> On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>>>
>>> Here's a sketch of the proposed approach for just one pass (but imagine
>>> more)
>>>
>>> https://reviews.llvm.org/D58845
>>>
>>> On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> On 2/28/19 12:47 AM, Hiroshi Yamauchi via llvm-dev wrote:
>>>>
>>>> Hi all,
>>>>
>>>> To implement more profile-guided optimizations, we’d like to use
>>>> ProfileSummaryInfo (PSI) and BlockFrequencyInfo (BFI) from more passes of
>>>> various types, under the new pass manager.
>>>>
>>>> The following is what we came up with. Would appreciate feedback.
>>>> Thanks.
>>>>
>>>> Issue
>>>>
>>>> It’s not obvious (to me) how to best do this, given that we cannot
>>>> request an outer-scope analysis result from an inner-scope pass through
>>>> analysis managers [1] and that we might unnecessarily running some analyses
>>>> unless we conditionally build pass pipelines for PGO cases.
>>>>
>>>> Indeed, this is an intentional restriction in new pass manager, which
>>>> is more or less a reflection of a fundamental property of outer-inner
>>>> IRUnit relationship
>>>> and transformations/analyses run on those units. The main intent for
>>>> having those inner IRUnits (e.g. Loops) is to run local transformations and
>>>> save compile time
>>>> on being local to a particular small piece of IR. Loop Pass manager
>>>> allows you to run a whole pipeline of different transformations still
>>>> locally, amplifying the save.
>>>> As soon as you run function-level analysis from within the loop
>>>> pipeline you essentially break this pipelining.
>>>> Say, as you run your loop transformation it modifies the loop (and the
>>>> function) and potentially invalidates the analysis,
>>>> so you have to rerun your analysis again and again. Hence instead of
>>>> saving on compile time it ends up increasing it.
>>>>
>>>
>>> Exactly.
>>>
>>>
>>>> I have hit this issue somewhat recently with dependency of loop passes
>>>> on BranchProbabilityInfo.
>>>> (some loop passes, like IRCE can use it for profitability analysis).
>>>>
>>> The only solution that appears to be reasonable there is to teach all
>>>> the loops passes that need to be pipelined
>>>> to preserve BPI (or any other module/function-level analyses) similar
>>>> to how they preserve DominatorTree and
>>>> other "LoopStandard" analyses.
>>>>
>>>
>>> Is this implemented - do the loop passes preserve BPI?
>>>
>>> Nope, not implemented right now.
>>> One of the problems is that even loop canonicalization passes run at the
>>> start of loop pass manager dont preserve it
>>> (and at least LoopSimplifyCFG does change control flow).
>>>
>>>
>>> In buildFunctionSimplificationPipeline (where LoopFullUnrollPass is
>>> added as in the sketch), LateLoopOptimizationsEPCallbacks
>>> and LoopOptimizerEndEPCallbacks seem to allow some arbitrary loop passes to
>>> be inserted into the pipelines (via flags)?
>>>
>>> I wonder how hard it'd be to teach all the relevant loop passes to
>>> preserve BFI (or BPI)..
>>>
>>> Well, each time you restructure control flow around the loops you will
>>> have to update those extra analyses,
>>> pretty much the same way as DT is being updated through DomTreeUpdater.
>>> The trick is to design a proper update interface (and then implement it
>>> ;) ).
>>> And I have not spent enough time on this issue to get a good idea of
>>> what that interface would be.
>>>
>>
>> Hm, sounds non-trivial :) noting BFI depends on BPI.
>>
>
> To step back, it looks like:
>
> want to use profiles from more passes -> need to get BFI (from loop
> passes) -> need all the loop passes to preserve BFI.
>
> I wonder if there's no way around this.
>
> Indeed. I believe this is a general consensus here.
>
> regards,
>   Fedor.
>
>
>
>>
>>> regards,
>>>   Fedor.
>>>
>>>
>>>> It seems that for different types of passes to be able to get PSI and
>>>> BFI, we’d need to ensure PSI is cached for a non-module pass, and PSI, BFI
>>>> and the ModuleAnalysisManager proxy are cached for a loop pass in the pass
>>>> pipelines. This may mean potentially needing to insert BFI/PSI in front of
>>>> many passes [2]. It seems not obvious how to conditionally insert BFI for
>>>> PGO pipelines because there isn’t always a good flag to detect PGO cases
>>>> [3] or we tend to build pass pipelines before examining the code (or
>>>> without propagating enough info down) [4].
>>>>
>>>> Proposed approach
>>>>
>>>> - Cache PSI right after the profile summary in the IR is written in the
>>>> pass pipeline [5]. This would avoid the need to insert RequiredAnalysisPass
>>>> for PSI before each non-module pass that needs it. PSI can be technically
>>>> invalidated but unlikely. If it does, we insert another RequiredAnalysisPass
>>>>  [6].
>>>>
>>>> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
>>>> before each loop pass that needs it. This doesn't seem avoidable because
>>>> BFI can be invalidated whenever the CFG changes. We detect PGO based on the
>>>> command line flags and/or whether the module has the profile summary
>>>> info (we may need to pass the module to more functions.)
>>>>
>>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>>>> able to get to the ModuleAnalysisManager in one step and PSI through it.
>>>>
>>>> Alternative approaches
>>>>
>>>> Dropping BFI and use PSI only
>>>> We could consider not using BFI and solely relying on PSI and
>>>> function-level profiles only (as opposed to block-level), but profile
>>>> precision would suffer.
>>>>
>>>> Computing BFI in-place
>>>> We could consider computing BFI “in-place” by directly running BFI
>>>> outside of the pass manager [7]. This would let us avoid using the analysis
>>>> manager constraints but it would still involve running an outer-scope
>>>> analysis from an inner-scope pass and potentially cause problems in terms
>>>> of pass pipelining and concurrency. Moreover, a potential downside of
>>>> running analyses in-place is that it won’t take advantage of cached
>>>> analysis results provided by the pass manager.
>>>>
>>>> Adding inner-scope versions of PSI and BFI
>>>> We could consider adding a function-level and loop-level PSI and
>>>> loop-level BFI, which internally act like their outer-scope versions but
>>>> provide inner-scope results only. This way, we could always call getResult
>>>> for PSI and BFI. However, this would still involve running an outer-scope
>>>> analysis from an inner-scope pass.
>>>>
>>>> Caching the FAM and the MAM proxies
>>>> We could consider caching the FunctionalAnalysisManager and the
>>>> ModuleAnalysisManager proxies once early on instead of adding a new proxy.
>>>> But it seems to not likely work well because the analysis cache key type
>>>> includes the function or the module and some pass may add a new function
>>>> for which the proxy wouldn’t be cached. We’d need to write and insert a
>>>> pass in select locations to just fill the cache. Adding the new proxy would
>>>> take care of these with a three-line change.
>>>>
>>>> Conditional BFI
>>>> We could consider adding a conditional BFI analysis that is a wrapper
>>>> around BFI and computes BFI only if profiles are available (either checking
>>>> the module has profile summary or depend on the PSI.) With this, we
>>>> wouldn’t need to conditionally build pass pipelines and may work for the
>>>> new pass manager. But a similar wouldn’t work for the old pass manager
>>>> because we cannot conditionally depend on an analysis under it.
>>>>
>>>> There is LazyBlockFrequencyInfo.
>>>> Not sure how well it fits this idea.
>>>>
>>>
>>> Good point. LazyBlockFrequencyInfo seems usable with the old pass
>>> manager (save unnecessary BFI/BPI) and would work for function passes. I
>>> think the restriction still applies - a loop pass cannot still request
>>> (outer-scope) BFI, lazy or not, new or old (pass manager). Another
>>> assumption is that it'd be cheap and safe to unconditionally depend on
>>> PSI or check the module's profile summary.
>>>
>>>
>>>> regards,
>>>>   Fedor.
>>>>
>>>>
>>>>
>>>> [1] We cannot call AnalysisManager::getResult for an outer scope but
>>>> only getCachedResult. Probably because of potential pipelining or
>>>> concurrency issues.
>>>> [2] For example, potentially breaking up multiple pipelined loop passes
>>>> and insert RequireAnalysisPass<BlockFrequencyAnalysis> in front of each of
>>>> them.
>>>> [3] For example, -fprofile-instr-use and -fprofile-sample-use aren’t
>>>> present in ThinLTO post link builds.
>>>> [4] For example, we could check whether the module has the profile
>>>> summary metadata annotated when building pass pipelines but we don’t always
>>>> pass the module down to the place where we build pass pipelines.
>>>> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
>>>> PGOInstrumentationUse and the SampleProfileLoaderPass passes (and around
>>>> the PGOIndirectCallPromotion pass for the Thin LTO post link pipeline.)
>>>> [6] For example, the context-sensitive PGO.
>>>> [7] Directly calling its constructor along with the dependent analyses
>>>> results, eg. the jump threading pass.
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing listllvm-dev at lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>
>>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190311/dcb6b6ff/attachment-0001.html>