[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Fedor Sergeev via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 13 16:20:32 PDT 2019
On 3/14/19 2:04 AM, Hiroshi Yamauchi wrote:
>
>
> On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com
> <mailto:fedor.sergeev at azul.com>> wrote:
>
>>
>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass
>> to be able to get to the ModuleAnalysisManager in one step and
>> PSI through it.
> This is just a compile-time optimization, saving one
> indirection through FunctionAnalysisManager.
> I'm not even sure if it is worth the effort. And definitely not
> crucial for the overall idea.
>
>
> This should probably be clarified to something like:
>
> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
> able to get to the ModuleAnalysisManager and PSI, because it cannot
> always reach them through the (const) FunctionAnalysisManager,
> unless ModuleAnalysisManagerFunctionProxy is already cached.
>
> Since the FunctionAnalysisManager we can get from the LoopAnalysisManager
> is a const reference, we cannot call getResult on it, so we cannot always
> get the ModuleAnalysisManager and PSI (see below). This actually happened
> in my experiment.
>
> SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>   auto &FAM = LAM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
>   auto *MAMProxy = FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(
>       *L.getHeader()->getParent()); // Can be null
Oh... well...
>   if (MAMProxy) {
>     auto &MAM = MAMProxy->getManager();
>     auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*L.getHeader()->getModule());
>   } else {
>     // Can't get MAM and PSI.
>   }
> ...
>
> ->
>
> SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
>   auto &MAM = LAM.getResult<ModuleAnalysisManagerLoopProxy>(L, AR).getManager(); // Not null
>   auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*L.getHeader()->getModule());
>   ...
>
>
> AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:
>
> /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
> typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
>                                   LoopStandardAnalysisResults &>
>     ModuleAnalysisManagerLoopProxy;
It also needs to be added to PassBuilder::crossRegisterProxies...
But yes, that appears to be a required action.
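
Roughly a one-line registration next to the existing proxies (a sketch,
assuming the current registerPass pattern in PassBuilder::crossRegisterProxies,
where LAM and MAM are its loop/module analysis manager parameters):

  LAM.registerPass([&] { return ModuleAnalysisManagerLoopProxy(MAM); });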
regards,
Fedor.
>
>
>
> regards,
> Fedor.
>>
>>
>>
>>
>> On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev
>> <fedor.sergeev at azul.com <mailto:fedor.sergeev at azul.com>> wrote:
>>
>>
>>
>> On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
>>>
>>>
>>> On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi
>>> <yamauchi at google.com <mailto:yamauchi at google.com>> wrote:
>>>
>>>
>>>
>>> On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev
>>> <fedor.sergeev at azul.com <mailto:fedor.sergeev at azul.com>>
>>> wrote:
>>>
>>>
>>>
>>> On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>>>> Here's a sketch of the proposed approach for just
>>>> one pass (but imagine more):
>>>>
>>>> https://reviews.llvm.org/D58845
>>>>
>>>> On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via
>>>> llvm-dev <llvm-dev at lists.llvm.org
>>>> <mailto:llvm-dev at lists.llvm.org>> wrote:
>>>>
>>>> On 2/28/19 12:47 AM, Hiroshi Yamauchi via
>>>> llvm-dev wrote:
>>>>> Hi all,
>>>>>
>>>>> To implement more profile-guided
>>>>> optimizations, we’d like to use
>>>>> ProfileSummaryInfo (PSI) and
>>>>> BlockFrequencyInfo (BFI) from more passes of
>>>>> various types, under the new pass manager.
>>>>>
>>>>> The following is what we came up with. Would
>>>>> appreciate feedback. Thanks.
>>>>>
>>>>> Issue
>>>>>
>>>>> It's not obvious (to me) how best to do
>>>>> this, given that we cannot request an
>>>>> outer-scope analysis result from an
>>>>> inner-scope pass through analysis managers
>>>>> [1] and that we might unnecessarily run some
>>>>> analyses unless we conditionally build pass
>>>>> pipelines for PGO cases.
>>>> Indeed, this is an intentional restriction in
>>>> the new pass manager, which is more or less a
>>>> reflection of a fundamental property of the
>>>> outer-inner IRUnit relationship and of the
>>>> transformations/analyses run on those units.
>>>> The main intent of having those inner IRUnits
>>>> (e.g. Loops) is to run local transformations
>>>> and save compile time by staying local to a
>>>> particular small piece of IR. The loop pass
>>>> manager allows you to run a whole pipeline of
>>>> different transformations, still locally,
>>>> amplifying the savings.
>>>> As soon as you run a function-level analysis
>>>> from within the loop pipeline you essentially
>>>> break this pipelining.
>>>> Say, as your loop transformation runs it
>>>> modifies the loop (and the function) and
>>>> potentially invalidates the analysis, so you
>>>> have to rerun the analysis again and again.
>>>> Hence instead of saving compile time it ends
>>>> up increasing it.
>>>>
>>>>
>>>> Exactly.
>>>>
>>>>
>>>> I have hit this issue somewhat recently with the
>>>> dependency of loop passes on BranchProbabilityInfo
>>>> (some loop passes, like IRCE, can use it for
>>>> profitability analysis).
>>>>
>>>> The only solution that appears to be reasonable
>>>> there is to teach all the loop passes that need
>>>> to be pipelined to preserve BPI (or any other
>>>> module/function-level analyses), similar to how
>>>> they preserve DominatorTree and other
>>>> "LoopStandard" analyses.
>>>>
>>>>
>>>> Is this implemented - do the loop passes preserve BPI?
>>> Nope, not implemented right now.
>>> One of the problems is that even the loop
>>> canonicalization passes run at the start of the
>>> loop pass manager don't preserve it
>>> (and at least LoopSimplifyCFG does change control flow).
>>>>
>>>> In buildFunctionSimplificationPipeline
>>>> (where LoopFullUnrollPass is added as in the
>>>> sketch), LateLoopOptimizationsEPCallbacks
>>>> and LoopOptimizerEndEPCallbacks seem to allow some
>>>> arbitrary loop passes to be inserted into the
>>>> pipelines (via flags)?
>>>>
>>>> I wonder how hard it'd be to teach all the relevant
>>>> loop passes to preserve BFI(or BPI)..
>>> Well, each time you restructure control flow around
>>> the loops you will have to update those extra analyses,
>>> pretty much the same way as DT is being updated
>>> through DomTreeUpdater.
>>> The trick is to design a proper update interface
>>> (and then implement it ;) ).
>>> And I have not spent enough time on this issue to
>>> get a good idea of what that interface would be.
>>>
>>>
>>> Hm, sounds non-trivial :) (noting that BFI depends on BPI).
>>>
>>>
>>> To step back, it looks like:
>>>
>>> want to use profiles from more passes -> need to get BFI
>>> (from loop passes) -> need all the loop passes to preserve BFI.
>>>
>>> I wonder if there's really no way around this.
>> Indeed. I believe this is a general consensus here.
>>
>> regards,
>> Fedor.
>>
>>>
>>>
>>>
>>> regards,
>>> Fedor.
>>>
>>>>
>>>>> It seems that for different types of passes to
>>>>> be able to get PSI and BFI, we'd need to
>>>>> ensure PSI is cached for a non-module pass,
>>>>> and that PSI, BFI and the ModuleAnalysisManager
>>>>> proxy are cached for a loop pass, in the pass
>>>>> pipelines. This may mean needing to insert
>>>>> BFI/PSI in front of many passes [2].
>>>>> It's not obvious how to conditionally
>>>>> insert BFI for PGO pipelines, because there
>>>>> isn't always a good flag to detect PGO cases
>>>>> [3], and we tend to build pass pipelines before
>>>>> examining the code (or without propagating
>>>>> enough info down) [4].
>>>>>
>>>>> Proposed approach
>>>>>
>>>>> - Cache PSI right after the profile summary in
>>>>> the IR is written in the pass pipeline [5].
>>>>> This would avoid the need to insert a
>>>>> RequireAnalysisPass for PSI before each
>>>>> non-module pass that needs it. PSI can
>>>>> technically be invalidated, but that is
>>>>> unlikely; if it is, we insert another
>>>>> RequireAnalysisPass [6].
>>>>>
>>>>> - Conditionally insert a RequireAnalysisPass
>>>>> for BFI, if PGO, right before each loop pass
>>>>> that needs it (see the sketch after this
>>>>> list). This doesn't seem avoidable because
>>>>> BFI can be invalidated whenever the CFG
>>>>> changes. We detect PGO based on the command
>>>>> line flags and/or whether the module has the
>>>>> profile summary info (we may need to pass the
>>>>> module to more functions).
>>>>>
>>>>> - Add a new proxy
>>>>> ModuleAnalysisManagerLoopProxy for a loop pass
>>>>> to be able to get to the ModuleAnalysisManager
>>>>> in one step and PSI through it.
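>>>>>
>>>>> For the first two items, the pipeline-building side might look
>>>>> roughly like the sketch below (the IsPGO flag, its placement, and
>>>>> the choice of LoopFullUnrollPass as the example loop pass are all
>>>>> illustrative, not the actual patch):
>>>>>
>>>>> // Force PSI into the cache once the profile has been loaded.
>>>>> MPM.addPass(RequireAnalysisPass<ProfileSummaryAnalysis, Module>());
>>>>> ...
>>>>> // Only in PGO builds, require BFI right before a loop pass that
>>>>> // wants block-level profile data.
>>>>> if (IsPGO)
>>>>>   FPM.addPass(RequireAnalysisPass<BlockFrequencyAnalysis, Function>());
>>>>> FPM.addPass(createFunctionToLoopPassAdaptor(LoopFullUnrollPass(Level)));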
>>>>>
>>>>> Alternative approaches
>>>>>
>>>>> Dropping BFI and using PSI only
>>>>> We could consider not using BFI and relying
>>>>> solely on PSI and function-level profiles
>>>>> (as opposed to block-level), but profile
>>>>> precision would suffer.
>>>>>
>>>>> Computing BFI in-place
>>>>> We could consider computing BFI “in-place” by
>>>>> directly running BFI outside of the pass
>>>>> manager [7]. This would let us sidestep the
>>>>> analysis manager constraints, but it would
>>>>> still involve running an outer-scope analysis
>>>>> from an inner-scope pass and potentially cause
>>>>> problems in terms of pass pipelining and
>>>>> concurrency. Moreover, a potential downside of
>>>>> running analyses in-place is that it won’t
>>>>> take advantage of cached analysis results
>>>>> provided by the pass manager.
>>>>>
>>>>> Adding inner-scope versions of PSI and BFI
>>>>> We could consider adding a function-level and
>>>>> loop-level PSI and loop-level BFI, which
>>>>> internally act like their outer-scope versions
>>>>> but provide inner-scope results only. This
>>>>> way, we could always call getResult for PSI
>>>>> and BFI. However, this would still involve
>>>>> running an outer-scope analysis from an
>>>>> inner-scope pass.
>>>>>
>>>>> Caching the FAM and the MAM proxies
>>>>> We could consider caching the
>>>>> FunctionAnalysisManager and the
>>>>> ModuleAnalysisManager proxies once early on
>>>>> instead of adding a new proxy. But that seems
>>>>> unlikely to work well, because the analysis
>>>>> cache key type includes the function or the
>>>>> module, and some pass may add a new function
>>>>> for which the proxy wouldn't be cached. We'd
>>>>> need to write and insert a pass in select
>>>>> locations just to fill the cache. Adding the
>>>>> new proxy would take care of these with a
>>>>> three-line change.
>>>>>
>>>>> Conditional BFI
>>>>> We could consider adding a conditional BFI
>>>>> analysis that is a wrapper around BFI and
>>>>> computes BFI only if profiles are available
>>>>> (either by checking that the module has a
>>>>> profile summary or by depending on PSI). With
>>>>> this, we wouldn't need to conditionally build
>>>>> pass pipelines, and it may work for the new
>>>>> pass manager. But a similar approach wouldn't
>>>>> work for the old pass manager because we
>>>>> cannot conditionally depend on an analysis
>>>>> under it.
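>>>>>
>>>>> As a sketch of the idea, hand-rolled inside a function pass's run
>>>>> method (F and AM are the usual run() parameters; a real wrapper
>>>>> analysis would package this up):
>>>>>
>>>>> auto &MAM = AM.getResult<ModuleAnalysisManagerFunctionProxy>(F).getManager();
>>>>> auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
>>>>> BlockFrequencyInfo *BFI = nullptr;
>>>>> if (PSI && PSI->hasProfileSummary())
>>>>>   BFI = &AM.getResult<BlockFrequencyAnalysis>(F); // only computed for PGO builds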
>>>> There is LazyBlockFrequencyInfo.
>>>> Not sure how well it fits this idea.
>>>>
>>>>
>>>> Good point. LazyBlockFrequencyInfo seems usable
>>>> with the old pass manager (saving unnecessary
>>>> BFI/BPI computation) and would work for function
>>>> passes. I think the restriction still applies - a
>>>> loop pass still cannot request (outer-scope) BFI,
>>>> lazy or not, new or old pass manager. Another
>>>> assumption is that it'd be cheap and safe to
>>>> unconditionally depend on PSI or to check the
>>>> module's profile summary.
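>>>>
>>>> For reference, with the old pass manager the lazy form is used roughly
>>>> like this (a sketch of the existing LazyBlockFrequencyInfoPass API):
>>>>
>>>> void getAnalysisUsage(AnalysisUsage &AU) const override {
>>>>   LazyBlockFrequencyInfoPass::getLazyBFIAnalysisUsage(AU);
>>>>   // ... plus whatever else the pass requires ...
>>>> }
>>>> // In runOnFunction, BFI/BPI are only computed when actually queried:
>>>> BlockFrequencyInfo &BFI = getAnalysis<LazyBlockFrequencyInfoPass>().getBFI();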
>>>>
>>>>
>>>> regards,
>>>> Fedor.
>>>>
>>>>>
>>>>>
>>>>> [1] We cannot call AnalysisManager::getResult
>>>>> for an outer scope but only getCachedResult.
>>>>> Probably because of potential pipelining or
>>>>> concurrency issues.
>>>>> [2] For example, potentially breaking up
>>>>> multiple pipelined loop passes and inserting
>>>>> RequireAnalysisPass<BlockFrequencyAnalysis> in
>>>>> front of each of them.
>>>>> [3] For example, -fprofile-instr-use and
>>>>> -fprofile-sample-use aren’t present in ThinLTO
>>>>> post link builds.
>>>>> [4] For example, we could check whether the
>>>>> module has the profile summary metadata
>>>>> annotated when building pass pipelines but we
>>>>> don’t always pass the module down to the place
>>>>> where we build pass pipelines.
>>>>> [5] By inserting
>>>>> RequireAnalysisPass<ProfileSummaryAnalysis, Module>
>>>>> after the PGOInstrumentationUse and the
>>>>> SampleProfileLoaderPass passes (and around the
>>>>> PGOIndirectCallPromotion pass for the ThinLTO
>>>>> post-link pipeline).
>>>>> [6] For example, the context-sensitive PGO.
>>>>> [7] Directly calling its constructor along
>>>>> with the dependent analysis results, e.g. the
>>>>> jump threading pass.
>>>>>
>>>>
>>>
>>
>