[llvm-dev] RFC: Getting ProfileSummaryInfo and BlockFrequencyInfo from various types of passes under the new pass manager
Hiroshi Yamauchi via llvm-dev
llvm-dev at lists.llvm.org
Wed Mar 13 16:52:39 PDT 2019
On Wed, Mar 13, 2019 at 4:20 PM Fedor Sergeev <fedor.sergeev at azul.com>
wrote:
>
>
> On 3/14/19 2:04 AM, Hiroshi Yamauchi wrote:
>
>
>
> On Wed, Mar 13, 2019 at 2:37 PM Fedor Sergeev <fedor.sergeev at azul.com>
> wrote:
>
>>
>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
>> able to get to the ModuleAnalysisManager in one step and PSI through it.
>>
>> This is just an optimization of compile-time, saves one indirection
>> through FunctionAnalysisManager.
>> I'm not even sure if it is worth the effort. And definitely not crucial
>> for the overall idea.
>>
>
> This should probably be clarified to something like:
>
> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to be
> able to get to the ModuleAnalysisManager and PSI because it may not always
> through (const) FunctionAnalysisManager, unless ModuleAnalysisManagerFunctionProxy
> is already cached.
>
> Since FunctionAnalysisManager we can get from LoopAnalysisManager is a
> const ref, we cannot call getResult on it and always get
> ModuleAnalysisManager and PSI (see below.) This actually happens in my
> experiment.
>
> SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
> auto &FAM = LAM.getResult<FunctionAnalysisManagerLoopProxy>(L,
> AR).getManager();
> auto *MAMProxy = FAM.getCachedResult<ModuleAnalysisManagerFunctionProxy>(
> L.getHeader()->getParent()); *// Can be null*
>
> Oh... well...
>
> If (MAMProxy) {
> auto &MAM = MAMProxy->getManager();
> auto *PSI =
> MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
> } else {
> *// Can't get MAM and PSI.*
> }
> ...
>
> ->
>
> SomeLoopPass::run(Loop &L, LoopAnalysisManager &LAM, …) {
> auto &MAM = LAM.getResult<ModuleAnalysisManagerLoopProxy>(L,
> AR).getManager(); *// Not null*
> auto *PSI = MAM.getCachedResult<ProfileSummaryAnalysis>(*F.getParent());
> ...
>
>
> AFAICT, adding ModuleAnalysisManagerLoopProxy seems to be as simple as:
>
> /// A proxy from a \c ModuleAnalysisManager to a \c Loop.
> typedef OuterAnalysisManagerProxy<ModuleAnalysisManager, Loop,
> LoopStandardAnalysisResults &>
> ModuleAnalysisManagerLoopProxy;
>
> It also needs to be added to PassBuilder::crossRegisterProxies...
> But yes, that appears to be a required action.
>
Yes, this line:
LAM.registerPass([&] { return ModuleAnalysisManagerLoopProxy(MAM); });
> regards,
> Fedor.
>
>
>
>
>
>>
>> regards,
>> Fedor.
>>
>>
>>
>>
>>
>> On Mon, Mar 4, 2019 at 2:05 PM Fedor Sergeev <fedor.sergeev at azul.com>
>> wrote:
>>
>>>
>>>
>>> On 3/4/19 10:49 PM, Hiroshi Yamauchi wrote:
>>>
>>>
>>>
>>> On Mon, Mar 4, 2019 at 10:55 AM Hiroshi Yamauchi <yamauchi at google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Sat, Mar 2, 2019 at 12:58 AM Fedor Sergeev <fedor.sergeev at azul.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On 3/2/19 2:38 AM, Hiroshi Yamauchi wrote:
>>>>>
>>>>> Here's a sketch of the proposed approach for just one pass (but
>>>>> imagine more)
>>>>>
>>>>> https://reviews.llvm.org/D58845
>>>>>
>>>>> On Fri, Mar 1, 2019 at 12:54 PM Fedor Sergeev via llvm-dev <
>>>>> llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> On 2/28/19 12:47 AM, Hiroshi Yamauchi via llvm-dev wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> To implement more profile-guided optimizations, we’d like to use
>>>>>> ProfileSummaryInfo (PSI) and BlockFrequencyInfo (BFI) from more passes of
>>>>>> various types, under the new pass manager.
>>>>>>
>>>>>> The following is what we came up with. Would appreciate feedback.
>>>>>> Thanks.
>>>>>>
>>>>>> Issue
>>>>>>
>>>>>> It’s not obvious (to me) how to best do this, given that we cannot
>>>>>> request an outer-scope analysis result from an inner-scope pass through
>>>>>> analysis managers [1] and that we might unnecessarily running some analyses
>>>>>> unless we conditionally build pass pipelines for PGO cases.
>>>>>>
>>>>>> Indeed, this is an intentional restriction in new pass manager, which
>>>>>> is more or less a reflection of a fundamental property of outer-inner
>>>>>> IRUnit relationship
>>>>>> and transformations/analyses run on those units. The main intent for
>>>>>> having those inner IRUnits (e.g. Loops) is to run local transformations and
>>>>>> save compile time
>>>>>> on being local to a particular small piece of IR. Loop Pass manager
>>>>>> allows you to run a whole pipeline of different transformations still
>>>>>> locally, amplifying the save.
>>>>>> As soon as you run function-level analysis from within the loop
>>>>>> pipeline you essentially break this pipelining.
>>>>>> Say, as you run your loop transformation it modifies the loop (and
>>>>>> the function) and potentially invalidates the analysis,
>>>>>> so you have to rerun your analysis again and again. Hence instead of
>>>>>> saving on compile time it ends up increasing it.
>>>>>>
>>>>>
>>>>> Exactly.
>>>>>
>>>>>
>>>>>> I have hit this issue somewhat recently with dependency of loop
>>>>>> passes on BranchProbabilityInfo.
>>>>>> (some loop passes, like IRCE can use it for profitability analysis).
>>>>>>
>>>>> The only solution that appears to be reasonable there is to teach all
>>>>>> the loops passes that need to be pipelined
>>>>>> to preserve BPI (or any other module/function-level analyses) similar
>>>>>> to how they preserve DominatorTree and
>>>>>> other "LoopStandard" analyses.
>>>>>>
>>>>>
>>>>> Is this implemented - do the loop passes preserve BPI?
>>>>>
>>>>> Nope, not implemented right now.
>>>>> One of the problems is that even loop canonicalization passes run at
>>>>> the start of loop pass manager dont preserve it
>>>>> (and at least LoopSimplifyCFG does change control flow).
>>>>>
>>>>>
>>>>> In buildFunctionSimplificationPipeline (where LoopFullUnrollPass is
>>>>> added as in the sketch), LateLoopOptimizationsEPCallbacks
>>>>> and LoopOptimizerEndEPCallbacks seem to allow some arbitrary loop passes to
>>>>> be inserted into the pipelines (via flags)?
>>>>>
>>>>> I wonder how hard it'd be to teach all the relevant loop passes to
>>>>> preserve BFI (or BPI)..
>>>>>
>>>>> Well, each time you restructure control flow around the loops you will
>>>>> have to update those extra analyses,
>>>>> pretty much the same way as DT is being updated through DomTreeUpdater.
>>>>> The trick is to design a proper update interface (and then implement
>>>>> it ;) ).
>>>>> And I have not spent enough time on this issue to get a good idea of
>>>>> what that interface would be.
>>>>>
>>>>
>>>> Hm, sounds non-trivial :) noting BFI depends on BPI.
>>>>
>>>
>>> To step back, it looks like:
>>>
>>> want to use profiles from more passes -> need to get BFI (from loop
>>> passes) -> need all the loop passes to preserve BFI.
>>>
>>> I wonder if there's no way around this.
>>>
>>> Indeed. I believe this is a general consensus here.
>>>
>>> regards,
>>> Fedor.
>>>
>>>
>>>
>>>>
>>>>> regards,
>>>>> Fedor.
>>>>>
>>>>>
>>>>>> It seems that for different types of passes to be able to get PSI and
>>>>>> BFI, we’d need to ensure PSI is cached for a non-module pass, and PSI, BFI
>>>>>> and the ModuleAnalysisManager proxy are cached for a loop pass in the pass
>>>>>> pipelines. This may mean potentially needing to insert BFI/PSI in front of
>>>>>> many passes [2]. It seems not obvious how to conditionally insert BFI for
>>>>>> PGO pipelines because there isn’t always a good flag to detect PGO cases
>>>>>> [3] or we tend to build pass pipelines before examining the code (or
>>>>>> without propagating enough info down) [4].
>>>>>>
>>>>>> Proposed approach
>>>>>>
>>>>>> - Cache PSI right after the profile summary in the IR is written in
>>>>>> the pass pipeline [5]. This would avoid the need to insert
>>>>>> RequiredAnalysisPass for PSI before each non-module pass that needs it. PSI
>>>>>> can be technically invalidated but unlikely. If it does, we insert another
>>>>>> RequiredAnalysisPass [6].
>>>>>>
>>>>>> - Conditionally insert RequireAnalysisPass for BFI, if PGO, right
>>>>>> before each loop pass that needs it. This doesn't seem avoidable because
>>>>>> BFI can be invalidated whenever the CFG changes. We detect PGO based on the
>>>>>> command line flags and/or whether the module has the profile summary
>>>>>> info (we may need to pass the module to more functions.)
>>>>>>
>>>>>> - Add a new proxy ModuleAnalysisManagerLoopProxy for a loop pass to
>>>>>> be able to get to the ModuleAnalysisManager in one step and PSI through it.
>>>>>>
>>>>>> Alternative approaches
>>>>>>
>>>>>> Dropping BFI and use PSI only
>>>>>> We could consider not using BFI and solely relying on PSI and
>>>>>> function-level profiles only (as opposed to block-level), but profile
>>>>>> precision would suffer.
>>>>>>
>>>>>> Computing BFI in-place
>>>>>> We could consider computing BFI “in-place” by directly running BFI
>>>>>> outside of the pass manager [7]. This would let us avoid using the analysis
>>>>>> manager constraints but it would still involve running an outer-scope
>>>>>> analysis from an inner-scope pass and potentially cause problems in terms
>>>>>> of pass pipelining and concurrency. Moreover, a potential downside of
>>>>>> running analyses in-place is that it won’t take advantage of cached
>>>>>> analysis results provided by the pass manager.
>>>>>>
>>>>>> Adding inner-scope versions of PSI and BFI
>>>>>> We could consider adding a function-level and loop-level PSI and
>>>>>> loop-level BFI, which internally act like their outer-scope versions but
>>>>>> provide inner-scope results only. This way, we could always call getResult
>>>>>> for PSI and BFI. However, this would still involve running an outer-scope
>>>>>> analysis from an inner-scope pass.
>>>>>>
>>>>>> Caching the FAM and the MAM proxies
>>>>>> We could consider caching the FunctionalAnalysisManager and the
>>>>>> ModuleAnalysisManager proxies once early on instead of adding a new proxy.
>>>>>> But it seems to not likely work well because the analysis cache key type
>>>>>> includes the function or the module and some pass may add a new function
>>>>>> for which the proxy wouldn’t be cached. We’d need to write and insert a
>>>>>> pass in select locations to just fill the cache. Adding the new proxy would
>>>>>> take care of these with a three-line change.
>>>>>>
>>>>>> Conditional BFI
>>>>>> We could consider adding a conditional BFI analysis that is a wrapper
>>>>>> around BFI and computes BFI only if profiles are available (either checking
>>>>>> the module has profile summary or depend on the PSI.) With this, we
>>>>>> wouldn’t need to conditionally build pass pipelines and may work for the
>>>>>> new pass manager. But a similar wouldn’t work for the old pass manager
>>>>>> because we cannot conditionally depend on an analysis under it.
>>>>>>
>>>>>> There is LazyBlockFrequencyInfo.
>>>>>> Not sure how well it fits this idea.
>>>>>>
>>>>>
>>>>> Good point. LazyBlockFrequencyInfo seems usable with the old pass
>>>>> manager (save unnecessary BFI/BPI) and would work for function passes. I
>>>>> think the restriction still applies - a loop pass cannot still
>>>>> request (outer-scope) BFI, lazy or not, new or old (pass manager). Another
>>>>> assumption is that it'd be cheap and safe to unconditionally depend
>>>>> on PSI or check the module's profile summary.
>>>>>
>>>>>
>>>>>> regards,
>>>>>> Fedor.
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] We cannot call AnalysisManager::getResult for an outer scope but
>>>>>> only getCachedResult. Probably because of potential pipelining or
>>>>>> concurrency issues.
>>>>>> [2] For example, potentially breaking up multiple pipelined loop
>>>>>> passes and insert RequireAnalysisPass<BlockFrequencyAnalysis> in front of
>>>>>> each of them.
>>>>>> [3] For example, -fprofile-instr-use and -fprofile-sample-use aren’t
>>>>>> present in ThinLTO post link builds.
>>>>>> [4] For example, we could check whether the module has the profile
>>>>>> summary metadata annotated when building pass pipelines but we don’t always
>>>>>> pass the module down to the place where we build pass pipelines.
>>>>>> [5] By inserting RequireAnalysisPass<ProfileSummaryInfo> after the
>>>>>> PGOInstrumentationUse and the SampleProfileLoaderPass passes (and around
>>>>>> the PGOIndirectCallPromotion pass for the Thin LTO post link pipeline.)
>>>>>> [6] For example, the context-sensitive PGO.
>>>>>> [7] Directly calling its constructor along with the dependent
>>>>>> analyses results, eg. the jump threading pass.
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing listllvm-dev at lists.llvm.orghttps://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>
>>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190313/db4bad6a/attachment-0001.html>
More information about the llvm-dev
mailing list