[PATCH] D115497: [Inline] Disable deferred inlining

Nikita Popov via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 14 07:53:43 PST 2021


nikic added a comment.

In D115497#3185669 <https://reviews.llvm.org/D115497#3185669>, @mtrofin wrote:

> I believe part of the problem is that cost estimation doesn't take much of the caller into account. If it did, in the A->B->C case, A->B being too large could make A->C unprofitable.

Right. Taking the caller size into account would help to at least mitigate this general class of problems, though I think that we should avoid getting into the situation in the first place if we can. I think one area where this can happen and is not avoidable is de-indirection during inlining. One can probably construct a case where inlining each level of calls makes the next level direct -- a caller size limit, or a limit on reprocessing of new calls, is likely the only way to combat infinite growth in that case.
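
To make the de-indirection scenario concrete, here is a small hypothetical C++ sketch (not the reduced test case from this patch) where each inlining step turns the next indirect call into a direct one:

```cpp
// Hypothetical example: every time one level is inlined, the function pointer
// becomes a known constant and the next indirect call turns into a direct,
// newly inlinable call.
static int leaf(int x) { return x + 1; }

static int level2(int (*fp)(int), int x) {
  return fp(x);           // indirect; becomes a direct call to leaf once
                          // level2 is inlined with fp == leaf
}

static int level1(int (*fp)(int), int x) {
  return level2(fp, x);   // inlining this exposes the call inside level2
}

int entry(int x) {
  return level1(leaf, x); // inlining level1 here starts the chain
}
```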

> One other thing to consider about deferral, though: say A->B, C->B, B->D. Say not deferring B->D leads to B not being inlinable into either A or C. If we deferred, then suppose B gets inlined into A, and in this context, so does D; but suppose B does not get inlined into C, and then neither does D in this second context (we don't revisit call sites if inlining does not occur).

Yeah, something like that can happen, though I'm not sure that's a good outcome: The C->B caller that did not get inlined would probably benefit from the B->D edge being inlined -- but that doesn't happen, because deferral was done to enable inlining on an //unrelated// edge.
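
For reference, the scenario above written out as a tiny hypothetical C++ call graph (names as in the discussion):

```cpp
// A -> B, C -> B, B -> D, as in the scenario above (illustrative only).
static int D(int x) { return x * 2; }
static int B(int x) { return D(x) + 1; }
int A(int x) { return B(x); }
int C(int x) { return B(x) + 3; }
// Without deferral: B->D is inlined first and B may then be too large to
// inline into either A or C.
// With deferral: B->D is deferred, B is inlined into A, and D follows in that
// context; if B is not inlined into C, the original B->D call site is not
// revisited, so the copy reachable from C stays un-flattened.
```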

> For prudence, should we perhaps as a first step allow disabling it (and maybe the Rust frontend can then disable it by default?), but leave it on by default? We can then see if disabling it hurts performance of any real-world, PGO-ed binaries.

I went ahead and landed the option in https://github.com/llvm/llvm-project/commit/7abf299fedff2f5b22ea8180311b44feeac26a9f, so this patch is now only about flipping the default. Explicitly setting it on the rustc side would address our immediate problem, but isn't a great long-term solution and is not reliable in LTO scenarios.

In D115497#3189249 <https://reviews.llvm.org/D115497#3189249>, @mtrofin wrote:

> TL;DR: no significant performance effect
>
> I ran 2 things:
>
> - internal microbenchmarks we use for compiler releases, which use `google benchmark`. I used the perf counter support in that library and measured `INSTRUCTIONS` and `CYCLES` for a subset that I pre-verified has very low `INSTRUCTIONS` variation (i.e. around 0.2%). The geomean is flat. Per benchmark, I saw variation below 2% for either of the two counters.
> - an internal benchmark for a real application - no significant change

Thanks! That looks promising and about matches what I saw.
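
For readers unfamiliar with the measurement setup described above, a minimal `google benchmark` harness using its perf-counter support might look roughly like this (illustrative only, not the internal benchmarks referenced; it requires the library to be built with libpfm support):

```cpp
#include <benchmark/benchmark.h>

static void BM_Example(benchmark::State &state) {
  long sum = 0;
  for (auto _ : state)
    for (int i = 0; i < 1000; ++i)
      benchmark::DoNotOptimize(sum += i);
}
BENCHMARK(BM_Example);
BENCHMARK_MAIN();

// Run with, e.g.:
//   ./bench --benchmark_perf_counters=INSTRUCTIONS,CYCLES
```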

In D115497#3185931 <https://reviews.llvm.org/D115497#3185931>, @kazu wrote:

> Can you achieve the same effect with `-mllvm -inline-deferral-scale=0`?  If so, I would be inclined to keep the mechanism as is for now.
>
> As Mircea suggests, I am OK with disabling it by default after some benchmarking.  We could even remove it altogether if nobody complains for a while.

`-inline-deferral-scale=0` reduces the amount of deferred inlining, but doesn't disable it entirely, and doesn't help in the case I'm looking at. This is because the last call bonus can result in a large negative cost for deferred inlining, which will then end up being preferred independently of the deferral scale.
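
To illustrate with made-up numbers (paraphrasing the deferral check as described here, not the exact LLVM source):

```cpp
// Assume the caller has a single remaining call site, so the last call bonus
// of 15000 is applied to the secondary cost. All other numbers are invented.
bool wouldDefer() {
  int secondaryCost = 100;    // cost of inlining the caller into its caller
  secondaryCost -= 15000;     // last call bonus => -14900
  int deferralScale = 0;      // -inline-deferral-scale=0
  int callCost = 50;          // cost of the call being considered for inlining
  // Even with a scale of 0 the comparison stays true, so the call is deferred.
  return secondaryCost < callCost * deferralScale;  // -14900 < 0 => true
}
```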

In D115497#3188281 <https://reviews.llvm.org/D115497#3188281>, @wenlei wrote:

> Are you using the new PM with PGO? FWIW, we've also run into bloated inlining multiple times after switching to the new PM a while back, but most of that turned out to be due to a combination of the new PM's better use of BFI for inlining and some kind of profiling issue leading to an unusually flat profile, which makes the inliner super aggressive (a flat profile has a relatively lower hot threshold, so more call sites are considered hot).
>
> We haven't found problems with inline deferral so far, though we also haven't measured its perf benefit.

All reports I saw were for non-PGO builds. There does seem to be a relation to (static) BFI at least in my motivating case, because precision loss in block frequencies results in incorrect cold-callsite estimation. This ends up being the catalyst for the issue, but I don't think it's the root problem. It is likely the reason why the motivating test case worked fine under the legacy pass manager, which does not use BFI during inlining. (There is likely a selection effect here: these problems are treated as blocking regressions because previously working code breaks -- the legacy PM inliner also had instances of catastrophic inlining, but those were largely ignored.)

In D115497#3185894 <https://reviews.llvm.org/D115497#3185894>, @davidxl wrote:

> 1. There is a conflict between maximizing runtime performance and compile time, so this should not be disabled without extensive benchmarking from downstream users.

Generally true. However, when compile-time tends towards infinity it stops being an optimization tradeoff and starts being a correctness issue. The compiler is no longer able to produce a working binary.

> Why does it only happen with the new PM?

See above -- in my motivating case, I believe this is due to BFI use in the new PM, but ultimately the new PM is incidental here. It just perturbs inlining heuristics/behavior enough that previously working code breaks. The reduced test case in this patch causes catastrophic inlining in both the new and legacy PMs.

>> However, there is one instance where the inlining advisor explicitly violates the bottom-up principle: Deferred inlining tries to "defer" inlining a call if it determines that inlining the caller into all its call-sites would be more profitable. Something very important to understand about deferred inlining is that it doesn't make one inlining choice in place of another -- it effectively chooses to do both. If we have a call chain A -> B -> C and cost modelling tells us that inlining B -> C is profitable, but we defer this and instead inline A -> B first, then we'll now have a call A -> C, and the cost model will (a few special cases notwithstanding) still tell us that this is profitable. So the end result is that we inlined *both* B and C, even though under the usual cost model function B would have been too large to further inline after C has been integrated into it.
>
> I think flattening A->B->C is working by design in most cases.

This is working by design if the cost model tells us that this is profitable. Deferred inlining is used in cases where the cost model tells us that inlining both levels is not profitable -- and then we go ahead and do so anyway by changing the inlining order.
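
As a concrete (hypothetical) illustration of the A -> B -> C case from the summary:

```cpp
// Suppose the cost model says B -> C is profitable, but B would be too large
// to inline once C has been merged into it.
static int C(int x) { return x * x + 7; }   // stand-in for a sizeable callee
static int B(int x) { return C(x) + 1; }
int A(int x) { return B(x); }

// Bottom-up order: C is inlined into B, then A -> B is rejected because B grew.
// With deferral: A -> B is inlined first, which creates a direct A -> C call
// site that the cost model still considers profitable -- so both B and C end
// up inlined into A, even though inlining both was never judged profitable.
```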

> The last call bonus of 15000 is certainly tunable. It models the DFE (dead function elimination) opportunity, which should be callee-size dependent.

Yes, this use of the last call bonus would be correct if the last call bonus is roughly equal to the callee size. It's just the current use of a hardcoded and very large last call bonus that does not make sense in this context, because it vastly overestimates the benefit of removing small functions. (The hardcoded last call bonus does make sense for its primary use in the cost model.)
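
As a sketch of the direction being discussed (hypothetical, not something this patch implements), a bonus bounded by the callee's size would avoid the huge negative secondary cost for small functions:

```cpp
const int HardcodedLastCallBonus = 15000;           // current constant in the cost model

int secondaryCostFixedBonus(int outerInlineCost) {
  return outerInlineCost - HardcodedLastCallBonus;  // small callees still end up around -14000
}

int secondaryCostSizeBasedBonus(int outerInlineCost, int calleeSize) {
  return outerInlineCost - calleeSize;              // credit only what removing the callee saves
}
```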

>> The main alternative I see to complete removal is to change deferred inlining to an actual either/or decision. That is, to mark deferred calls as noinline so we're actually trading off one inlining decision against another, and not just adding a side-channel to the cost model to do both.
>
> No, that is not the intention of deferred inlining.

At least judging from the original commit message (https://github.com/llvm/llvm-project/commit/3059924bddca79e0fea5f1473d9a66985e1fda4a) and the motivating issue (https://github.com/llvm/llvm-project/issues/3345), the original idea was to inline the caller "instead", not "in addition to". But clearly this is not what the implementation ended up doing in practice.
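
For concreteness, the "either/or" variant quoted above could look roughly like this -- a sketch only, using the standard call-site attribute API, not code from this patch:

```cpp
#include "llvm/IR/Attributes.h"
#include "llvm/IR/InstrTypes.h"

// Hypothetical: when the advisor decides to defer a call site, also mark it
// noinline so the decision trades one inlining opportunity for another instead
// of keeping both available.
void deferCallSite(llvm::CallBase &CB) {
  CB.addFnAttr(llvm::Attribute::NoInline);
}
```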


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115497/new/

https://reviews.llvm.org/D115497


