[PATCH] D120584: [NewPM][Inliner] Mark inlined calls to functions in same SCC as callee as noinline

Arthur Eubanks via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Mar 7 00:26:49 PST 2022


aeubanks added inline comments.


================
Comment at: llvm/lib/Transforms/IPO/Inliner.cpp:952
+              if (CalleeSCC != C &&
+                  CalleeSCC == CG.lookupSCC(CG.get(*NewCallee))) {
+                ICB->addFnAttr(Attribute::NoInline);
----------------
xbolva00 wrote:
> nikic wrote:
> > wenlei wrote:
> > > davidxl wrote:
> > > > nikic wrote:
> > > > > davidxl wrote:
> > > > > > nikic wrote:
> > > > > > > davidxl wrote:
> > > > > > > > This feels very strict. Is it possible to allow some level of iterative inlining into the same SCC, under an option? Not every case of inlining into a non-trivial SCC will result in a compile-time issue, because the existing threshold-based stopping mechanism kicks in once the new callee in the SCC becomes too large.
> > > > > > > > 
> > > > > > > This does not restrict inlining into the same SCC. What it prevents is continuously peeling off more and more calls from a *different* non-trivial SCC.
> > > > > > > 
> > > > > > > From what I can tell, this is a rare case; only two programs in llvm-test-suite show any codegen changes with this patch.
> > > > > > To clarify, my understanding of the scenario is the following:
> > > > > > 
> > > > > > 
> > > > > > foo --> (a, b, c), where foo calls 'a', and a, b, and c form a non-trivial SCC.
> > > > > > 
> > > > > > Without the fix, we can end up inlining foo->a, foo->b, foo->c, foo->a, foo->b, ... until foo blows up (we don't have a caller limit set in the inliner).
> > > > > > 
> > > > > > With this patch, once foo->a is inlined, the new call site foo->b is marked noinline.
> > > > > > 
> > > > > > Is that right?
> > > > > Yeah, that sounds about right. I believe local inlining history will actually make sure that we only inline foo->a, foo->b, foo->c, which is what prevents this from simply inlining infinitely. But it still ends up flattening the SCC, which is particularly problematic if there are multi-edges in the SCC, in which case flattening is exponential.
> > > > > 
> > > > > D98481 forbids the foo->a inline in the first place, while this patch allows inlining foo->a, but forbids the foo->b inline. There are some cases where this kind of single-level inline is probably useful, e.g. if inlining at this call-site can prune away the recursive parts entirely.
> > > > What I am thinking of is a lightweight mechanism that fixes this with more flexibility. Assuming the number of call sites (from outside the SCC) into a non-trivial SCC is small, a map from <caller, non-trivial SCC> pairs to an inlined-call count can be used to track how much iterative inlining has happened between the caller and the SCC. If the count exceeds the limit, mark the new call site noinline.
> > > This change is indeed better than D98481, but I'm still not sure if it is good enough from a performance perspective.
> > > 
> > > I hear the argument about build speed and the earlier comments on strict bottom-up inlining, but different people strike a different balance between build speed and performance. It would be great if we can achieve both; otherwise, trying to minimize the perf impact and giving users flexibility is probably better.
> > > 
> > > Recursive inlining is a weak spot for LLVM even without this change, and this area is somewhat perf-sensitive. We have observed a perf gap between GCC and LLVM due to less recursive inlining. We have also made changes to allow more recursive inlining with PGO outside of the CGSCC inliner, where there are fewer constraints of this kind. D109104 is an example of more aggressive recursive inlining with PGO leading to visible perf movement even on SPEC.
> > > 
> > > We can try to measure this change with our internal benchmarks. But overall I agree with David that a solution that gives people flexibility would be better (either an optional inliner size/growth cap, or a cross-SCC inline history counter like David suggested).
> > Would it be sufficient to add an option for disabling the check to get this landed? Does that add the desired flexibility?
> > 
> > To be clear, this is really not about a compile-time / performance tradeoff -- the exponential nature of the issue means that builds just don't finish in any reasonable timeframe anymore. It's a correctness problem, and as such should have been fixed months ago, even //if// there were a performance impact.
> > 
> > At this point I cannot justify delaying a fix for this critical bug anymore. If upstream is unwilling to accept a "good enough" fix, then we will make this a required patch for downstream LLVM distributors instead. I had hoped to avoid this, because it diverges upstream LLVM from distro-provided LLVM and causes work for each distributor, but at the same time I also can't have an LLVM 14.0.0 release with this issue unfixed, which is where we're headed now.
> > 
> > PS: I //am// happy to discuss more invasive ways to address this general class of problems //after// an immediate fix has landed on the release branch. Something I have been toying with is assigning an inlining penalty to inlined call sites, based on the cost of the callee they were inlined from. This basically allows those call sites to use up the remaining inlining budget of the parent function, but not more than that. This is principled, in that it avoids a cost-model bypass if inlining happens in a non-bottom-up fashion, but I suspect that it will fare worse than ad-hoc solutions in practice, because it would also penalize inlining of promoted or devirtualized calls. But this is just a side note, as neither this nor the other alternatives we have previously discussed are suitable as immediate, minimally invasive, low-impact bug fixes.
> Your fix looks good enough; a better fix could be added in the 14.0.1 point release.
https://reviews.llvm.org/D121084 is a similar alternative to this: instead of marking call sites `noinline`, it makes the inline cost of these sorts of calls exponentially more expensive. Based on testing variants of mut-rec-scc.ll, it's definitely not as good at preventing bloat as this patch, but perhaps it's good enough for real-world examples?
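
To make the shape of the problem concrete before comparing the alternatives, here is a minimal, hypothetical C++ translation unit matching davidxl's foo --> (a, b, c) scenario above; the function names and bodies are made up purely for illustration.

```
#include <cstdio>

static int b(int n);
static int c(int n);

// a, b, and c call each other in a cycle, so they form a single
// non-trivial SCC in the call graph.
static int a(int n) { return n <= 0 ? 0 : 1 + b(n - 1); }
static int b(int n) { return n <= 0 ? 0 : 1 + c(n - 1); }
static int c(int n) { return n <= 0 ? 0 : 1 + a(n - 1); }

// foo sits outside the SCC and calls into it. Inlining a into foo exposes a
// call to b in foo's body, inlining that exposes a call to c, and so on:
// without a stopping condition the inliner keeps peeling calls off the SCC
// and flattening it into foo.
int foo(int n) { return a(n); }

int main() { std::printf("%d\n", foo(10)); }
```

With this patch, once `foo->a` is inlined, the newly exposed `foo->b` call site is marked `noinline`, so the peeling stops after one level.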
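davidxl's suggestion of tracking a per-(caller, SCC) inline count could look roughly like the toy model below. This is not LLVM code and none of these names exist in the tree; it only illustrates the bookkeeping: count how many call sites a caller has inlined that lead into the same foreign SCC, and only mark new call sites `noinline` once a configurable limit is exceeded.

```
#include <cstdio>
#include <map>
#include <utility>

// Stand-ins for the real call graph types; purely illustrative.
struct Function { const char *Name; };
struct SCC { int ID; };

// Hypothetical limit; a real patch would presumably expose this as a cl::opt.
constexpr unsigned MaxCrossSCCInlines = 4;

// Number of inlines performed so far for each (caller, foreign SCC) pair.
static std::map<std::pair<const Function *, const SCC *>, unsigned>
    CrossSCCInlineCount;

// Decide whether a call site newly exposed by inlining should be marked
// noinline: only once the caller has already pulled enough of CalleeSCC in.
bool shouldMarkNoInline(const Function &Caller, const SCC &CallerSCC,
                        const SCC &CalleeSCC) {
  if (&CalleeSCC == &CallerSCC)
    return false; // calls within the caller's own SCC are not the issue here
  unsigned &Count = CrossSCCInlineCount[{&Caller, &CalleeSCC}];
  return ++Count > MaxCrossSCCInlines;
}

int main() {
  Function Foo{"foo"};
  SCC FooSCC{0}, ABC{1};
  // The first MaxCrossSCCInlines peels are allowed; later ones are blocked.
  for (int I = 1; I <= 6; ++I)
    std::printf("peel %d -> mark noinline: %d\n", I,
                shouldMarkNoInline(Foo, FooSCC, ABC));
}
```

This mirrors the check in the patch (`CalleeSCC != C && CalleeSCC == CG.lookupSCC(CG.get(*NewCallee))`), except that the `noinline` marking only kicks in after the counter passes the limit.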
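The budget-inheritance idea nikic floats in his PS above could be modelled like this (again a toy sketch, not a real cost-model change; the threshold and costs are invented numbers): a call site that only exists because of an earlier inline carries the cost of the callee it was peeled out of, so it can spend at most whatever budget the parent function had left.

```
#include <cstdio>

struct CallSite {
  int CalleeCost;       // cost of inlining the callee at this site
  int InheritedPenalty; // cost of the callee this site was peeled out of;
                        // 0 for call sites present in the original source
};

constexpr int InlineThreshold = 100; // stand-in for the usual inline threshold

bool shouldInline(const CallSite &CS) {
  // The penalty eats into the budget: a site exposed by an expensive inline
  // has little or nothing left to spend on further inlining.
  return CS.CalleeCost + CS.InheritedPenalty <= InlineThreshold;
}

int main() {
  CallSite Original{60, 0};    // direct call written in the source: inlined
  CallSite PeeledOnce{60, 60}; // exposed by inlining a cost-60 callee: rejected
  std::printf("original: %d, peeled: %d\n", shouldInline(Original),
              shouldInline(PeeledOnce));
}
```

As nikic notes, the downside is that promoted or devirtualized calls would be penalized the same way.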
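And the D121084-style alternative mentioned above, sketched the same way (the doubling factor is illustrative, not what the actual patch uses): instead of a hard `noinline`, each additional call peeled out of the same foreign SCC becomes exponentially more expensive, so repeated peeling quickly stops passing the threshold.

```
#include <cstdio>

constexpr int InlineThreshold = 100;

// TimesPeeled is how many call sites this caller has already inlined that
// lead into the same foreign SCC. Doubling per peel gives 2^TimesPeeled
// growth, so the cost blows past the threshold after a few iterations.
int adjustedCost(int BaseCost, unsigned TimesPeeled) {
  int Cost = BaseCost;
  for (unsigned I = 0; I < TimesPeeled; ++I)
    Cost *= 2;
  return Cost;
}

int main() {
  for (unsigned Peels = 0; Peels < 5; ++Peels) {
    int Cost = adjustedCost(30, Peels);
    std::printf("peel %u: cost %d -> inline? %d\n", Peels, Cost,
                Cost <= InlineThreshold);
  }
}
```

Unlike the `noinline` approach, this still lets a very cheap SCC member through a second or third time, which is presumably why it is weaker at preventing bloat on the mut-rec-scc.ll variants.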



Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120584/new/

https://reviews.llvm.org/D120584


