[PATCH] D121862: [ProfSampleLoader] When disable-sample-loader-inlining is true, merge profiles of inlined instances to outlining versions.

Kazu Hirata via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 6 15:27:45 PDT 2022


kazu added a comment.

@hoy, @wenlei Sorry for the extremely delayed reply to your questions.  Here are some thoughts that @davidxl and I have discussed:

**Quick background**

I enabled the cost-benefit analysis in early 2021 for instrumentation FDO.  It gives us performance gains by inlining big but very hot call sites that would be rejected by the simple size-based threshold.  At the same time, we keep the combined size of .text.hot and .text largely the same by rejecting small but only marginally hot call sites.  In the end, we reduce the call instruction frequency -- the number of call instructions executed per 1000 retired instructions.
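
To make the cost-benefit idea concrete, here is a minimal sketch of the shape of the decision; it is not the actual InlineCost.cpp logic, and all names and the threshold are made up for illustration:

```cpp
// Illustrative only: a cost-benefit style inline decision.  A big callee can
// still be worth inlining if the call site is hot enough, and a small callee
// can be rejected if the call site is only marginally hot.
#include <cstdint>

struct CallSiteEstimate {
  uint64_t CycleSavings; // estimated cycles saved per run by inlining
  uint64_t SizeCost;     // estimated growth of .text/.text.hot in bytes
};

bool shouldInlineCostBenefit(const CallSiteEstimate &CS,
                             uint64_t SavingsPerByteThreshold) {
  // Inline only if the cycles saved per byte of code-size growth clear the bar.
  return CS.CycleSavings >= CS.SizeCost * SavingsPerByteThreshold;
}
```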

Analyzing our large internal benchmark reveals several problems with AutoFDO relative to instrumentation FDO:

- The AutoFDO binary performs worse.

- The combined size of .text.hot and .text of the AutoFDO executable is 8 times as big as that of the FDO executable (even with the machine function splitting turned off).

- The AutoFDO binary is more frontend bound and causes more i-cache and iTLB misses on x86.  That is, the backend is sitting idle, waiting for decoded instructions.

- The AutoFDO binary invokes call instructions more often than the FDO binary (even with the cost-benefit analysis disabled).

So, all in all, it's clear that we are inlining a less-than-ideal set of call sites.

**Context similarity analysis**

I'm exploring the opposite of what you are exploring -- shifting some inlining from the sample loader inliner to the SCC inliner with the cost-benefit analysis enabled, which is currently disabled for AutoFDO.

In my experiments, I don't get consistent performance wins from simply tuning down the sample loader inliner with increased thresholds on sample counts and enabling the cost-benefit analysis for the SCC inliner.  So, I am wondering if we could intelligently tune down the sample loader inliner -- inlining context-sensitive callees in the sample loader inliner and leaving the rest to the SCC inliner.  Note that the sample loader inliner is the only place where we can take advantage of context sensitivity.  Once we flatten the profile of a given callee, we lose information on context sensitivity.

I did some analysis on context sensitivity with clang (as an application) and our large internal benchmark.  It turns out that inlining a context-sensitive function into its immediate caller will allow us to take advantage of most of the context sensitivity.  Specifically, in clang, only 33% of functions have a single behavior.  Now, given A->B, if we hypothetically created a copy of B and named it B_called_by_A for every callee at the source code level, 91% of functions would have a single behavior.  Our large internal benchmark is similar to clang in this aspect.
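
To make that thought experiment concrete, here is a hypothetical source-level picture of the per-caller cloning; the function names are made up:

```cpp
// Hypothetical illustration: B behaves differently depending on which caller
// invokes it, so its flat (context-insensitive) profile mixes two behaviors.
void B(int Mode);
void A1() { B(/*Mode=*/0); }
void A2() { B(/*Mode=*/1); }

// With an imaginary per-caller copy of B, most copies end up with a single,
// predictable behavior -- which is what inlining B into its immediate caller
// (or keeping its context-sensitive profile) effectively buys us.
void B_called_by_A1(int Mode); // only ever sees Mode == 0
void B_called_by_A2(int Mode); // only ever sees Mode == 1
```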

**Module inliner**

I am planning to explore the possibility of (largely) replacing the SCC inliner with the module inliner.  Basically, we would inline call sites in descending order of profitability -- most likely the ratio of cycle savings to size costs.
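
As a rough sketch of what I mean (this is not the actual ModuleInliner implementation; the names and the profitability metric are placeholders), the driver would just pop call sites off a priority queue keyed on the cycle-savings-to-size-cost ratio:

```cpp
// Illustrative only: profitability-ordered inlining across a whole module.
#include <queue>
#include <utility>
#include <vector>

struct InlineCandidate {
  double CycleSavings = 0.0; // estimated cycles saved by inlining this site
  double SizeCost = 1.0;     // estimated code-size growth
  double profitability() const { return CycleSavings / SizeCost; }
};

struct LessProfitable {
  bool operator()(const InlineCandidate &A, const InlineCandidate &B) const {
    return A.profitability() < B.profitability(); // max-heap on the ratio
  }
};

void runModuleInliner(std::vector<InlineCandidate> Sites) {
  std::priority_queue<InlineCandidate, std::vector<InlineCandidate>,
                      LessProfitable>
      Queue(LessProfitable(), std::move(Sites));
  while (!Queue.empty() && Queue.top().profitability() >= 1.0) {
    InlineCandidate CS = Queue.top();
    Queue.pop();
    // ... inline CS, then re-estimate and re-push the call sites it affects ...
  }
  // Anything still in the queue was never profitable enough -- it is deferred,
  // not discarded, which matters for the prologue/epilogue discussion below.
}
```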

My plan is to try it out with instrumentation FDO first as we have very accurate (but context-insensitive) profile information.

**Prologue/epilogue analysis**

@davidxl mentioned this, so I might as well expand on it a little here.  We do spend cycles on prologues and epilogues, but we do not take that into account when inlining.  Specifically, given A->B->C, inlining C into B could make B's prologue/epilogue bigger because of increased register usage.  If B doesn't call C often enough, then B's bigger prologue/epilogue could slow things down when A calls B.
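
A made-up example of the A->B->C concern, with hypothetical functions:

```cpp
// Illustrative only.  Inlining C into B can raise B's register pressure, so B
// needs more callee-saved registers and a bigger prologue/epilogue -- a cost A
// pays on every call to B, even when the rarely-taken inlined path never runs.
int C(int X) { return X * X + 7; } // imagine something register-hungry

int B(int X, bool Rare) {
  int R = X + 1;
  if (Rare)
    R += C(X); // inlining C here may force extra callee-saved spills in B
  return R;
}

int A(int X) {
  return B(X, /*Rare=*/false); // A almost always takes the cheap path through B
}
```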

There is room for improvement in this area, but it's hard to capture in the SCC inliner.  If we avoid inlining C into B because of prologue/epilogue size concerns, but B later gets inlined into A, then we have worried about the prologue/epilogue size in vain.  We need an inliner that doesn't simply discard call sites that do not look profitable at the moment they are visited.

I'm hoping that the module inliner fits the bill here.  A call site that does not look profitable right now simply stays in the priority queue.  If B->C does not look profitable enough now, we might inline A->B first, which may already grow A's prologue/epilogue.  At that point, inlining C into A (with B inlined into it) may cause no additional harm to A's prologue/epilogue.

**Sample loader inliner and module inliner**

This combination is pretty far into the future in my current plan, but the core idea will probably stay the same: let the sample loader inliner inline profitable context-sensitive call sites and leave the rest to the module inliner (as opposed to the SCC inliner).


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121862/new/

https://reviews.llvm.org/D121862


