[PATCH] D121862: [ProfSampleLoader] When disable-sample-loader-inlining is true, merge profiles of inlined instances to outlining versions.

Kazu Hirata via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jun 8 14:48:07 PDT 2022


kazu added a comment.

@wenlei, thank *you* for sharing your insights.

> For module inliner vs sample loader inliner, I understand that it'd be more accurate to estimate size and simplification effect of inlining when done later in the pipeline, but from the current analysis, it looks like we won't miss much if we do cost-benefit analysis in sample loader, have you considered that? This way you get to leverage context sensitivity and also benefit from cost-benefit analysis.

I haven't considered doing cost-benefit analysis in the sample loader inliner, but I have observed something consistent with your comment about accuracy.  Specifically, the cycle savings are dominated by the call site cost (that is, the cost of the call and return instructions) computed in `InlineCostCallAnalyzer::costBenefitAnalysis` in `InlineCost.cpp`:

  // Compute the total savings for the call site.
  auto *CallerBB = CandidateCall.getParent();
  BlockFrequencyInfo *CallerBFI = &(GetBFI(*(CallerBB->getParent())));
  CycleSavings += getCallsiteCost(this->CandidateCall, DL);
  CycleSavings *= CallerBFI->getBlockProfileCount(CallerBB).getValue();

I do compute the savings from conditionals being folded and such (look for the comment `Count a conditional branch as savings if it becomes unconditional.`), but that's a relatively minor addition.  For large enough applications, hot call sites are easily a factor of 100x or 1000x apart in terms of their sample counts.  Even if our size estimate is off by a factor of, say, 2 because we forget to take folded branches into account, the slightly imprecise cost-benefit ratio only affects call sites near the cost-benefit threshold without significantly changing the set of important call sites to be inlined.
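
To make the shape of that computation concrete, here is a stripped-down sketch.  This is my paraphrase, not the actual `InlineCost.cpp` code, and the parameter names are made up for illustration:

  #include <cstdint>

  // Hedged sketch: savings from instructions/branches that fold away after
  // inlining, plus the call-site overhead, scaled by how often the call site
  // executes.  With hot call sites 100x-1000x apart in sample counts, the
  // CallSiteCount factor dominates the folded-instruction term.
  static uint64_t estimateCycleSavings(uint64_t FoldedInstrSavings,
                                       uint64_t CallOverheadCost,
                                       uint64_t CallSiteCount) {
    return (FoldedInstrSavings + CallOverheadCost) * CallSiteCount;
  }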

My observation above is also consistent with the following: small changes to the cost-benefit threshold affect, again, only those call sites near the threshold without significantly changing the set of important call sites to be inlined.  Requiring more benefits would produce a smaller executable with comparable performance.  In concrete terms, you might set `InlineSavingsMultiplier` to somewhere between 4 and 6 for instrumentation FDO applications.
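
For reference, the gate itself has roughly the following shape.  This is again a hedged sketch rather than the exact code; `HotCountThreshold` stands in for the hot count threshold from `ProfileSummaryInfo`:

  #include <cstdint>

  // Inline when the cycle savings per unit of size growth clear the hot count
  // threshold scaled by InlineSavingsMultiplier.  Raising the multiplier
  // requires more savings per byte of growth, hence a smaller executable.
  static bool passesCostBenefit(uint64_t CycleSavings, uint64_t SizeGrowth,
                                uint64_t HotCountThreshold,
                                uint64_t InlineSavingsMultiplier) {
    if (SizeGrowth == 0)
      SizeGrowth = 1;
    // Equivalent to CycleSavings / SizeGrowth >= Multiplier * Threshold,
    // written without division to avoid rounding.
    return CycleSavings >= InlineSavingsMultiplier * HotCountThreshold * SizeGrowth;
  }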

> About context similarity, IIUC your experiments indicate the majority of function profiles are actually not context sensitive? That is interesting. For CSSPGO on one internal workload, we see noticeable perf degradation if we try to cap context profile depth -- that indicates there's enough context sensitivity on the critical path for it to matter. Maybe this is workload dependent.

No, the majority of functions (about 67% for clang) are context sensitive.  I did say that the hypothetical renaming trick (that is, one level of inlining) would take advantage of most of the context sensitivity, but that's for the entire AutoFDO application.  If we focus on the top 90% of samples or something, the picture may look quite different.  It's totally possible that our applications also require certain depths of inlining to reveal the context sensitivity.
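
As a toy illustration of what one level of inlining captures, consider a made-up AutoFDO text profile (invented names and counts):

  _Z3foov:2000:100
   1: 100
   3: _Z3barv:1500
    2: 1500
  _Z3barv:500:20
   2: 500

The inlined instance of `_Z3barv` under `_Z3foov` carries its own counts, which is the kind of context sensitivity the renaming trick would preserve; with sample loader inlining disabled, as in this patch, those 1500 samples get merged into the outline `_Z3barv` profile instead of being dropped.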

> The prolog/epilog analysis would need to be done a bit late. It'd be interesting to see how accurately we can model them.

IIRC, if I do the traditional iterative live variable analysis on the LLVM IR, not the machine IR, and take into account the x86-specific calling conventions (but treat all variables as integer/pointer variables for simplicity), I can estimate the number of PUSH instructions exactly 67% of the time and within one 93% of the time.  Of course, if the inline "root" keeps changing during inlining, the fairly accurate estimate is of no use.  I'll revisit this project when I start playing with the module inliner, where the inline "root" will hopefully tend to stick around, and less important call sites can wait in the priority queue instead of being discarded.
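
For the curious, the analysis I mean is ordinary backward liveness.  A minimal sketch on a toy CFG (not the actual experiment, which ran on LLVM IR) might look like this, with the PUSH estimate taken as the peak number of values live across a call, clamped to the six x86-64 callee-saved GPRs:

  #include <algorithm>
  #include <set>
  #include <string>
  #include <vector>

  struct Block {
    std::set<std::string> Use;  // values read before any local definition
    std::set<std::string> Def;  // values defined in this block
    std::vector<int> Succs;     // successor block indices
    bool HasCall = false;       // does this block contain a call?
  };

  static int estimatePushCount(const std::vector<Block> &CFG) {
    std::vector<std::set<std::string>> LiveIn(CFG.size()), LiveOut(CFG.size());
    // Traditional iterative (backward) live-variable analysis to a fixed point.
    for (bool Changed = true; Changed;) {
      Changed = false;
      for (int I = (int)CFG.size() - 1; I >= 0; --I) {
        std::set<std::string> Out;
        for (int S : CFG[I].Succs)
          Out.insert(LiveIn[S].begin(), LiveIn[S].end());
        std::set<std::string> In = CFG[I].Use;  // In = Use U (Out - Def)
        for (const auto &V : Out)
          if (!CFG[I].Def.count(V))
            In.insert(V);
        if (In != LiveIn[I] || Out != LiveOut[I]) {
          LiveIn[I] = std::move(In);
          LiveOut[I] = std::move(Out);
          Changed = true;
        }
      }
    }
    // Values live across a call must survive in callee-saved registers; x86-64
    // has six callee-saved GPRs (rbx, rbp, r12-r15), so clamp the estimate.
    // Block granularity is a simplification; the real thing looks at the
    // program point of each call.
    size_t Peak = 0;
    for (size_t I = 0; I < CFG.size(); ++I)
      if (CFG[I].HasCall)
        Peak = std::max(Peak, LiveOut[I].size());
    return (int)std::min<size_t>(Peak, 6);
  }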

> 8x total .text difference is very surprising. For us, on HHVM workload, AutoFDO/CSSPGO total .text size is only a few % larger than IRPGO using the same training. For your case, did you use the same training for both, or was IRPGO using benchmark/canary while AutoFDO was on fleet-wide profile?

The latter.  The instrumentation FDO executable gets trained with a representative workload, while the AutoFDO executable is built with the profile collected from the instrumentation FDO executable processing the real workload.  I don't know how good an idea it is to build an AutoFDO executable this way.  At least, it's not a typical way to build AutoFDO executables.

> For sampling PGO in general, we found there's a correlation between perf improvement and size increase, even when we were tweaking selectiveness rather than aggressiveness. Similarly, we found that tuning inlining to be more aggressive than the default often leads to better perf. Wondering if you've observed something similar.

No, I haven't played with the AutoFDO parameters in isolation.  That said, I suspect (but haven't verified) that our AutoFDO executable is well past the point where your correlation between perf improvement and size increase holds.  Knowing that the instrumentation FDO executable performs a few percent better than the AutoFDO counterpart, along with other goodies (a smaller .text/.text.hot footprint, fewer i-cache/iTLB misses, a lower call instruction frequency, etc.), I'm hoping to put the AutoFDO executable on another slope to a higher maximum.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121862/new/

https://reviews.llvm.org/D121862


