[llvm] [LTO][Pipelines] Add 0 hot-caller threshold for SamplePGO + FullLTO (PR #135152)

Wed Apr 16 05:19:52 PDT 2025

tianleliu wrote:

Hi @teresajohnson Many thanks for your comment!
> Why wouldn't the inlined version of the hot function be optimized when we go through the normal optimization pipeline?

I ran into an example of this before, described in https://discourse.llvm.org/t/rfc-spgo-passpipelines-adding-instcombinepass-and-simplifycfgpass-before-sampleprofileloaderpass-when-building-in-spgo-mode/83340
The example is smax(). If smax is not optimized (by eliminating the "sext to i32") before it is inlined, then after inlining InstCombine optimizes the "trunc+sext" pair first, which breaks the "sext+cmp+sel" pattern in smax. This happens because InstCombinePass generally walks the IR from top to bottom, without considering that the callee's code has better locality and should get higher priority.
I have fixed this case by adding max/min pattern identification to InstCombine (https://github.com/llvm/llvm-project/pull/118932), but I don't think it is a perfect solution. I think the root cause is that inlining happens too early in SampleProfileLoader, before the callee has been optimized enough. It is also hard to cover all cases, especially the more complicated ones, by adding an endless series of pattern matches.

> The inlining done in the sample loader pass is focused on getting the best matching of the profile with the context from the profiled binary. This is easier on unoptimized functions.

Yes, interleaving inlining with sample loading helps match the correct profile info efficiently.

> a number of optimization passes are profile guided.

Yes, the earlier the sample profile is loaded, the more accurately the following optimizations can work.

> I'm not sure how you would get the correct (context sensitive) profile data if that inlining was deferred.

My rough idea for implementing context-sensitive deferred inlining:
1. Clone the callee function (with its unique caller stack info encoded in the function name or in metadata) instead of inlining it in SampleProfileLoaderPass.
2. Record all of its profile info in the corresponding cloned function.
3. Inline the cloned function in the general InlinePass.
For example:
define funca() {...}
define funcb() {
  call funca()   // hot call site; the profile has samples for funca in this context
}
would be transformed by SampleProfileLoader into:
define funca_b() { ... }   // carries the detailed !prof info for this context
define funcb() {
  call funca_b()   // the head sample count is recorded in !prof
}
If a function A is called by several functions B, C, ..., it would have several cloned versions A_B, A_C, ..., each with its own name (a unique call stack) and its own profile info (matching the profiled binary). Could this achieve context sensitivity? I believe most of the clones would be inlined in the general InlinePass since they are considered hot, so in the end most of the cloned functions would be eliminated.
With this approach, every callee can be fully optimized before it is inlined, and the inliner and the sample profile loader become more independent, each focusing on its own job.
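
To make the clone-then-inline idea concrete, here is a minimal Python sketch of steps 1 and 2 on a toy model of a module. It is not LLVM code: the `Function` class, `clone_hot_callees`, the `hot_threshold` parameter, and the `(caller, callee) -> samples` profile layout are all hypothetical names made up for illustration, and it only handles one level of calling context.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Function:
    """Toy stand-in for an IR function (hypothetical, not an LLVM type)."""
    name: str
    calls: list[str] = field(default_factory=list)  # callee names, in call order
    samples: int = 0                                 # stand-in for !prof data


def clone_hot_callees(module: dict[str, Function],
                      context_profile: dict[tuple[str, str], int],
                      hot_threshold: int) -> None:
    """Replace each hot call with a call to a per-caller clone of the callee.

    The clone is named <callee>_<caller> and carries the samples recorded
    for that calling context. Nothing is inlined here; that is left to a
    later, separate inlining step.
    """
    # Iterate over a snapshot so that newly added clones are not revisited.
    for caller in list(module.values()):
        for i, callee in enumerate(caller.calls):
            count = context_profile.get((caller.name, callee), 0)
            if callee not in module or count < hot_threshold:
                continue  # cold or unknown call site: leave it untouched
            clone_name = f"{callee}_{caller.name}"
            if clone_name not in module:
                module[clone_name] = Function(
                    clone_name, list(module[callee].calls), count)
            caller.calls[i] = clone_name


# Usage: funca is hot when called from funcb and funcc, cold from funcd.
module = {
    "funca": Function("funca"),
    "funcb": Function("funcb", calls=["funca"]),
    "funcc": Function("funcc", calls=["funca"]),
    "funcd": Function("funcd", calls=["funca"]),
}
profile = {("funcb", "funca"): 900, ("funcc", "funca"): 400,
           ("funcd", "funca"): 1}
clone_hot_callees(module, profile, hot_threshold=100)
```

After the call, funcb and funcc each call their own clone with its own sample count, while the cold call in funcd still targets the original funca. A real implementation would need to handle deeper call stacks and the code-size cost of clones that end up not being inlined.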

https://github.com/llvm/llvm-project/pull/135152