[PATCH] D158891: [CSSPGO] Retire FlattenProfileForMatching

Wed Aug 30 14:17:44 PDT 2023

wlei added inline comments.

================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:2402
+    // Use top-level nested FS for counting profile mismatch metrics since
+    // currently once a callsite is mismatched, all its children profile are
+    // dropped.
----------------
wenlei wrote:
> hoy wrote:
> > hoy wrote:
> > > wlei wrote:
> > > > hoy wrote:
> > > > > wlei wrote:
> > > > > > hoy wrote:
> > > > > > > wlei wrote:
> > > > > > > > hoy wrote:
> > > > > > > > > nit: profile -> profiles
> > > > > > > > > 
> > > > > > > > > > once a callsite is mismatched, all its children profile are drooped
> > > > > > > > > 
> > > > > > > > > Why children profiles can be dropped? Counting happens before inlining right?
> > > > > > > > yeah, In the current  `findCalleeFunctionSamples` implementation, if the mismatch happens, it would return null, but there is no extra process to reuse those samples, so it's "dropped".
> > > > > > > > 
> > > > > > > > I think as Wenlei has proposed a solution in other patch: 
> > > > > > > > >Maybe what we should actually do is to reuse the nested profile from those mismatched parent functions, something like run promoteMergeNotInlinedContextSamples on top level functions samples not used? We do reuse nested profile for not inlined callees from parent function.
> > > > > > > > 
> > > > > > > > That sounds reasonable, but as we talked offline, we could leave it in future.
> > > > > > > I'm a bit confused. `FlattenedProfiles` is immutable once initialized right? What do we lose when using `FSFlattened` for `countProfileMismatches`?
> > > > > > > 
> > > > > > > If we are not to count for external functions, the nested callees there will be skipped too? But using a flattened profile can keep them, if they are not external.
> > > > > > Let me show you with an example, say we have a profile like
> > > > > > ```
> > > > > > A:102:1
> > > > > >  1:1
> > > > > >  2:B: 101:1
> > > > > >    1:1
> > > > > >    2:C:100:1
> > > > > >      1:1
> > > > > >       ...
> > > > > >      100:1
> > > > > > ```
> > > > > > Say callsite B in A is changed to other location and the callsite is mismatched.  Here like in this patch, it use the non-flattened profile.
> > > > > > the mismached Samples is 101, i,e, it not only count B's profile but also all the children profiles(`C`)
> > > > > > 
> > > > > > But if we use flattened profile:
> > > > > > ```
> > > > > > A:2:1
> > > > > >  1:1
> > > > > >  1: B:1
> > > > > > B:2:1
> > > > > >  1:1
> > > > > >  1: C:1
> > > > > > C:100:1
> > > > > >  1:1
> > > > > >   ...
> > > > > >   100:1
> > > > > > ```
> > > > > > The callsite of B in A is mismatched, but now the callsite count is 1, the mismatched Samples is 1.
> > > > > > 
> > > > > > I know it feels counter-intuitive, there is no mismatch in B and C, but that is the current implementation, we don't reuse the children profile. Once the parent callsite is mismatched, all the children profiles are dropped.
> > > > > > 
> > > > > >  
> > > > > > 
> > > > > Thanks for the explanation. I have a better understanding now. 
> > > > > 
> > > > > But I still have a question about how to count mismatched samples. In your examples, the 101 samples for B & C may still be used if B & C have no source changes. Yes, I get that the inlining won't happen, but even if it happens it only saves the cost of one call instruction and perhaps corresponding parameter passing cost? In other words, whether the 101 samples can be applied to inlined B&C depends on how well the matching works for B & C?
> > > > > 
> > > > > BTW, how much difference in terms of the staleness number did you see with switching to using nested profile?
> > > > Sorry I may not fully understand your question.  Here the metrics is computed before the fuzzy matching, i,e. if it's a mismatch in the source code, even later the fuzzy matching recover to the right location, we still count that location is a mismatch. We haven't done the post fuzzy matching metrics.
> > > > 
> > > > > In your examples, the 101 samples for B & C may still be used if B & C have no source changes.
> > > > This can't happen in the current implementation, if the mismatch happens, the 101 samples is never used no matter whether B&C is inlined/matched or not, because In A, when `findCalleeFunctionSamples ` return null, it won't visit A's children profiles, the children B &C profile won't be merged into the base B & C profile.
> > > > 
> > > > > BTW, how much difference in terms of the staleness number did you see with switching to using nested profile?
> > > > It should matter, I saw a case about 2 times differences, it depends on how hot the dropped children profile is. 
> > > > This can't happen in the current implementation, if the mismatch happens, the 101 samples is never used no matter whether B&C is inlined/matched or not, because In A, when findCalleeFunctionSamples  return null, it won't visit A's children profiles, the children B &C profile won't be merged into the base B & C profile.
> > > 
> > > I see. It makes sense now. A callee profile can be returned only if it is discovered at some callsite in the caller. Thanks for the explanation!
> > Left a question in the other thread https://reviews.llvm.org/D156725, but I think it may also be relavant here.
> > 
> > The question is, when and where will the nested callee samples be counted if they are not inlined but its checksum is mismatched. In this case, the callee samples can be returned during inlining.
> > 
> > It sounds like we may need to selectively flatten a profile for more accurate counting, e.g, only flattening matched callsites. Maybe what Wenlei proposed in the other patch is something similar. 
> > 
> > I feel that to be more accurate, the mismatch counter needs to more closely mimic what the sample inliner does, or the other way around, i.e, to fix the inliner to return samples for callsites that don't exist on the current IR.
> > 
> > Not sure how important the problem is though. But if we really want to be as accurate as enough, maybe combining the matcher and the inliner can get us there.
> > 
> > 
> > It sounds like we may need to selectively flatten a profile for more accurate counting, e.g, only flattening matched callsites. Maybe what Wenlei proposed in the other patch is something similar.
> 
> Making mismatch counting and inliner's use of profile consistent will make the mismatch/stale metric more meaningful. However, I think there are still things to be improved in terms of how inliner uses profile - i.e. even if top level function mismatches, nested profile may still be reused. 
> 
> My suggestion is to prioritize better use of profile first, before do everything we can to align mismatch counting. 
Yeah, I agree there are two remaining issues:

1) The inliner. Currently we don't reused the mismatched callsite's children profile. 

2) The mismatch metric. Yes, even with current fix, we only count the top-level profile's mismatched callsite, we don't count the nested inlined profile's mismatched callsite. 

IMO, the solution for #1 is we can do the similar thing as `promoteMergeNotInlinedContextSamples`.

the solution for #2 is we can do the similar things as `Compute checksum mismatch recursively on nested profile `, also count the nested profile. another solution is to extend `SampleCoverageTracker` (https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseUtil.h#L35)

`SampleCoverageTracker` is the thing that count the coverage(mismatch) along the sample loader(inlining), however, it only support BodySamples, we can extend the to `callsiteSamples`. 

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158891/new/

https://reviews.llvm.org/D158891