[PATCH] D121643: [llvm-profgen] On-demand pseudo probe decoding

Tue Mar 22 22:21:08 PDT 2022

wenlei added inline comments.

================
Comment at: llvm/tools/llvm-profgen/ProfileGenerator.cpp:374

+void ProfileGeneratorBase::collectProfiledFunctions() {
+  std::unordered_set<const BinaryFunction *> ProfiledFunctions;
----------------
wlei wrote:
> hoy wrote:
> > wlei wrote:
> > > wlei wrote:
> > > > For CS profile,  the input contains CallStack + LBRStack, but here it only considers the addresses from LBRStack(Range + Branch) to compute profiledFunctions. Does it miss some functions whose address(frame) is only from CallStack?
> > > > 
> > > > For example, if we only have one following entry:
> > > > [main @ foo] : {rangeFrom : rangeTo}
> > > >  
> > > > Supposing rangeFrom and rangeTo only belong to `foo`(foo is not inlined to main), then the result of ProfiledFunctions only contains `foo` and misses `main`, right?
> > > > 
> > > > From the term "ProfiledFunctions", it should be `foo`, but for CS inlining, I guess we still need the frame `main`?
> > > > 
> > > Sorry, I missed your notes on the summary. 
> > > > "Note that in on-demand mode, a calling context may be truncated at a caller which does not have sample. This is a discrepancy from the complete mode. However, with the CS preinliner such context will always be truncated from the compiler inlining point of view." 
> > > 
> > > So in on-demand mode, it will implicitly trim some contexts, even before CS-preinliner or cold-context-trimming. 
> > Yes, and those contexts, if not truncated by the CS preinliner, will also be truncated by the compiler since the sample loader wouldn't consider any inlining in a non-profiled function.
> Thanks for the clarification. Then it looks like there can be  a future work to compute the ProfiledFunctions in early time and truncate the callStack during unwinding so that `SampleCounters` is reduced to save parsing time and memory.
> if not truncated by the CS preinliner, will also be truncated by the compiler since the sample loader wouldn't consider any inlining in a non-profiled function. 

Actually this is arguable. Top-down inlining could chose to also inline cold calls in certain cases. E.g. even if A->B is cold, inlining A->B->C provides an opportunity for B->C to specialize under A->B->C, without affecting D->B->C path. 

What you described could be benign at the moment, but I'm wondering if we keep main like Lei suggested, what's the extra memory overhead? Semantically, keeping the old behavior might be better because the generated profile is supposed to have complete context before preinliner/trimming. Ofc, if that behavior bloats memory too much, the way you optimize it here by trimming earlier helps.

================
Comment at: llvm/tools/llvm-profgen/ProfiledBinary.cpp:382
               reinterpret_cast<const uint8_t *>(Contents.data()),
-              Contents.size()))
+              Contents.size(), ProfiledGuids))
         exitWithError("Pseudo Probe decoder fail in .pseudo_probe section");
----------------
hoy wrote:
> wenlei wrote:
> > wenlei wrote:
> > > How about completely unify the workflow of on-demand vs full decoding? 
> > > 
> > > The only difference would be - for full decoding, we just need to provide a full Guid set; and for on-demand, we provide ProfiledGuids. 
> > > 
> > > Correct me if I'm wrong, but I think this was we keep the same behavior for full decoding as of today (no context truncation before preinliner), and we can also remove ProbeStack and replace it with AddressStack everywhere. 
> > > The only difference would be - for full decoding, we just need to provide a full Guid set; and for on-demand, we provide ProfiledGuids.
> > 
> > missed this one? By only difference, I meant the only place that need to check `DecodeProbeForProfiledFunctionsOnly`. 
> Right now for full decoding we provide an empty `ProfiledGuids` to the decoder instead of computing Guids for all disassembled functions which could be expansive.  The only place that requires full decoding is when --show-assembly-only is on. Otherwise it is kind of unified?
> 
> I'm adding an assert like this
> 
> ```
>   assert((!ProfiledFunctions.empty() || !DecodeProbeForProfiledFunctionsOnly ||
>           ShowDisassemblyOnly) &&
>          "Profiled functions should not be empty in on-demand probe decoding "
>          "mode");
> ```
> 
> We can also use a 0 as a guid to represent all functions. WDYT?
Passing empty guid for full decoding works too. Then can we only check `DecodeProbeForProfiledFunctionsOnly` within `decodePseudoProbe` and pass empty Guid when the flag is false?

Currently `DecodeProbeForProfiledFunctionsOnly` is checked at many places before calling `decodePseudoProbe`. What I meant was that regardless of that flag, `decodePseudoProbe` should always be called at the same place (except for `ShowDisassemblyOnly`), but the flag only control what is passed as GuidFilter.   

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121643/new/

https://reviews.llvm.org/D121643