[PATCH] D121643: [llvm-profgen] On-demand pseudo probe decoding

Hongtao Yu via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 23 12:38:25 PDT 2022


hoy added inline comments.


================
Comment at: llvm/tools/llvm-profgen/ProfileGenerator.cpp:374
 
+void ProfileGeneratorBase::collectProfiledFunctions() {
+  std::unordered_set<const BinaryFunction *> ProfiledFunctions;
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > hoy wrote:
> > > > wenlei wrote:
> > > > > wlei wrote:
> > > > > > hoy wrote:
> > > > > > > wlei wrote:
> > > > > > > > wlei wrote:
> > > > > > > > > For CS profile, the input contains CallStack + LBRStack, but here it only considers the addresses from the LBRStack (Range + Branch) to compute profiledFunctions. Does it miss functions whose addresses (frames) come only from the CallStack?
> > > > > > > > > 
> > > > > > > > > For example, if we only have one following entry:
> > > > > > > > > [main @ foo] : {rangeFrom : rangeTo}
> > > > > > > > >  
> > > > > > > > > Supposing rangeFrom and rangeTo belong only to `foo` (foo is not inlined into main), then the ProfiledFunctions result contains only `foo` and misses `main`, right?
> > > > > > > > > 
> > > > > > > > > From the term "ProfiledFunctions", it should be just `foo`, but for CS inlining, I guess we still need the frame `main`?
> > > > > > > > > 
> > > > > > > > Sorry, I missed your notes on the summary. 
> > > > > > > > > "Note that in on-demand mode, a calling context may be truncated at a caller which does not have sample. This is a discrepancy from the complete mode. However, with the CS preinliner such context will always be truncated from the compiler inlining point of view." 
> > > > > > > > 
> > > > > > > > So in on-demand mode, it will implicitly trim some contexts, even before CS-preinliner or cold-context-trimming. 
> > > > > > > Yes, and those contexts, if not truncated by the CS preinliner, will also be truncated by the compiler since the sample loader wouldn't consider any inlining in a non-profiled function.
> > > > > > Thanks for the clarification. Then it looks like future work could compute the ProfiledFunctions earlier and truncate the call stack during unwinding, so that `SampleCounters` is reduced to save parsing time and memory.
> > > > > > if not truncated by the CS preinliner, will also be truncated by the compiler since the sample loader wouldn't consider any inlining in a non-profiled function. 
> > > > > 
> > > > > Actually this is arguable. Top-down inlining could choose to also inline cold calls in certain cases. E.g. even if A->B is cold, inlining A->B->C provides an opportunity for B->C to specialize under A->B->C, without affecting the D->B->C path.
> > > > > 
> > > > > What you described could be benign at the moment, but I'm wondering: if we keep main like Lei suggested, what's the extra memory overhead? Semantically, keeping the old behavior might be better because the generated profile is supposed to have complete contexts before the preinliner/trimming. Of course, if that behavior bloats memory too much, the way you optimize it here by trimming earlier helps.
> > > > For now the sample inliner only processes functions with profiles. So for a cold function call like A->B, both A and B should have a profile in order for the call to be inlined. With this on-demand approach, if A is not sampled, we truncate the context to B only. Without this diff, we would generate an A:1 @ B-like profile, but then during top-down inlining, context promotion would truncate the context as well. So for now we wouldn't lose any opportunity.
> > > > 
> > > > This will be a problem in the future when we extend the top-down inliner to handle non-sampled functions.
> > > > 
> > > > Good point on moving the profiled function computation earlier so that we could do more truncation during unwinding to further save memory. So far our memory usage looks good.
> > > > 
> > > > I don't have numbers for how many more functions would have their probes decoded if we want to keep all contexts. Maybe we could revisit this once those contexts become useful?
> > > > 
> > > > So for now we wouldn't lose any opportunity. This will be a problem in the future when we extend the top-down inliner to handle non-sampled functions.
> > > Yes, we don't lose inlining now, but that's not the main concern. 
> > > 
> > > > I don't have numbers for how many more functions would have their probes decoded if we want to keep all contexts. Maybe we could revisit this once those contexts become useful?
> > > What I was thinking was that, if we keep all context probes decoded, this change will strictly not affect the output in any way, and we can remove the switch, which simplifies things and still saves memory.
> > > 
> > > I think that having full context also helps with verification even if we don't utilize that info for optimization now. So if the extra probes don't cost us much memory, I'm leaning towards keeping them and removing the switch, making the new behavior the default and only choice.
> > > 
> > > There's another alternative: we always decode only profiled functions (except for show-assembly); then a switch controls whether extra context probes are decoded.
> > Okay, keeping full contexts doesn't seem to incur much more cost; only 5% more functions are decoded. I'm including the change here.
> Cool, in that case, maybe remove the new switch altogether? Then you can also eliminate the many test changes that used the extra switch to keep the old behavior.
Sounds good.
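The range/branch-based collection discussed in this thread can be sketched as follows. This is a hypothetical, simplified stand-in for ProfileGeneratorBase::collectProfiledFunctions: the AddrMap lookup, getFuncForAddr helper, and the flattened range/branch inputs are invented for illustration and are not llvm-profgen's real data structures. It also shows the behavior Lei asked about: a caller that appears only in call-stack frames (and never in a range or branch sample) is not collected.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Minimal stand-in for llvm-profgen's BinaryFunction.
struct BinaryFunction {
  std::string FuncName;
};

// Stand-in for the binary's sorted start-address-to-function index.
using AddrMap = std::map<uint64_t, const BinaryFunction *>;

// Map an address to the function whose start address precedes it (illustrative;
// a real lookup would also check the function's end address).
const BinaryFunction *getFuncForAddr(const AddrMap &M, uint64_t Addr) {
  auto It = M.upper_bound(Addr);
  if (It == M.begin())
    return nullptr;
  return std::prev(It)->second;
}

// Collect every function touched by range or branch samples. As discussed
// above, a function appearing only in call-stack frames (e.g. a non-sampled
// caller `main`) is missed unless call stacks are also walked.
std::unordered_set<const BinaryFunction *> collectProfiledFunctions(
    const AddrMap &M,
    const std::vector<std::pair<uint64_t, uint64_t>> &Ranges,
    const std::vector<std::pair<uint64_t, uint64_t>> &Branches) {
  std::unordered_set<const BinaryFunction *> Profiled;
  for (const auto &R : Ranges) {
    if (const auto *F = getFuncForAddr(M, R.first))
      Profiled.insert(F);
    if (const auto *F = getFuncForAddr(M, R.second))
      Profiled.insert(F);
  }
  for (const auto &B : Branches) {
    if (const auto *F = getFuncForAddr(M, B.first))
      Profiled.insert(F);
    if (const auto *F = getFuncForAddr(M, B.second))
      Profiled.insert(F);
  }
  return Profiled;
}
```

With `main` at 0x1000 and `foo` at 0x2000, a single range sample {0x2010, 0x2020} collects only `foo`, matching the [main @ foo] example above.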


================
Comment at: llvm/tools/llvm-profgen/ProfiledBinary.cpp:366-368
+  std::unordered_set<uint64_t> ProfiledGuids;
+  for (auto *F : ProfiledFunctions)
+    ProfiledGuids.insert(Function::getGUID(F->FuncName));
----------------
wenlei wrote:
> hoy wrote:
> > wenlei wrote:
> > > When `DecodeProbeForProfiledFunctionsOnly` is off, are we supposed to keep `ProfiledGuids` empty here for buildAddress2ProbeMap, so all functions will be decoded?
> > The function `decodePseudoProbe` used to run in two places: once during the very early binary loading, and once during profile generation. The first time the function is executed, `ProfiledFunctions` should always be empty; the second time, the container may not be empty. This is confusing, so I'm now changing it to run only in profile generation, except for ShowDisassemblyOnly.
> > 
> > 
> Maybe I missed something, but the only place I see `decodePseudoProbe` run earlier is for `ShowDisassemblyOnly`. When ShowDisassemblyOnly is off, when do we call decodePseudoProbe with an empty GUID set?
> 
> And why do we need to run decodePseudoProbe early except for ShowDisassemblyOnly? That was part of the unification I mentioned.
That was in previous iterations. In the latest iteration, decodePseudoProbe is called with empty ProfiledFunctions only for ShowDisassemblyOnly.
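The GUID-gated decoding agreed on above (decode everything only when the profiled-function set is empty, i.e. the ShowDisassemblyOnly path; otherwise skip non-profiled functions) might look roughly like this. This is an illustrative sketch: getGUID here is a std::hash stand-in, whereas LLVM's real Function::getGUID is an MD5 hash of the function name, and ProbeFuncRecord is an invented placeholder for the probe section's per-function records.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

// Stand-in for Function::getGUID, which in LLVM is an MD5 of the name.
uint64_t getGUID(const std::string &FuncName) {
  return std::hash<std::string>{}(FuncName);
}

// Invented placeholder for one function's record in the probe section.
struct ProbeFuncRecord {
  std::string FuncName;
  // ... probe payload elided ...
};

// Decode probes only for functions whose GUID is in ProfiledGuids; an empty
// set (the ShowDisassemblyOnly path) means "decode everything". Returns the
// names of functions whose probes were decoded.
std::vector<std::string>
decodePseudoProbe(const std::vector<ProbeFuncRecord> &Records,
                  const std::unordered_set<uint64_t> &ProfiledGuids) {
  std::vector<std::string> Decoded;
  for (const auto &Rec : Records) {
    if (!ProfiledGuids.empty() && !ProfiledGuids.count(getGUID(Rec.FuncName)))
      continue; // skip this non-profiled function's probes
    Decoded.push_back(Rec.FuncName);
  }
  return Decoded;
}
```

With this gating, dropping the switch means profile generation always passes the non-empty profiled set, while ShowDisassemblyOnly passes an empty set and decodes all functions.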


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121643/new/

https://reviews.llvm.org/D121643


