[PATCH] D94001: [CSSPGO] Call site prioritized inlining for sample PGO

Wei Mi via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jan 13 21:57:13 PST 2021


wmi added a comment.

In D94001#2495265 <https://reviews.llvm.org/D94001#2495265>, @wenlei wrote:

> In D94001#2494594 <https://reviews.llvm.org/D94001#2494594>, @wmi wrote:
>
>> Sorry for the delay in review. It makes a lot of sense to have a priority-based inliner for CSSPGO, since its profile annotation doesn't rely on replaying the inlining. But I don't understand why we rely on a BFS/DFS strategy to expose the hottest callsite for the priority-based inliner. In my mind, CSSPGO can know the hotness of each callsite with full context information. Can we just sort all the callsites based on the CSSPGO profile, then try to inline the hottest callsite under a specific context first, in a top-down manner? We may actually need to inline some parent callsites before we can inline the hottest callsite, but it would all be driven by the goal of inlining the hottest callsite first. If we worry that some callsite is too deep and we would need to inline through a long path before reaching it, we could add the depth into the priority computation. What do you think?
>
> Yeah, that's one step further than what I have in this patch. The key here is the prioritized work list, and BFS is just a natural by-product of using the priority queue. I called out BFS to emphasize the more balanced inlining, but that's not entirely accurate because it is only a true BFS when all priorities are identical. What you suggested is essentially increasing the scope of call sites to prioritize; currently it's the immediate callees. I agree that increasing the scope to multiple levels, or to the entire sub call tree, may be beneficial. But I don't think it conflicts with the approach in this patch; rather, it can be added on top of this implementation.
>
> With context profiles organized in a trie, it's relatively easy to get the hottest call sites in a sub call tree (assuming no stale profile). I think we can assign a total order/priority to these hot call sites, as well as to the call sites in the middle leading to them (call sites that lead to the hottest call site need to have their priority bumped up); then the priority queue will lead us to the hottest call sites first, even if they're several levels away. I can try increasing the scope of call sites for prioritization as a follow-up.

Ok, thanks. About the prioritized work list, I was also expecting it to consider all the possible candidates in the module, not just within a single function, so it will be disruptive in terms of the order of profile annotation and inlining. I agree it can be a follow-up since it needs more experiments.



================
Comment at: llvm/include/llvm/Transforms/IPO/SampleContextTracker.h:100
+  std::vector<const FunctionSamples *>
+  getIndirectCalleeContextSamplesFor(const DILocation *DIL);
   // Query context profile for a given location. The full context
----------------
Do we expect DIL to be the debug location of the indirect call?


================
Comment at: llvm/lib/Transforms/IPO/SampleContextTracker.cpp:43
 ContextTrieNode *
 ContextTrieNode::getChildContext(const LineLocation &CallSite) {
   // CSFDO-TODO: This could be slow, change AllChildContext so we can
----------------
How about getMaxCountChildContext?


================
Comment at: llvm/lib/Transforms/IPO/SampleContextTracker.cpp:223-225
+  ContextTrieNode *CallerNode = getContextFor(DIL);
+  LineLocation CallSite = FunctionSamples::getCallSiteIdentifier(DIL);
+  for (auto &It : CallerNode->getAllChildContext()) {
----------------
Is it possible for getContextFor to return nullptr for DIL, and do you need to handle such a case?


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:186
+    cl::desc(
+        "The size growth ratio limit for sample profile loader inlining."));
+
----------------
It will only be applied to priority-based sample profile loader inlining, right? The same applies to the descriptions of sample-profile-inline-limit-min/sample-profile-inline-limit-max.


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:368
+  bool operator()(const InlineCandidate &LHS, const InlineCandidate &RHS) {
+    return LHS.CallsiteCount < RHS.CallsiteCount;
+  }
----------------
Considering the case where the CallsiteCounts are equal: does every InlineCandidate have a non-null CalleeSamples? If so, FunctionSamples::GUID could be used as a tie-breaker to make the comparison more stable.


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:994
 
+  auto FSCompare = [](const FunctionSamples *L, const FunctionSamples *R) {
+    if (L->getEntrySamples() != R->getEntrySamples())
----------------
Add assertions to check that L and R are not nullptr?


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:1389
+  else if (CalleeSamples)
+    CallsiteCount = std::max(CallsiteCount, CalleeSamples->getEntrySamples());
+
----------------
CallsiteCount is definitely 0 before this line, so there is no need to use std::max:
CallsiteCount = CalleeSamples->getEntrySamples();


================
Comment at: llvm/lib/Transforms/IPO/SampleProfile.cpp:1487
+  bool Changed = false;
+  while (!CQueue.empty() && F.getInstructionCount() < SizeLimit) {
+    InlineCandidate Candidate = CQueue.top();
----------------
Several parts of this loop look quite similar to their counterparts in inlineHotFunctions. Is it possible to share them?


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D94001/new/

https://reviews.llvm.org/D94001


