[PATCH] D28331: Improve PGO support for the new inliner

Fri Jan 20 11:29:00 PST 2017

eraman marked 9 inline comments as done.
eraman added inline comments.

================
Comment at: lib/Analysis/InlineCost.cpp:48-56
 static cl::opt<int> HintThreshold(
     "inlinehint-threshold", cl::Hidden, cl::init(325),
     cl::desc("Threshold for inlining functions with inline hint"));

+static cl::opt<int>
+    ColdCallSiteThreshold("inline-cold-callsite-threshold", cl::Hidden,
+                          cl::init(45),
----------------
Prazek wrote:
> davide wrote:
> > Is it possible to introduce another `cl::opt` here to disable this mode while in PGO? It would be extremely useful for testing the benefit of this feature compared to what's already there in PGO.
> Probably passing normal inlining threshold would be the same as disabling this mode.
> 
I'm generally wary of the proliferation of options, and as Piotr mentioned you could set -hot-callsite-threshold to the same value as -hinlinhint-threshold and -inline-cold-callsite-threshold to -cold-callee-threshold to get the same effect. 

================
Comment at: lib/Analysis/InlineCost.cpp:660-661
         Threshold = MaxIfValid(Threshold, Params.HotCallSiteThreshold);
       } else if (PSI->isFunctionEntryHot(&Callee)) {
+        DEBUG(dbgs() << "Hot callee.\n");
         // If callsite hotness can not be determined, we may still know
----------------
Prazek wrote:
> Are there any plans in the nearest future to only rely on caller hotness/coldness and not on callee hotness?
Not sure what qualifies as "nearest" future but the plan is to use the same threshold for cold callee, cold callsite and -Oz and once we fully migrate to the new PM based pipeline consider only the callsite's hotness.

================
Comment at: lib/Analysis/InlineCost.cpp:1588-1589

+  // Set the ColdCallSiteThreshold knob from the -cold-callsite-threshold.
+  Params.ColdCallSiteThreshold = ColdCallSiteThreshold;
+
----------------
davide wrote:
> Nit: the option is now called `-inline-cold-callsite-threshold`
Thanks. There is inconsistency in the way options are named, but I prefer to prefix inline- everywhere and will fix the other options in a separate patch.

================
Comment at: lib/Transforms/Utils/InlineFunction.cpp:1408-1409
+  SmallPtrSet<BasicBlock *, 16> ClonedBBs;
+  for (auto &OrigBB : *Callee) {
+    BasicBlock *ClonedBB = cast_or_null<BasicBlock>(VMap[&OrigBB]);
+    if (!ClonedBB)
----------------
chandlerc wrote:
> eraman wrote:
> > chandlerc wrote:
> > > I was hoping we could do still better by walking the inlined region in the caller so we would *only* touch cloned blocks. However, there isn't a map from cloned BB to original BB, which is why you can't do that. And why you wrote the code you did. Sorry for the confusion.
> > > 
> > > Unfortunately, I think walking the entire map is better than walking the entire callee as the map only contains the inlined part. Also, in all of the other code here we walk the map to find relatively sparse and infrequent things so it seems fine. Like I said, sorry for sending you down a wrong path, I hadn't thought about the lack of clone -> original BB.
> > But the size of the value map is O(|instructions in the cloned region|) and not O(|BBs in the cloned region|). My thinking was O(|instructions in the cloned region|) likely to be greater than O(|BBs in the original callee|) and that's why you preferred this. 
> Well, we already do tons of things O(|instructions in the cloned region|) including walking the VMap in at least three other places and the original InlineCost computation. This is typically OK because we have a constant bound on it (the threshold).
> 
> But we have no realistic bound on the number of basic blocks in the callee. So it might be faster on average but have much worse long tail.
> 
> I think I'd feel safer staying within the inlined region whether its instructions or basic blocks. And if this proves hot, we should just add a reverse map (at least for basic blocks) and walk the basic blocks in the cloned region.
Ok, I'll revert back to walking the VMap.

================
Comment at: test/Transforms/Inline/function-count-update-2.ll:12
+define i32 @caller1() !prof !2 {
+  %i = call i32 @callee()
+  ret i32 %i
----------------
davidxl wrote:
> Check no calls remaining.
There is nothing interesting in terms of inlining here (ie the decision to inline here doesn't exercise any code path associated with this patch), but if you insist I'll add the checks.

================
Comment at: test/Transforms/Inline/function-count-update-3.ll:13
+; known and hence the cost gets reduced.
+; Estimated count of a->e callsite = C2 * (C1 / C3)
+; Estimated count of a->e callsite = 250 * (200 / 500) = 100
----------------
davidxl wrote:
> Should it be C2 * (C1/C4) where C4 is the entry count of function C? which happens to be the same to C3 here. C3 is irrelevant
Good catch!

================
Comment at: test/Transforms/Inline/inline-cold-callee.ll:32
   %y2 = call i32 @callee2(i32 %y1)
   %y3 = call i32 @callee1(i32 %y2)
   ret i32 %y3
----------------
davidxl wrote:
> This test case is flawed.  callee1's entry count should be a least 50 given new change.
Yes and I think the best course of action is to disable the test for the new PM.  Excepting some profile insanity, there is no way a callee is going to be cold without all callsites to the callee being cold and callsite coldness takes precedence over callee precedence (and there is a separate test for that)

https://reviews.llvm.org/D28331