[PATCH] D16381: Infrastructure to allow use of PGO in inliner

Chandler Carruth via llvm-commits <llvm-commits at lists.llvm.org>
Tue Mar 8 04:20:01 PST 2016


chandlerc added a comment.

First and foremost, sorry about snapping earlier. I shouldn't have done that; I was frustrated and not communicating very effectively. Thanks to Sean and Hal and others who wrote constructive and helpful emails to get this back on the rails. Secondly, sorry that I've neglected this patch for so long. I kept prioritizing the actual pass manager work over it, and I should have at least written this up sooner.

So I do have some serious concerns with the approach. I think it is probably a nice way to prototype things and understand their impact (which is very important!), but several aspects of the design seem wrong to me. I've detailed them below.

I've said this a few times before, but I continue to think that while some of the issues I've raised below could be addressed, doing so may not be the best use of time. I think it would probably be more effective to help with the pass manager effort so that it finishes more rapidly. There is a large amount of work that is essentially unblocked right now to port function passes over, and all of it parallelizes easily.

Once that is done, I would work on designing a good interprocedural profile analysis interface with a good update API, and then teach the inliner to use that. I think that the process of porting the inliner to the new pass manager will end up being a fairly substantial change (essentially a re-write) which will make integrating this much cleaner and easier.

I'm somewhat worried that in its current form, this change would actually make it *harder* to port the inliner and more work to introduce the correct long-term design.

Below are detailed comments about the design issues, but also quite a few comments about serious API, code, and implementation issues with the patch as it currently stands. I think it still needs substantial work to go into the tree, even if the high level design concern above is addressed.


================
Comment at: llvm/trunk/include/llvm/Analysis/InlineCost.h:42-55
@@ -40,1 +41,16 @@
 
+/// \brief Block frequency analysis for multiple functions.
+/// This class mimics block frequency analysis on CGSCC level. Block frequency
+/// info is computed on demand and cached unless they are invalidated.
+class BlockFrequencyAnalysis {
+private:
+  DenseMap<Function *, BlockFrequencyInfo *> BFM;
+
+public:
+  ~BlockFrequencyAnalysis();
+  /// \brief Returns BlockFrequencyInfo for a function.
+  BlockFrequencyInfo *getBlockFrequencyInfo(Function *);
+  /// \brief Invalidates block frequency info for a function.
+  void invalidateBlockFrequencyInfo(Function *);
+};
+
----------------
This is pretty much re-implementing a tiny part of the new pass manager inside the inline cost analysis. That really feels like the wrong approach to me. Notably, this is even named exactly the same as we would want to name a port of BlockFrequencyInfo to the new pass manager.

I don't agree with this approach. Fundamentally, this is not the right long-term design for how the inliner should access profile information, and it actually obstructs getting the right design in place.
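
For reference, once a real port of BlockFrequencyInfo to the new pass manager exists, a consumer would get at it through the analysis manager rather than through a hand-rolled cache. A rough sketch of the shape (not the actual interface), where FAM is a FunctionAnalysisManager:

  // Sketch: the analysis manager owns the caching and invalidation of
  // per-function results; the inliner would simply ask for them.
  BlockFrequencyInfo &BFI = FAM.getResult<BlockFrequencyAnalysis>(F);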

================
Comment at: llvm/trunk/include/llvm/Analysis/InlineCost.h:151-152
@@ -132,1 +150,4 @@
+
+/// \brief Return estimated count of the block \p BB.
+Optional<uint64_t> getBlockCount(BasicBlock *BB, BlockFrequencyAnalysis *BFA);
 }
----------------
This seems like a surprising API to expose at the top level from the inline cost analysis.

================
Comment at: llvm/trunk/include/llvm/Transforms/IPO/InlinerPass.h:30-37
@@ -28,1 +29,10 @@
 
+// Functor invoked when a block is cloned during inlining.
+typedef std::function<void(const BasicBlock *, const BasicBlock *)>
+    BlockCloningFunctor;
+// Functor invoked when a function is inlined inside the basic block
+// containing the call.
+typedef std::function<void(BasicBlock *, Function *)> FunctionCloningFunctor;
+// Functor invoked when a function gets deleted during inlining.
+typedef std::function<void(Function *)> FunctionDeletedFunctor;
+
----------------
These functor typedefs are a bit confusing. First off, std::function is quite expensive (type erasure and potential heap allocation). Is this overhead going to cause a compile-time issue?

The latter two of these typedefs aren't used; please don't leave dead code in the patch.

Lastly, having them at global scope seems like bad design. They should be part of some API rather than free-floating globals.
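
If the callback never has to outlive the call that receives it, a non-owning llvm::function_ref (llvm/ADT/STLExtras.h) would avoid std::function's type-erasure and allocation costs. A sketch only:

  // Non-owning callback; valid only for the duration of the call that
  // receives it, so it must not be stashed in long-lived state.
  typedef llvm::function_ref<void(const BasicBlock *, const BasicBlock *)>
      BlockCloningCallback;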

================
Comment at: llvm/trunk/include/llvm/Transforms/IPO/InlinerPass.h:82
@@ -71,1 +81,3 @@
   bool shouldInline(CallSite CS);
+  /// Set the BFI of \p Dst to be the same as \p Src.
+  void copyBlockFrequency(BasicBlock *Src, BasicBlock *Dst);
----------------
I would keep vertical space between declarations. It makes the comments much easier to read.

================
Comment at: llvm/trunk/include/llvm/Transforms/IPO/InlinerPass.h:102
@@ +101,3 @@
+  std::unique_ptr<BlockFrequencyAnalysis> BFA;
+  /// Are we using profile guided optimization?
+  bool HasProfileData;
----------------
I would say "Indicates whether we are using profile guided optimization." in the comment.

However, I'm surprised we need both a bool and a unique_ptr -- wouldn't the nullness of the pointer indicate the same thing as the bool?
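
For illustration (the accessor name here is made up), letting the pointer's nullness carry that information would look something like:

  // Sketch: derive the flag from the analysis pointer instead of keeping a
  // separate bool that can fall out of sync with it.
  bool hasProfileData() const { return BFA != nullptr; }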

================
Comment at: llvm/trunk/include/llvm/Transforms/Utils/Cloning.h:51-52
@@ -50,1 +50,4 @@
 
+typedef std::function<void(const BasicBlock *, const BasicBlock *)>
+    BlockCloningFunctor;
+
----------------
Two headers defining the same typedef is in essence an ODR hazard, and at the very least duplication that will drift out of sync. This should be declared in exactly one place.

================
Comment at: llvm/trunk/include/llvm/Transforms/Utils/Cloning.h:164
@@ -161,1 +163,3 @@
+                               ClonedCodeInfo *CodeInfo = nullptr,
+                               BlockCloningFunctor Ftor = nullptr);
 
----------------
Initializing a functor from nullptr is pretty surprising. I would just use `BlockCloningFunctor Ftor = BlockCloningFunctor()` instead.

However, I think this is an indication of a larger design issue. The fact that you need to thread this cloning functor between very disparate layers of the code isn't good. I feel like the cloning layer should really be taught to directly support updating profile information when cloning. That may not be terribly reasonable to implement currently, because the inliner is essentially maintaining a mini form of the pass manager internally and needs some way to thread updates through these layers. If that's the problem, I think it really is an indication that this needs to wait until we have the pass manager fixed so that we can layer this logic more cleanly.

================
Comment at: llvm/trunk/lib/Analysis/InlineCost.cpp:585-588
@@ -574,1 +584,6 @@
 
+// Adjust the threshold based on callsite hotness. Currently this is a nop.
+int CallAnalyzer::getAdjustedThreshold(int Threshold,
+                                       Optional<uint64_t> CallSiteCount
+                                       __attribute__((unused))) {
+  // FIXME: The new threshold should be computed from the given Threshold and
----------------
It's a bit odd that this isn't a doxygen comment, and that the parameter is marked as "unused" with a non-portable attribute. What's the plan here?
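
For example, something along these lines (just a sketch, keeping your FIXME) documents the parameter and avoids the non-portable attribute:

  /// \brief Adjust \p Threshold based on the hotness of the call site.
  /// \p CallSiteCount is not consulted yet; see the FIXME in the body.
  int CallAnalyzer::getAdjustedThreshold(int Threshold,
                                         Optional<uint64_t> CallSiteCount) {
    (void)CallSiteCount; // Suppresses the unused-parameter warning portably.
    // FIXME: Compute the new threshold from Threshold and CallSiteCount.
    return Threshold;
  }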

================
Comment at: llvm/trunk/lib/Analysis/InlineCost.cpp:618-620
@@ -598,2 +617,5 @@
   }
+  Optional<uint64_t> CallSiteCount =
+      llvm::getBlockCount(CS.getInstruction()->getParent(), BFA);
+  Threshold = getAdjustedThreshold(Threshold, CallSiteCount);
 
----------------
Separating this out into two functions doesn't make a lot of sense here; you now have two parts of the code handling profile-based inlining adjustments.

I think you should keep all of the profile-based adjustments inline here and integrate them with the function count logic above.

================
Comment at: llvm/trunk/lib/Analysis/InlineCost.cpp:1570-1576
@@ +1569,9 @@
+  Function *F = BB->getParent();
+  Optional<uint64_t> EntryCount = F->getEntryCount();
+  if (!EntryCount)
+    return None;
+  BlockFrequencyInfo *BFI = BFA->getBlockFrequencyInfo(F);
+  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
+  uint64_t FunctionEntryFreq = BFI->getEntryFreq();
+  uint64_t BBCount = EntryCount.getValue() * BBFreq / FunctionEntryFreq;
+  return BBCount;
----------------
So, inside this routine, you're actually implementing a proper interprocedural profile analysis. I really think this needs to be extracted to provide a high level analysis API around interprocedurally weighted profile information.
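
For concreteness, here is the math above factored into the kind of helper such an analysis could expose (the name and placement are hypothetical):

  /// Estimate how many times \p BB executes by scaling the function's
  /// profiled entry count by the block's frequency relative to the entry.
  Optional<uint64_t> getBlockProfileCount(const BasicBlock *BB,
                                          BlockFrequencyInfo &BFI) {
    Optional<uint64_t> EntryCount = BB->getParent()->getEntryCount();
    if (!EntryCount)
      return None;
    uint64_t BBFreq = BFI.getBlockFreq(BB).getFrequency();
    return EntryCount.getValue() * BBFreq / BFI.getEntryFreq();
  }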

================
Comment at: llvm/trunk/lib/Analysis/InlineCost.cpp:1591-1595
@@ +1590,7 @@
+    return Iter->second;
+  // We need to create a BlockFrequencyInfo object for F and store it.
+  DominatorTree DT;
+  DT.recalculate(*F);
+  LoopInfo LI(DT);
+  BranchProbabilityInfo BPI(*F, LI);
+  BlockFrequencyInfo *BFI = new BlockFrequencyInfo(*F, BPI, LI);
----------------
While it happens that BPI doesn't keep any references or pointers back into the dominator tree or loop info, the APIs of these analyses would actually make it OK for it to do so. That makes this code a rather subtle mini-implementation of the core logic of the pass manager, living inside the inline cost analysis, and that worries me some. At the very least it would need a lot of documentation to make clear exactly what is going on and why it is safe.

================
Comment at: llvm/trunk/lib/Transforms/IPO/Inliner.cpp:376-380
@@ +375,7 @@
+      BFA->getBlockFrequencyInfo(CallBB->getParent());
+  // Find the number of times OrigBB is executed per invocation of the callee
+  // and multiply by the number of times callee is executed in the caller.
+  // Freq(NewBB) = Freq(OrigBB) * CallSiteFreq / CalleeEntryFreq.
+  uint64_t CallSiteFreq = CallerBFI->getBlockFreq(CallBB).getFrequency();
+  uint64_t CalleeEntryFreq = CalleeBFI->getEntryFreq();
+  // Frequency of OrigBB in the callee.
----------------
You're dividing by CalleeEntryFreq because the original value was scaled up by that. This essentially strips some precision off. It would seem better to instead directly access the unscaled block frequency for the old BB, and then scale it correctly for the post-inlined state.

================
Comment at: llvm/trunk/lib/Transforms/IPO/Inliner.cpp:383-384
@@ +382,4 @@
+  BlockFrequency OrigBBFreq = CalleeBFI->getBlockFreq(OrigBB);
+  CallerBFI->setBlockFreq(NewBB, (double)(OrigBBFreq.getFrequency()) /
+                                     CalleeEntryFreq * CallSiteFreq);
+}
----------------
We have worked very hard throughout the PGO analysis infrastructure to not rely on actual hardware floating point. We have a broad collection of carefully written math utilities for this. One part of factoring this logic into a real IPO profile update routine would be to use the same facilities and infrastructure that the existing profile (and profile update) analyses use for this purpose.
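
As a sketch of what that could look like with ScaledNumber (llvm/Support/ScaledNumber.h), the fixed-point type the block frequency implementation itself is built on -- treat the exact helper calls as illustrative, not definitive:

  // Fixed-point version of OrigBBFreq / CalleeEntryFreq * CallSiteFreq,
  // with no hardware floating point involved.
  typedef ScaledNumber<uint64_t> Scaled64;
  Scaled64 NewFreq = Scaled64::get(OrigBBFreq.getFrequency()) /
                     Scaled64::get(CalleeEntryFreq) *
                     Scaled64::get(CallSiteFreq);
  CallerBFI->setBlockFreq(NewBB, NewFreq.toInt<uint64_t>());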

================
Comment at: llvm/trunk/lib/Transforms/IPO/Inliner.cpp:570
@@ +569,3 @@
+        if (HasProfileData)
+          BCF = std::bind(&Inliner::updateBlockFreq, this, CS, _1, _2);
+        InlineFunctionInfo InlineInfo(&CG, ACT, BCF);
----------------
Please use lambdas instead of std::bind -- they make the code much more readable as it is easier to understand how they behave. It also avoids the pain of using placeholder values.
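
For example, the bind above could instead read:

  if (HasProfileData)
    BCF = [this, CS](const BasicBlock *OrigBB, const BasicBlock *NewBB) {
      updateBlockFreq(CS, OrigBB, NewBB);
    };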

================
Comment at: llvm/trunk/lib/Transforms/IPO/Inliner.cpp:759
@@ +758,3 @@
+    Function *F = CG.removeFunctionFromModule(CGN);
+    invalidateBFI(F);
+    delete F;
----------------
I'm really concerned by the need to manually invalidate BFI in so many places. I think this is going to prove a very error-prone pattern to maintain long term.


Repository:
  rL LLVM

http://reviews.llvm.org/D16381




