[PATCH] D16381: Infrastructure to allow use of PGO in inliner

Tue Mar 8 14:11:14 PST 2016

eraman added a comment.

Thanks for the detailed comments Chandler. The comments related to problems around API, code and implementation are very helpful and I'll address them.

Your main concern is that this patch tries to provide the functionality of pass manager inside inline cost analysis. The intention is indeed to provide a restricted subset of pass manager's functionality for the inliner to use, but with the understanding that this is a temporary fix for inliner that should be easily replaceable when this functionality is available in PM. If this doesn't meet those goals (restricted to inliner and takes minimal effort to replace it by the new PM), I'll happily iterate on this patch to address that concern. I have responded inline to the comments related to this aspect.

Thanks,
Easwaran


================
Comment at: llvm/trunk/include/llvm/Analysis/InlineCost.h:42-55
@@ -40,1 +41,16 @@
 
+/// \brief Block frequency analysis for multiple functions.
+/// This class mimics block frequency analysis on CGSCC level. Block frequency
+/// info is computed on demand and cached unless they are invalidated.
+class BlockFrequencyAnalysis {
+private:
+  DenseMap<Function *, BlockFrequencyInfo *> BFM;
+
+public:
+  ~BlockFrequencyAnalysis();
+  /// \brief Returns BlockFrequencyInfo for a function.
+  BlockFrequencyInfo *getBlockFrequencyInfo(Function *);
+  /// \brief Invalidates block frequency info for a function.
+  void invalidateBlockFrequencyInfo(Function *);
+};
+
----------------
chandlerc wrote:
> This is pretty much re-implementing a tiny part of the new pass manager inside the inline cost analysis. That really feels like the wrong approach to me. Notably, this is even named exactly the same as we would want to name a port of BlockFrequencyInfo to the new pass manager.
> 
> I don't agree with this approach. Fundamentally, this is not the right long-term design for how the inliner should access profile information, and it actually obstructs getting the right design in place.
This is not *intended* to be a long term solution. The idea is to provide an interface similar to what the new pass manager provides so that when it lands, this code can be ripped off. I can rename it and/or move this code to a separate file if that will make the migration smoother.

================
Comment at: llvm/trunk/include/llvm/Analysis/InlineCost.h:151-152
@@ -132,1 +150,4 @@
+
+/// \brief Return estimated count of the block \p BB.
+Optional<uint64_t> getBlockCount(BasicBlock *BB, BlockFrequencyAnalysis *BFA);
 }
----------------
chandlerc wrote:
> This seems like a surprising API to expose at the top level from the inline cost analysis.
I'll move it to ProfileCommon.h. I left it here because I don't want to make anything outside of inliner to use BlockFrequencyAnalysis

================
Comment at: llvm/trunk/include/llvm/Transforms/Utils/Cloning.h:164
@@ -161,1 +163,3 @@
+                               ClonedCodeInfo *CodeInfo = nullptr,
+                               BlockCloningFunctor Ftor = nullptr);
 
----------------
chandlerc wrote:
> Initializing a functor from nullptr is pretty surprising. I would just use `BlockCloningFunctor Ftor = BlockCloningFunctor()` instead.
> 
> However, I think this is an indication of a larger design issue. I think the fact that you need to thread this cloning functor between very disparate layers of the code isn't good. I feel like the cloning layer should really be taught to directly support updating profile information when cloning. This of course may not be terrible reasonable to implement currently, as you have the inliner essentially maintaining a mini form of the pass manager internally and need to thread some way to update it through these layers. If that's the problem, I think it really is indicating that this needs to wait until we have the pass manager fixed so that we can layer this logic more cleanly.
I think clients that invoke the cloning code might want to update different analyses or choose to not update at all. IMO, sinking all that into the cloning code doesn't seem a good idea,

================
Comment at: llvm/trunk/lib/Analysis/InlineCost.cpp:1570-1576
@@ +1569,9 @@
+  Function *F = BB->getParent();
+  Optional<uint64_t> EntryCount = F->getEntryCount();
+  if (!EntryCount)
+    return None;
+  BlockFrequencyInfo *BFI = BFA->getBlockFrequencyInfo(F);
+  uint64_t BBFreq = BFI->getBlockFreq(BB).getFrequency();
+  uint64_t FunctionEntryFreq = BFI->getEntryFreq();
+  uint64_t BBCount = EntryCount.getValue() * BBFreq / FunctionEntryFreq;
+  return BBCount;
----------------
chandlerc wrote:
> So, inside this routine, you're actually implementing a proper interprocedural profile analysis. I really think this needs to be extracted to provide a high level analysis API around interprocedurally weighted profile information.
Ok. 

================
Comment at: llvm/trunk/lib/Analysis/InlineCost.cpp:1591-1595
@@ +1590,7 @@
+    return Iter->second;
+  // We need to create a BlockFrequencyInfo object for F and store it.
+  DominatorTree DT;
+  DT.recalculate(*F);
+  LoopInfo LI(DT);
+  BranchProbabilityInfo BPI(*F, LI);
+  BlockFrequencyInfo *BFI = new BlockFrequencyInfo(*F, BPI, LI);
----------------
chandlerc wrote:
> While it happens that BPI doesn't keep any references or pointers back into the dominator tree or loop info, the APIs of these analyses actually would make that OK to do... Which makes this code a rather subtle mini implementation of the core logic of the pass manager but inside the inline cost analysis. That worries me some. At the very least it would need a lot of documentation to make it clear exactly what was going on and why this was a safe thing to do.
To reiterate, the intention is to *temporarily* provide the functionality that will be provided by the PM in a way that will make the transition easier. Agreed that this is not documented at all. Will fix that.

================
Comment at: llvm/trunk/lib/Transforms/IPO/Inliner.cpp:376-380
@@ +375,7 @@
+      BFA->getBlockFrequencyInfo(CallBB->getParent());
+  // Find the number of times OrigBB is executed per invocation of the callee
+  // and multiply by the number of times callee is executed in the caller.
+  // Freq(NewBB) = Freq(OrigBB) * CallSiteFreq / CalleeEntryFreq.
+  uint64_t CallSiteFreq = CallerBFI->getBlockFreq(CallBB).getFrequency();
+  uint64_t CalleeEntryFreq = CalleeBFI->getEntryFreq();
+  // Frequency of OrigBB in the callee.
----------------
chandlerc wrote:
> You're dividing by CalleeEntryFreq because the original value was scaled up by that. This essentially strips some precision off. It would seem better to instead directly access the unscaled block frequency for the old BB, and then scale it correctly for the post-inlined state.
The unscaled block frequency is meant to be used only for the initial BFI computation right? Moreover as we incrementally update the block frequency using  setBlockFrequency interface of BFI, there may be blocks that do not have an unscaled value associated with them,

================
Comment at: llvm/trunk/lib/Transforms/IPO/Inliner.cpp:759
@@ +758,3 @@
+    Function *F = CG.removeFunctionFromModule(CGN);
+    invalidateBFI(F);
+    delete F;
----------------
chandlerc wrote:
> I'm really concerned by the need to manually invalidate BFI in so many places. I think this is going to prove to be a very error prone pattern to maintain long term.
The alternative is to conservatively invalidate BFI at the beginning of the inliner loop. That is fine with me if the increase in compilation time is ok. 


Repository:
  rL LLVM

http://reviews.llvm.org/D16381