[PATCH] D22778: Add Loop Sink pass to reverse the LICM based of basic block frequency.

Wed Oct 19 00:49:30 PDT 2016

chandlerc added a comment.

This is really close. Some minor nit picks and a few more interesting comments below.

================
Comment at: include/llvm/Transforms/Utils/LoopUtils.h:472-474
+/// If SafetyInfo is nullptr, we are checking for sinking instructions from
+/// preheader to loop body (no speculation).
+/// If SafetyInfo is not nullptr, we are checking for hoisting/sinking
----------------
chandlerc wrote:
> Here and elsewhere in comments, I would just say "is null" rather than "is nullptr".
You changed one 'nullptr' to 'null' but missed the other.

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:11-12
+// This pass does the inverse transformation of what LICM does.
+// It traverses all of the instructions in the loop's preheader and sink
+// them/ to the loop body where frequency is lower than the loop's preheader.
+// This pass is a reverse-transformation of LICM. It differs from the Sink
----------------
"and sink them/ to" -> "and sinks them to"

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:101
+///
+/// /p SortedLoopBBs is used to help find the optimal sinking locations.
+/// It stores a list of BBs that is:
----------------
"/p" -> "\p"

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:159
+                            BlockFrequencyInfo &BFI) {
+  if (I.getNumUses() > MaxNumberOfUsesForSinking)
+    return false;
----------------
The number of user *instructions* isn't really the right thing to apply the threshold to as that doesn't directly change the cost.

The idea is that we need the size of `BBsToSinkInto` to be a small constant in order for the search for the coldest dominating set to be "just" linear in the number of blocks in the loop.

So while a threshold of "40" may make sense for number of user instructions, I suspect the threshold should be much smaller when applied to the size of `BBsToSinkInto`.

I also think you should add two comments about this. One, you should comment to the `findBBsToSinkInto` function clarifying the algorithmic complexity (That it O(N * M) or O(M^2) where N is SortedLoopBBs.size() and M is BBsToSinkInto.size()), and you should mention where you check this threshold that the reason is because we're going to call `findBBsToSinkInto` which would go quadratic if we didn't set a cap.

The reason for all of this is that I'm worried some future maintainer will come along and not really understand how risk it is to adjust these thresholds so I think it is important to document the implications.

I still think we will long-term need a better algorithmic approach here as I suspect we'll find annoying cases where the threshold blocks an important optimization (such as when there are way too many initial BBsToInsertInto but there are a small number of common dominating blocks). But I understand this is one of the really hard problems (its the same as optimal PHI placement and a bunch of other ones), and I don't want to hold up your patch on a comprehensive approach here.

On an unrelated note, you should also document that this threshold has a secondary function: it places an upper bound on how much code growth we may trigger here. I'd document this in particular as that seems somewhat accidental and I suspect we may long-term want a better threshold for that. I would in fact encourage you to leave a FIXME to adjust this for min_size and opt_size.

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:178-179
+      findBBsToSinkInto(L, BBs, SortedLoopBBs, DT, BFI);
+  if (BBsToSinkInto.size() == 0)
+    return false;
+
----------------
We generally prefer calling `.empty()` to testing `.size()` against zero.

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:184
+                             BBsToSinkInto.end());
+  std::stable_sort(SortedBBsToSinkInto.begin(), SortedBBsToSinkInto.end(),
+                   [&](BasicBlock *A, BasicBlock *B) {
----------------
Here you don't need stable sort since this is a total ordering. You should just use std::sort and mention that this is a known total ordering in a comment. You could do that in an overall comment htat explains what you're doing here:

  // Copy the final BBs into a vector and sort them using the total ordering
  // of the loop block numbers as iterating the set doesn't give a useful
  // order. No need to stable sort as the block numbers are a total ordering.

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:194
+
+  for (++BI; BI != BBsToSinkInto.end(); ++BI) {
+    BasicBlock *N = *BI;
----------------
You didn't actually switch to the sorted list here.

Also, you can just use a range based for loop here.

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:218
+/// sum frequency of inserted copy is smaller than preheader's frequency.
+bool sinkLoop(Loop &L, AAResults &AA, LoopInfo &LI, DominatorTree &DT,
+              BlockFrequencyInfo &BFI, ScalarEvolution *SE) {
----------------
This routine doesn't sink the loop, so a name `sinkLoop` seems confusing. Maybe `sinkLoopInvariantInstructions`?

Also, I think this should be a static function.

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:241-246
+  SmallVector<BasicBlock *, 10> SortedLoopBBs;
+  DenseMap<BasicBlock *, int> LoopBlockNumber;
+  int i = 0;
+  for (BasicBlock *B : L.blocks())
+    if (BFI.getBlockFreq(B) <= BFI.getBlockFreq(L.getLoopPreheader())) {
+      SortedLoopBBs.push_back(B);
----------------
It wasn't obvious to me reading this that `SortedLoopBBs` only contained the oop BBs that are less hot than the preheader. I think it might be nice to clue the reader in that this isn't *all* the loop BBs. Maybe `SortedColdLoopBBs`? Or just `ColdLoopBBs`? If you make this change, I'd keep the name consistent throughout of course.

Also, you use `<=` here, but `<` everywhere else I see, any particular reason to include BBs in this list with the same frequency?

https://reviews.llvm.org/D22778