[PATCH] D22778: Add Loop Sink pass to reverse the LICM based of basic block frequency.

Mon Oct 17 13:04:22 PDT 2016

chandlerc added a comment.

Will try to make a full pass through, thanks for the extensive updates Dehao! One specific point of discussion below:

================
Comment at: lib/Transforms/Scalar/LoopSink.cpp:148-170
+  // For every iteration:
+  //   * Pick the ColdestBB from SortedLoopBBs
+  //   * Find the set BBsDominatedByColdestBB that satisfy:
+  //     - BBsDominatedByColdestBB is a subset of BBsToSinkInto
+  //     - Every BB in BBsDominatedByColdestBB is dominated by ColdestBB
+  //   * If Freq(ColdestBB) < Freq(BBsDominatedByColdestBB), remove
+  //     BBsDominatedByColdestBB from BBsToSinkInto, add ColdestBB to
----------------
danielcdh wrote:
> chandlerc wrote:
> > This seems really expensive. By my reading this is O(N * L * M) where L is the  number of basic blocks in a loop, N is the number of instructions we try to sink into that loop, and M is the number of basic blocks within the loop that use the instructions. If there is for example one hot basic block in the loop and a large number of cold basic blocks and all of the uses are in those cold basic blocks, it seems like this could become quite large.
> > 
> > Have you looked at other algorithms? Is there a particular reason to go with this one? (I've not thought about the problem very closely yet...)
> I initially started with an adhoc algorithm which is O(L * M), but Danny pointed out it is not optimal, so I changed to this optimal algorithm. The lower bound for any sinking algorithm is O(L*M), but if optimal solution is desired, O(N*L*M) is the best I can get.
> 
> Yes, this could be expensive when N is large. I practice, I did not see noticeable compile time increase in speccpu2006 benchmarks after applying this patch (and enable the pass in frontend).
> 
> How about we limit the N to be no more than a certain number to avoid expensive computation in extreme cases?
I'm not worried about N being large. I'm worried about the fact that L is >= to M and so it is quadratic in the number of basic blocks that use each instruction.

The other thing is that if this scales in compile time by N then it scales in compile time *by how much effect it is having*. If it scales in compile time by M^2, then we pay more and more compile time as loops get larger even if we only sink very few instructions.

I would either bound M to a small number, and/or look for some way to not have this be quadratic. It seems like a bunch of this should be pre-computable for the loop?

https://reviews.llvm.org/D22778