[PATCH] D22778: Add Loop Sink pass to reverse the LICM based of basic block frequency.
Chandler Carruth via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 17 13:04:22 PDT 2016
chandlerc added a comment.
Will try to make a full pass through, thanks for the extensive updates Dehao! One specific point of discussion below:
Comment at: lib/Transforms/Scalar/LoopSink.cpp:148-170
+ // For every iteration:
+ // * Pick the ColdestBB from SortedLoopBBs
+ // * Find the set BBsDominatedByColdestBB that satisfy:
+ // - BBsDominatedByColdestBB is a subset of BBsToSinkInto
+ // - Every BB in BBsDominatedByColdestBB is dominated by ColdestBB
+ // * If Freq(ColdestBB) < Freq(BBsDominatedByColdestBB), remove
+ // BBsDominatedByColdestBB from BBsToSinkInto, add ColdestBB to
> chandlerc wrote:
> > This seems really expensive. By my reading this is O(N * L * M) where L is the number of basic blocks in a loop, N is the number of instructions we try to sink into that loop, and M is the number of basic blocks within the loop that use the instructions. If there is for example one hot basic block in the loop and a large number of cold basic blocks and all of the uses are in those cold basic blocks, it seems like this could become quite large.
> > Have you looked at other algorithms? Is there a particular reason to go with this one? (I've not thought about the problem very closely yet...)
> I initially started with an adhoc algorithm which is O(L * M), but Danny pointed out it is not optimal, so I changed to this optimal algorithm. The lower bound for any sinking algorithm is O(L*M), but if optimal solution is desired, O(N*L*M) is the best I can get.
> Yes, this could be expensive when N is large. I practice, I did not see noticeable compile time increase in speccpu2006 benchmarks after applying this patch (and enable the pass in frontend).
> How about we limit the N to be no more than a certain number to avoid expensive computation in extreme cases?
I'm not worried about N being large. I'm worried about the fact that L is >= to M and so it is quadratic in the number of basic blocks that use each instruction.
The other thing is that if this scales in compile time by N then it scales in compile time *by how much effect it is having*. If it scales in compile time by M^2, then we pay more and more compile time as loops get larger even if we only sink very few instructions.
I would either bound M to a small number, and/or look for some way to not have this be quadratic. It seems like a bunch of this should be pre-computable for the loop?
More information about the llvm-commits