[PATCH] D22630: Loop rotation

Thu Jul 28 14:08:27 PDT 2016

eli.friedman added inline comments.

================
Comment at: llvm/lib/Transforms/Scalar/LoopRotation.cpp:319
@@ +318,3 @@
+  // Collect all nodes of the loop from header to latch.
+  BasicBlock *NewH = collectSEMEBlocks(OrigH, OrigLatch, Blocks, Exits);
+  assert(NewH);
----------------
sebpop wrote:
> eli.friedman wrote:
> > hiraditya wrote:
> > > eli.friedman wrote:
> > > > hiraditya wrote:
> > > > > sebpop wrote:
> > > > > > eli.friedman wrote:
> > > > > > > sebpop wrote:
> > > > > > > > eli.friedman wrote:
> > > > > > > > > sebpop wrote:
> > > > > > > > > > hiraditya wrote:
> > > > > > > > > > > eli.friedman wrote:
> > > > > > > > > > > > Is it possible this will clone more code?  The trunk version would only clone the header; it looks like this will clone the whole loop body excluding the latch block?
> > > > > > > > > > > Yes, this will clone all the exiting basic blocks excluding the latch block (starting from the header until all it finds a BB which is not exiting). This is required to expose the redundancies that could be removed as a result of cloning the loop body. Currently there is a flag to control how much we want to clone (RotationMaxSize), I can add more specific flags to control how many exiting blocks we want to copy.
> > > > > > > > > > > In general, it would be a good idea to clone all the exiting basic blocks to expose all the redundancies.
> > > > > > > > > > > 
> > > > > > > > > > > until all it finds a BB which is not exiting
> > > > > > > > > > To rotate the loop from Michael (loop with if-then-else followed by loop exit) we could avoid stopping at the first BB which is not exiting the loop, and to continue until all loop exit edges have been duplicated.
> > > > > > > > > > 
> > > > > > > > > This is why I'd really like to see this patch split... it would be much easier to have this discussion with a patch which is specifically "Clone more blocks in loop rotation", so we could discuss what exactly makes sense independently of the changes to the cloning algorithm.
> > > > > > > > You mean let's commit this patch as is and then extend it to catch more loops?
> > > > > > > > Or do you mean we should restrict the new codegen algorithm to only catch the loops that are currently rotated?
> > > > > > > > Can you please be a bit more specific about which parts of this patch you would like to see split?
> > > > > > > I mean a patch with just the new algorithm, which behaves essentially the same way as the trunk loop rotation pass, with followups to change the rotation heuristics.  That makes it simpler to review because the expected behavior of each patch is more obvious, and it'll be easier to bisect any regressions.  As it is, I'm digging through a gigantic patch without any clue what changes are intentional.
> > > > > > To make the new code behave like the old one, we can add some code that restricts its functionality: only duplicate code until the first loop exit.
> > > > > > I do not find it productive to walk this road.
> > > > > > 
> > > > > > Please keep in mind that this is a rewrite of the existing algorithm: it is not a refactoring patch.
> > > > > > The original implementation of loop rotation cannot be moved in small steps to the new implementation.
> > > > > > A rhetorical question: when somebody comes up with a new pass, do you expect the new pass to be split up into small steps: here is the correctness analysis part, the heuristic, and code generation?  Well, no.  So why would it be different for this patch that throws away code that has too many restrictions and adds a new implementation of the pass?
> > > > > > 
> > > > > > There may be another road if you think that would be better: we could keep the current loop rotation as is, and add a new loop rotation pass in a different file, and then deprecate the existing pass once everybody is convinced that the new code is better.  Would that work for you?  That will not reduce the size of the current patch.
> > > > > Hi Eli,
> > > > > Could you explain how that would help bisect regressions. The current patch is one complete algorithm (besides legality and profitability check), so it is not possible to reduce the size of the patch. Making it similar to the previous loop-rotation can be done by adding flags, but the size of the patch would remain the same, so bisection of regression would still have to deal with the entire patch. Also, if you ignore the include-headers and the pass-setup infrastructure, this loop-rotation is less than 400 lines of code.
> > > > In terms of the regressions, I'm mostly worried about code-size/performance regressions where we don't have a clear idea about the cause except that "something changed in loop rotation".  Aggressively rotating loops isn't obviously beneficial in complicated cases, and I'm not convinced the heuristics in this patch make sense.  The old heuristics obviously aren't perfect, but they're well-tested.
> > > > 
> > > > Back to this particular line, it's not clear why you're cloning blocks after the first loop exit.  You can sort of blindly clone a small amount of code to eliminate a useless unconditional branch at the bottom of a loop, but you need something a bit more sophisticated if you're going to clone beyond that.
> > > > In terms of the regressions, I'm mostly worried about code-size/performance regressions where we don't have a clear idea
> > > > about the cause except >that "something changed in loop rotation.
> > > > Aggressively rotating loops isn't obviously beneficial in complicated cases, and I'm not convinced the >heuristics in this patch make sense. 
> > > > The  old heuristics obviously aren't perfect, but they're well-tested.
> > > If you could point out which heuristics are ambiguous I'll try to explain/improve them.
> > > 
> > > > Back to this particular line, it's not clear why you're cloning blocks after the first loop exit. You can sort of blindly clone a small amount of code > to eliminate a useless unconditional branch at the bottom of a loop, but you need something a bit more sophisticated if you're going to clone > beyond that.
> > > Sebastian explained the intent to clone beyond the first exiting block here in the mailing list. For some reason it did not appear here:
> > > http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20160718/375069.html
> > > 
> > The primary problem is that you're just blindly cloning code without any particular expectation that it will improve anything.  In certain situations, you might get lucky (like Sebastian's example), but that doesn't make it a good idea in general.
> > blindly cloning code without any particular expectation that it will improve anything.
> Isn't loop unrolling or even function inlining in the same bucket?
> 
> From an analysis point of view (scev, loop->niter, etc.) loop rotation has more implications than one can think of: just knowing that a loop runs at least once is a very critical information (that the compiler cannot compute without loop rotation) based on which other analyses can extract more precise information than without that assumption.
To some extent, loop unrolling and function inlining make sense even if the code can't be simplified: there's inherent overhead associated with function calls and looping.  To some extent, we try to measure how much code will be simplified: both inlining and loop unrolling try to measure the effects of constant folding and a few other optimizations.  To some extent, our inlining and unrolling thresholds are set a little more aggressively than we can really justify based on what the passes actually measure...

Just peeling the first iteration of every loop has real problems.  Loop rotation should be idempotent; if we run loop rotation N times, we don't want to rotate a loop N times; otherwise, it will interact badly with the pass pipeline.  Breaking up a loop makes optimization more complicated for other passes like vectorization.  And improving the runtime of the first iteration of a loop is basically useless for a loop with a large iteration count.

Given all of that, I think loop rotation needs to show that it's actually improving a loop before it clones code.

https://reviews.llvm.org/D22630