[llvm-commits] [PATCH] Loop unrolling for run-time trip counts

Tue Nov 15 21:58:58 PST 2011

On Nov 9, 2011, at 2:41 PM, Brendon Cahoon wrote:
> Hi Andy,
>  
> Thanks for the reply.  Much appreciated.
>  
> I agree that the optimization may not help every target, and can cause some performance problems.  Target specific information would be a huge help.  Since our target is a VLIW, we have found the pass to be useful on several SPEC2K and EEMBC benchmarks.  We also see some improvements on ARM for some of the programs in those benchmark suites.  But, we have seen some performance regressions especially on ARM due to an increase in register spills, which we’re investigating.   I haven’t done any significant performance analysis for x86 with the patch though.
>  
> I *think* I’m updating the preserved info correctly.  I believe the dominator information gets recomputed by the existing loop unrolling code.  I will run tests with the –verify-dom-info and –verify-loop-info flags though.  I hadn’t tried that.  I’ll also try your suggestions about disabling the normal unrolling code for compile-time loops, and I can run some more tests when changing loop count.
>  
> Thanks again,
> -- Brendon
>  
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum
>  
> From: Andrew Trick [mailto:atrick at apple.com] 
> Sent: Wednesday, November 09, 2011 12:36 PM
> To: Brendon Cahoon
> Subject: Re: [llvm-commits] [PATCH] Loop unrolling for run-time trip counts
>  
> On Nov 6, 2011, at 3:22 PM, Brendon Cahoon wrote:
> Hi,
>  
> This patch contains code to unroll loops that contain a run-time trip count.  It extends the existing code, in the LoopUnroll and LoopUnrollPass classes, that unrolls loops with compile-time trip counts.
>  
> The ability to unroll loops with run-time trip counts using this patch is turned off by default.  To enable the transformation, I added an option, –unroll-runtime.  It’s probably best to keep it disabled for now since some programs may degrade in performance with the option enabled.  We’re hoping that some more tuning will help minimize any performance regressions.  Of course, we do see performance improvements as well.
>  
> I tested llvm with the patch and it passes ‘make check-all’ and there are no regressions in test-suite when using the ‘simple’ test target.  I have also enabled the option and tested using test-suite, and there are no regressions with the option enabled.    With the option enabled, there will be some errors with the tests when running ‘make check-all’ for two reasons.  The existing tests assume that there is no loop unrolling for run-time trip counts, and my implementation requires that loop simplify is run afterward (I am planning on fixing this issue).  I’ve also tested the code on a different test suite for ARM and our soon to be added backend.
>  
> This implementation works by using the existing loop unrolling code to actually unroll the loop by some unroll factor (the default is 8 iterations).   This patch generates code prior to the loop to compute the number of extra iterations to execute before entering the unrolled loop.  The number of extra iterations is the run-time trip count modulo the unroll factor.  We generate code to check the number of extra iterations and branch to the extra copies of the loop body that  execute the extra iterations before entering the unrolled loop.  We generate an if-then-else sequence, which may get converted to a switch statement by LLVM.  The patch generates ‘unroll factor – 1’ copies of the loop body prior to the loop to execute these extra iterations. 
>  
> This implementation only allow unroll factors that are a power to 2 to reduce the cost of computing the number of extra iterations.  There are other limitations in implementation including only allowing loops with a single exit that occurs in the latch block.  There is certainly some room for improving the code, but it works and improves performance on some actual benchmarks.
>  
> The patch includes changes to the following files:
>    lib/Transforms/Scalar/LoopUnrollPass.cpp
>    lib/Transfoms/Utils/LoopUnroll.cpp
>    include/llvm/Transforms/Utils/UnrollLoop.h
>  
> And new files in the test directory: test/Transforms/LoopUnroll/runtime-loop[1-3].ll
>  
> I appreciate any comments, questions, etc. on the code.
>  
> Thanks,
> -- Brendon Cahoon
> --
> Qualcomm Innovation Center, Inc is a member of Code Aurora Forum
>  
>  
> Brendon,
>  
> Thanks for the patch. I see no problem adding this as an experimental framework given the testing you've done and unit tests you've provided. I suspect the optimization will be useful on some targets if branch profiles are available, logic is added to detect optimizations that can be exploited by unrolling, and/or target-specific heuristics can be queried to determine which loops may benefit.
>  
> Before I dig into code review, let me ask a high level question. Creating pre/post loops complicates the CFG. Are you sure you're updating the preserved passes properly, including DominatorTree and LoopInfo. Please run the test suite with forced runtime unrolling plus -verify-dom-info and -verify-loop-info. To further stress your framework, make sure constant trip unrolling is disabled in your runs, so your unroller processes these too. I would also try with a small unroll (e.g. 2) count to stress the loop body and a large count to stress prolog/epilog.

For the purpose of code review, I'll ignore the issue of
unroll-runtime profitability, which is obviously left for later. I
will say that this is only half the framework because to achieve
data-alignment we would need to emit an epilog as well. Also, I
misread your original message, and assumed you were adding a prolog
*loop*. That would be nice for targets that care about code
size. There are cases when unrolling the main loop can be done without
increasing the size of the loop body. Your design would not always be
well suited for those cases. On the other hand, your design is a much
simpler CFG transform. We have less SSA/LoopInfo updating to worry
about this way.

Never mind my earlier comment about DominatorTree. I forgot we always
recompute it. You're welcome to fix that :)

It does seem like your transform should preserve LoopSimplify. I'm
guessing you just need a unique loop exit block--just split the
exit. Can you please do that?

In general, your implementation looks quite nice.

I am concerned that you're doubling the size of LoopUnroll.cpp for a
feature that most people won't need to look at. I'd be ok with you
creating a new .cpp file. Or at least make it blatantly obvious that
all the new code is specific to runtime-unroll. For example, you can
add function-level comments like "Support for -runtime-unroll", and
code sections like:
//===----------------------------------------------------------------------===//
// Implementation of -runtime-unroll.
//===----------------------------------------------------------------------===//
<your code>
//===----------------------------------------------------------------------===//
// Driver for loop unrolling transformations.
//===----------------------------------------------------------------------===//
...

Formatting: You have two 80-col violations, but otherwise looks great.

Typo:
+ // ... loop
+ // loop prologue

---
In EmitRuntimePrologueFixup:

+///    extraiters = tripcount % loopfactor
+///    if (extraiters == 0) jump Loop:
+///    if (extraiters == loopfactor) jump L1
+///    if (extraiters == loopfactor-1) jump L2

You may want to comment that you expect a later pass to convert this
to a switch.

Before you erase PH->terminator, maybe you should assert that it has one
successor pointing to PEnd.

+ std::vector<BasicBlock*> NewBlocks(LoopBlocks.size());

No need to initialize the vector, then immediately clear it. You can use
NewBlocks.reserve(LoopBlocks.size()).

+  for (unsigned i = Count-1; i > 0; --i) {

This is a large, complex loop. To be more self-documenting, use a
meaningful variable like "leftOverIters". Then it will be clear why
you're counting down, and the nested loop won't hide the variable.

+    if (i != Count-1) {

My personal preference, I'd like to see the i == Count-1 case first
since that's what happens first.

---
In CloneLoopBlocks:

+    LPM->cloneBasicBlockSimpleAnalysis(LoopBlocks[j], NewBB, L);

I'm not sure this makes sense given that the new block is in the
parent loop. It may happen to work for the only user, LICM. Since
you're not doing anything to enforce safety in the API, I would add
comments in your code, in the API declaration, and in LICM that 'L' is
the "From" block's loop, not necessarily the "To" block's loop.

+       if (InsertTop != 0) {

Why isn't InsertTop set to PH for the first copy?

+    // Update our running map of newest clones
+    for (ValueToValueMapTy::iterator VI = VMap.begin(), VE = VMap.end();

If possible, avoid copying the map. For example, call LVMap.swap(VMap)
later, after you're done with the current loop iteration, before
RemapInstrucion. You might need more explicit handling of header phis
and latch terminators, but it's probably worth it both in efficiency
and code clarity. I realize the current unroller does this, but it
think the copying wasn't removed only because it does membership
checks on LastValueMap, which you don't need.

In runtime-loop3.ll please use FileCheck instead of grep. We're trying
to minimize dependence on external tools in the unit tests.

-Andy
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20111115/2d443a69/attachment.html>