[PATCH] Loop Rerolling Pass

Tue Oct 15 13:17:35 PDT 2013

Hi Hal, 

Thanks for working on this. The motivation for loop rerolling is clear, and I agree that it has the potential to reduce code size and increase performance. Have you considered using the SLP vectorizer to detect consecutive/parallel statements? You could construct the SLP tree and use this information for the rerolling transformation. My second question is: how many times is this triggered in the test suite?

Thanks,
Nadav

On Oct 15, 2013, at 12:09 PM, hfinkel at anl.gov wrote:

> Hi nadav, rengolin, atrick,
> 
> I've created a loop rerolling pass. The transformation aims to take loops like this:
> 
>  for (int i = 0; i < 3200; i += 5) {
>    a[i] += alpha * b[i];
>    a[i + 1] += alpha * b[i + 1];
>    a[i + 2] += alpha * b[i + 2];
>    a[i + 3] += alpha * b[i + 3];
>    a[i + 4] += alpha * b[i + 4];
>  }
> 
> and turn them into this:
> 
>  for (int i = 0; i < 3200; ++i) {
>    a[i] += alpha * b[i];
>  }
> 
> and loops like this:
> 
>  for (int i = 0; i < 500; ++i) {
>    x[3*i] = foo(0);
>    x[3*i+1] = foo(0);
>    x[3*i+2] = foo(0);
>  }
> 
> and turn them into this:
> 
>  for (int i = 0; i < 1500; ++i) {
>    x[i] = foo(0);
>  }
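
The two before/after pairs quoted above can be checked against each other at the source level. The following is only a minimal sketch, not part of the patch: foo() is a hypothetical stand-in that returns a running call count so that the order of calls is observable, and the array contents and alpha are chosen so that all float arithmetic is exact.

```c
enum { LEN = 3200, CALLS = 1500 };

/* First quoted loop: hand-unrolled by 5. */
static void axpy_unrolled(float *a, const float *b, float alpha) {
  for (int i = 0; i < LEN; i += 5) {
    a[i] += alpha * b[i];
    a[i + 1] += alpha * b[i + 1];
    a[i + 2] += alpha * b[i + 2];
    a[i + 3] += alpha * b[i + 3];
    a[i + 4] += alpha * b[i + 4];
  }
}

/* The rerolled form of the same loop. */
static void axpy_rolled(float *a, const float *b, float alpha) {
  for (int i = 0; i < LEN; ++i)
    a[i] += alpha * b[i];
}

/* Hypothetical foo(): returns how many times it has been called. */
static int foo_calls;
static int foo(int unused) { (void)unused; return ++foo_calls; }

/* Second quoted loop: three stores and three calls per iteration. */
static void fill_unrolled(int *x) {
  for (int i = 0; i < 500; ++i) {
    x[3 * i] = foo(0);
    x[3 * i + 1] = foo(0);
    x[3 * i + 2] = foo(0);
  }
}

/* The rerolled form: one store and one call per iteration. */
static void fill_rolled(int *x) {
  for (int i = 0; i < CALLS; ++i)
    x[i] = foo(0);
}

/* Returns 1 if both axpy forms leave identical arrays, 0 otherwise. */
int axpy_forms_agree(void) {
  static float a1[LEN], a2[LEN], b[LEN];
  for (int i = 0; i < LEN; ++i) {
    a1[i] = a2[i] = (float)i;   /* small integers: exact in float */
    b[i] = (float)(LEN - i);
  }
  axpy_unrolled(a1, b, 0.5f);
  axpy_rolled(a2, b, 0.5f);
  for (int i = 0; i < LEN; ++i)
    if (a1[i] != a2[i]) return 0;
  return 1;
}

/* Returns 1 if both fill forms store the same call counts in the same slots. */
int fill_forms_agree(void) {
  static int x1[CALLS], x2[CALLS];
  foo_calls = 0;
  fill_unrolled(x1);
  foo_calls = 0;
  fill_rolled(x2);
  for (int i = 0; i < CALLS; ++i)
    if (x1[i] != x2[i]) return 0;
  return 1;
}
```

Both checks should return 1: the rerolled loops touch the same elements in the same order, and the trip counts match because 3200 is a multiple of 5 and 500 * 3 = 1500.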
> 
> There are two motivations for this transformation:
> 
> 1. Code-size reduction (especially relevant, obviously, when compiling for code size).
> 
> 2. Giving the loop vectorizer (and the generic unroller) more freedom to choose the unrolling factor, and improving its ability to vectorize. The loop vectorizer can take vector lengths and register pressure into account when choosing an unrolling factor, for example, and a pre-unrolled loop limits that choice. This is especially problematic if the manual unrolling was tuned for a machine different from the current target.
> 
> The current implementation is limited to single basic-block loops only. The rerolling recognition should work regardless of how the loop iterations are intermixed within the loop body (subject to dependency and side-effect constraints), but the significant restriction is that the order of the instructions in each iteration must be identical. This seems sufficient to capture all of my current use cases.
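
As a source-level illustration of the restriction described above, here is a minimal sketch of a loop unrolled by 2 whose two iterations are intermixed (multiply, multiply, add, add) while each iteration on its own still performs its operations in the same order. Whether the pass accepts exactly this shape is an assumption based on the description; the pass works on LLVM IR, and the instruction order it sees depends on earlier passes. The function names and test values are made up for the example.

```c
enum { COUNT = 3200 };

/* Unrolled by 2 with the iterations interleaved:
   mul(i), mul(i+1), add(i), add(i+1).
   Each iteration still does "multiply, then add" in that order. */
static void axpy_interleaved(float *a, const float *b, float alpha) {
  for (int i = 0; i < COUNT; i += 2) {
    float t0 = alpha * b[i];
    float t1 = alpha * b[i + 1];
    a[i] += t0;
    a[i + 1] += t1;
  }
}

/* The form a reroller would be expected to produce. */
static void axpy_rerolled(float *a, const float *b, float alpha) {
  for (int i = 0; i < COUNT; ++i) {
    float t = alpha * b[i];
    a[i] += t;
  }
}

/* Returns 1 if both forms leave identical arrays, 0 otherwise. */
int interleaved_matches_rerolled(void) {
  static float a1[COUNT], a2[COUNT], b[COUNT];
  for (int i = 0; i < COUNT; ++i) {
    a1[i] = a2[i] = (float)i;       /* small integers: exact in float */
    b[i] = (float)(2 * (i % 7));    /* even values: alpha * b[i] is exact */
  }
  axpy_interleaved(a1, b, 0.5f);
  axpy_rerolled(a2, b, 0.5f);
  for (int i = 0; i < COUNT; ++i)
    if (a1[i] != a2[i]) return 0;
  return 1;
}
```

The interleaving is harmless here because iteration i+1 never reads anything that iteration i writes; that is the kind of dependency constraint the paragraph above refers to.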
> 
> The transformation triggers very rarely on the test suite (which I think is good; programmers should be able to leave trivial unrolling to the compiler). When I insert this pass just prior to loop vectorization, and prior to SLP vectorization (so that we prefer rerolling over SLP vectorizing), it helps:
> 
> On an Intel Xeon E5430:
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt: 36% speedup (loops s351 and s353 are rerolled; s353's performance regresses by 9%, but s351 exhibits a 76% speedup; all others are unchanged)
> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl: 13% speedup (loops s351 and s353 are rerolled; s353's performance is essentially unchanged, but s351 exhibits a 38% speedup; all others are unchanged)
> FreeBench/distray/distray: No significant change
> 
> Please review.
> 
> Thanks again,
> Hal
> 
> http://llvm-reviews.chandlerc.com/D1940
> 
> Files:
>  include/llvm-c/Transforms/Scalar.h
>  include/llvm/InitializePasses.h
>  include/llvm/LinkAllPasses.h
>  include/llvm/Transforms/Scalar.h
>  lib/Transforms/Scalar/CMakeLists.txt
>  lib/Transforms/Scalar/LoopRerollPass.cpp
>  lib/Transforms/Scalar/Scalar.cpp
>  test/Transforms/LoopReroll/basic.ll
> <D1940.1.patch>