[PATCH] Loop Rerolling Pass

Wed Oct 16 12:25:44 PDT 2013

On Oct 15, 2013, at 12:09 PM, hfinkel at anl.gov wrote:

> Hi nadav, rengolin, atrick,
> 
> I've created a loop rerolling pass. The transformation aims to take loops like this:
> 
>  for (int i = 0; i < 3200; i += 5) {
>    a[i] += alpha * b[i];
>    a[i + 1] += alpha * b[i + 1];
>    a[i + 2] += alpha * b[i + 2];
>    a[i + 3] += alpha * b[i + 3];
>    a[i + 4] += alpha * b[i + 4];
>  }
> 
> and turn them into this:
> 
>  for (int i = 0; i < 3200; ++i) {
>    a[i] += alpha * b[i];
>  }
> 
> and loops like this:
> 
>  for (int i = 0; i < 500; ++i) {
>    x[3*i] = foo(0);
>    x[3*i+1] = foo(0);
>    x[3*i+2] = foo(0);
>  }
> 
> and turn them into this:
> 
>  for (int i = 0; i < 1500; ++i) {
>    x[i] = foo(0);
>  }
> 
> There are two motivations for this transformation:
> 
> 1. Code-size reduction (especially relevant, obviously, when compiling for code size).
> 
> 2. Giving the loop vectorizer (and generic unroller) more freedom to choose the unrolling factor (and a better ability to vectorize). The loop vectorizer can take vector lengths and register pressure into account when choosing an unrolling factor, for example, and a pre-unrolled loop limits that choice. This is especially problematic if the manual unrolling was optimized for a machine different from the current target.
> 
> The current implementation is limited to single basic-block loops only. The rerolling recognition should work regardless of how the loop iterations are intermixed within the loop body (subject to dependency and side-effect constraints), but the significant restriction is that the order of the instructions in each iteration must be identical. This seems sufficient to capture all of my current use cases.
> 
> The transformation triggers very rarely on the test suite (which I think is good: programmers should be able to leave trivial unrolling to the compiler). When I insert this pass just prior to loop vectorization, and prior to SLP vectorization (so that we prefer to reroll over SLP vectorizing), it helps:
> 
> On an Intel Xeon E5430:
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt: 36% speedup (loops s351 and s353 are rerolled, s353's performance regresses by 9%, but s351 exhibits a 76% speedup; all others are unchanged)
> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl: 13% speedup (loops s351 and s353 are rerolled, s353's performance is essentially unchanged, but s351 exhibits a 38% speedup; all others are unchanged)
> FreeBench/distray/distray: No significant change
> 
> Please review.

Thanks Hal. This looks useful.

Superficially the code looks ok as a first implementation. I can’t say I’ve reviewed it in depth. One question:

+      AU.addRequired<LoopInfo>();
+      // Note: We don't preserve LoopInfo because we might add a canonical
+      // induction variable where there was not one before.

I think it's fine for any pass that needs a canonical IV to add one. But can you explain how that invalidates LoopInfo? That doesn't seem necessary.
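
For anyone following along, the source-level equivalence behind the two examples in the patch description can be checked with a small standalone C++ program. This is only a sketch and not code from the patch: foo() here is a hypothetical stand-in that counts its calls, the input values are arbitrary small integers chosen so that the float arithmetic is exact, and check_equivalence() is a name made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for foo() from the patch description. It counts its
// calls so the number of side effects can be compared between loop forms.
static int foo_calls = 0;
static int foo(int v) {
  ++foo_calls;
  return v + 7;
}

// First example, manually unrolled by 5 (as in the patch description).
static void axpy_unrolled(std::vector<float> &a, const std::vector<float> &b,
                          float alpha) {
  for (int i = 0; i < 3200; i += 5) {
    a[i] += alpha * b[i];
    a[i + 1] += alpha * b[i + 1];
    a[i + 2] += alpha * b[i + 2];
    a[i + 3] += alpha * b[i + 3];
    a[i + 4] += alpha * b[i + 4];
  }
}

// First example, rerolled form.
static void axpy_rerolled(std::vector<float> &a, const std::vector<float> &b,
                          float alpha) {
  for (int i = 0; i < 3200; ++i)
    a[i] += alpha * b[i];
}

// Second example, unrolled by 3 with a scaled index.
static void fill_unrolled(std::vector<int> &x) {
  for (int i = 0; i < 500; ++i) {
    x[3 * i] = foo(0);
    x[3 * i + 1] = foo(0);
    x[3 * i + 2] = foo(0);
  }
}

// Second example, rerolled form: the trip count grows from 500 to 1500.
static void fill_rerolled(std::vector<int> &x) {
  for (int i = 0; i < 1500; ++i)
    x[i] = foo(0);
}

// Runs both forms of each example and compares results and call counts.
bool check_equivalence() {
  std::vector<float> b(3200), a1(3200), a2(3200);
  for (int i = 0; i < 3200; ++i) {
    b[i] = static_cast<float>(i % 16); // small integers keep the math exact
    a1[i] = a2[i] = 1.0f;
  }
  axpy_unrolled(a1, b, 2.0f);
  axpy_rerolled(a2, b, 2.0f);

  std::vector<int> x1(1500, -1), x2(1500, -1);
  foo_calls = 0;
  fill_unrolled(x1);
  const int calls_unrolled = foo_calls;
  foo_calls = 0;
  fill_rerolled(x2);
  const int calls_rerolled = foo_calls;

  return a1 == a2 && x1 == x2 && calls_unrolled == calls_rerolled;
}
```

Calling check_equivalence() from a main() is expected to return true. It only demonstrates that the unrolled and rerolled forms agree at the source level; it says nothing about how the pass recognizes the pattern in IR.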

-Andy