[PATCH] Loop Rerolling Pass
Andrew Trick
atrick at apple.com
Wed Oct 16 12:25:44 PDT 2013
On Oct 15, 2013, at 12:09 PM, hfinkel at anl.gov wrote:
> Hi nadav, rengolin, atrick,
>
> I've created a loop rerolling pass. The transformation aims to take loops like this:
>
> for (int i = 0; i < 3200; i += 5) {
>   a[i] += alpha * b[i];
>   a[i + 1] += alpha * b[i + 1];
>   a[i + 2] += alpha * b[i + 2];
>   a[i + 3] += alpha * b[i + 3];
>   a[i + 4] += alpha * b[i + 4];
> }
>
> and turn them into this:
>
> for (int i = 0; i < 3200; ++i) {
>   a[i] += alpha * b[i];
> }
>
> and loops like this:
>
> for (int i = 0; i < 500; ++i) {
>   x[3*i] = foo(0);
>   x[3*i+1] = foo(0);
>   x[3*i+2] = foo(0);
> }
>
> and turn them into this:
>
> for (int i = 0; i < 1500; ++i) {
>   x[i] = foo(0);
> }
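
For anyone who wants to sanity-check the intended semantics of the first example, here is a minimal sketch in plain C that runs the hand-unrolled loop and the rerolled loop on the same data and compares the results. The function names, the test data, and the checker are my own assumptions for illustration; none of this is taken from the patch.

```c
#include <assert.h>
#include <string.h>

enum { N = 3200 };  /* trip count from the quoted example */

/* Hand-unrolled by 5, as in the first quoted loop. */
static void axpy_unrolled(float *a, const float *b, float alpha) {
    assert(N % 5 == 0);  /* the unrolled form has no remainder loop */
    for (int i = 0; i < N; i += 5) {
        a[i]     += alpha * b[i];
        a[i + 1] += alpha * b[i + 1];
        a[i + 2] += alpha * b[i + 2];
        a[i + 3] += alpha * b[i + 3];
        a[i + 4] += alpha * b[i + 4];
    }
}

/* The rerolled form that the transformation aims to produce. */
static void axpy_rerolled(float *a, const float *b, float alpha) {
    for (int i = 0; i < N; ++i)
        a[i] += alpha * b[i];
}

/* Hypothetical checker: returns 1 if both forms leave identical data in a[]. */
int rerolled_matches_unrolled(void) {
    static float a1[N], a2[N], b[N];
    for (int i = 0; i < N; ++i) {
        a1[i] = a2[i] = (float)i;  /* small integers, exactly representable */
        b[i] = (float)(N - i);
    }
    axpy_unrolled(a1, b, 2.0f);
    axpy_rerolled(a2, b, 2.0f);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

Calling rerolled_matches_unrolled() from a small main returns 1. The inputs are chosen so that every intermediate value is exact in single precision, so the comparison does not depend on how the compiler contracts the multiply-add.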
>
> There are two motivations for this transformation:
>
> 1. Code-size reduction (especially relevant, obviously, when compiling for code size).
>
> 2. Providing greater choice to the loop vectorizer (and generic unroller) to choose the unrolling factor (and a better ability to vectorize). The loop vectorizer can take vector lengths and register pressure into account when choosing an unrolling factor, for example, and a pre-unrolled loop limits that choice. This is especially problematic if the manual unrolling was optimized for a machine different from the current target.
>
> The current implementation is limited to single basic-block loops only. The rerolling recognition should work regardless of how the loop iterations are intermixed within the loop body (subject to dependency and side-effect constraints), but the significant restriction is that the order of the instructions in each iteration must be identical. This seems sufficient to capture all of my current use cases.
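
To make the "intermixed" case concrete, here is a source-level sketch of a loop in which the two unrolled iterations are interleaved, but each iteration still performs the same operations in the same order (multiply, then accumulate). The names and data are hypothetical, and I am assuming a and b do not alias (expressed with restrict), which is what makes the interleaving legal in the first place. Whether the pass actually rerolls this depends on the IR it sees; this only illustrates the shape being described.

```c
#include <assert.h>
#include <string.h>

enum { LEN = 3200 };  /* hypothetical array length, must be even */

/* Two iterations per trip, interleaved: both multiplies, then both updates. */
static void axpy_interleaved(float *restrict a, const float *restrict b,
                             float alpha, int n) {
    assert(n % 2 == 0);
    for (int i = 0; i < n; i += 2) {
        float p0 = alpha * b[i];      /* iteration i,     step 1 */
        float p1 = alpha * b[i + 1];  /* iteration i + 1, step 1 */
        a[i]     += p0;               /* iteration i,     step 2 */
        a[i + 1] += p1;               /* iteration i + 1, step 2 */
    }
}

/* The rolled equivalent: one iteration per trip, same step order. */
static void axpy_rolled(float *restrict a, const float *restrict b,
                        float alpha, int n) {
    for (int i = 0; i < n; ++i) {
        float p = alpha * b[i];       /* step 1 */
        a[i] += p;                    /* step 2 */
    }
}

/* Hypothetical checker: returns 1 if both forms produce identical results. */
int interleaved_matches_rolled(void) {
    static float a1[LEN], a2[LEN], b[LEN];
    for (int i = 0; i < LEN; ++i) {
        a1[i] = a2[i] = (float)i;     /* exact in single precision */
        b[i] = (float)(LEN - i);
    }
    axpy_interleaved(a1, b, 0.5f, LEN);
    axpy_rolled(a2, b, 0.5f, LEN);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

If a and b could overlap, moving the load of b[i + 1] ahead of the store to a[i] would not be safe, which is the kind of dependency constraint mentioned above.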
>
> The transformation triggers very rarely on the test suite (which I think is good; programmers should be able to leave trivial unrolling to the compiler). When I insert this pass just prior to loop vectorization, and prior to SLP vectorization (so that we prefer to reroll over SLP vectorizing), it helps:
>
> On an Intel Xeon E5430:
> MultiSource/Benchmarks/TSVC/LoopRerolling-flt: 36% speedup (loops s351 and s353 are rerolled, s353's performance regresses by 9%, but s351 exhibits a 76% speedup; all others are unchanged)
> MultiSource/Benchmarks/TSVC/LoopRerolling-dbl: 13% speedup (loops s351 and s353 are rerolled, s353's performance is essentially unchanged, but s351 exhibits a 38% speedup; all others are unchanged)
> FreeBench/distray/distray: No significant change
>
> Please review.
Thanks Hal. This looks useful.
Superficially the code looks ok as a first implementation. I can’t say I’ve reviewed it in depth. One question:
+ AU.addRequired<LoopInfo>();
+ // Note: We don't preserve LoopInfo because we might add a canonical
+ // induction variable where there was not one before.
I think it's fine for any pass that needs a canonical IV to add one. But can you explain how that invalidates LoopInfo? That doesn't seem necessary.
-Andy