[PATCH] Loop Rerolling Pass

Hal Finkel hfinkel at anl.gov
Wed Oct 16 13:22:23 PDT 2013


----- Original Message -----
> ----- Original Message -----
> > 
> > On Oct 15, 2013, at 12:09 PM, hfinkel at anl.gov wrote:
> > 
> > > Hi nadav, rengolin, atrick,
> > > 
> > > I've created a loop rerolling pass. The transformation aims to take
> > > loops like this:
> > > 
> > >  for (int i = 0; i < 3200; i += 5) {
> > >    a[i] += alpha * b[i];
> > >    a[i + 1] += alpha * b[i + 1];
> > >    a[i + 2] += alpha * b[i + 2];
> > >    a[i + 3] += alpha * b[i + 3];
> > >    a[i + 4] += alpha * b[i + 4];
> > >  }
> > > 
> > > and turn them into this:
> > > 
> > >  for (int i = 0; i < 3200; ++i) {
> > >    a[i] += alpha * b[i];
> > >  }
> > > 
> > > and loops like this:
> > > 
> > >  for (int i = 0; i < 500; ++i) {
> > >    x[3*i] = foo(0);
> > >    x[3*i+1] = foo(0);
> > >    x[3*i+2] = foo(0);
> > >  }
> > > 
> > > and turn them into this:
> > > 
> > >  for (int i = 0; i < 1500; ++i) {
> > >    x[i] = foo(0);
> > >  }
> > > 
> > > There are two motivations for this transformation:
> > > 
> > > 1. Code-size reduction (especially relevant, obviously, when
> > > compiling for code size).
> > > 
> > > 2. Providing greater choice to the loop vectorizer (and generic
> > > unroller) to choose the unrolling factor (and a better ability to
> > > vectorize). The loop vectorizer can take vector lengths and
> > > register pressure into account when choosing an unrolling factor,
> > > for example, and a pre-unrolled loop limits that choice. This is
> > > especially problematic if the manual unrolling was optimized for a
> > > machine different from the current target.
> > > 
> > > The current implementation is limited to single basic-block loops
> > > only. The rerolling recognition should work regardless of how the
> > > loop iterations are intermixed within the loop body (subject to
> > > dependency and side-effect constraints), but the significant
> > > restriction is that the order of the instructions in each
> > > iteration must be identical. This seems sufficient to capture all
> > > of my current use cases.
> > > 
> > > The transformation triggers very rarely on the test suite (which I
> > > think is good: programmers should be able to leave trivial
> > > unrolling to the compiler). When I insert this pass just prior to
> > > loop vectorization, and prior to SLP vectorization (so that we
> > > prefer to reroll over SLP vectorizing), it helps:
> > > 
> > > On an Intel Xeon E5430:
> > > MultiSource/Benchmarks/TSVC/LoopRerolling-flt: 36% speedup (loops
> > > s351 and s353 are rerolled, s353's performance regresses by 9%,
> > > but s351 exhibits a 76% speedup; all others are unchanged)
> > > MultiSource/Benchmarks/TSVC/LoopRerolling-dbl: 13% speedup (loops
> > > s351 and s353 are rerolled, s353's performance is essentially
> > > unchanged, but s351 exhibits a 38% speedup; all others are
> > > unchanged)
> > > FreeBench/distray/distray: No significant change
> > > 
> > > Please review.
> > 
> > Thanks Hal. This looks useful.
> > 
> > Superficially the code looks ok as a first implementation. I can’t
> > say I’ve reviewed it in depth. One question:
> > 
> > +      AU.addRequired<LoopInfo>();
> > +      // Note: We don't preserve LoopInfo because we might add a canonical
> > +      // induction variable where there was not one before.
> > 
> > I think adding a canonical IV is fine for any pass to do that needs
> > it. But can you explain how that invalidates LoopInfo? That doesn't
> > seem necessary.
> 
> This is a good point, thanks! The pass does not need to create a
> canonical induction variable, but does happen to create one in a lot
> of common cases (common among the uncommon cases in which we do
> anything at all). I think that what I'd like to do is add a method
> to LoopInfo so that I can add a canonical induction variable should
> I happen to create one. Then I can preserve LoopInfo and not cause
> any confusing downstream behavior.

Never mind, my comment was just wrong. I had thought that getCanonicalInductionVariable returned a cached result, but it does not (it searches for an appropriate PHI on every invocation). I can preserve LoopInfo as is.
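
For illustration only, here is a minimal standalone C sketch of the first example from the patch summary, written at the source level rather than on LLVM IR. It is not code from the patch. The function names (saxpy_unrolled, saxpy_rerolled, rerolling_preserves_result) and the test data are hypothetical. The rerolled loop's counter starts at 0 and increments by one, which is the shape LLVM treats as a canonical induction variable; the hand-unrolled loop steps by 5 and so has no such variable until it is rerolled.

```c
#include <stddef.h>

#define N 3200

/* Hand-unrolled by 5, as in the first example of the patch summary.
   The counter steps by 5, so it is not in canonical (start 0, step 1) form. */
void saxpy_unrolled(float *a, const float *b, float alpha) {
  for (int i = 0; i < N; i += 5) {
    a[i]     += alpha * b[i];
    a[i + 1] += alpha * b[i + 1];
    a[i + 2] += alpha * b[i + 2];
    a[i + 3] += alpha * b[i + 3];
    a[i + 4] += alpha * b[i + 4];
  }
}

/* Rerolled form: one statement per iteration, counter starts at 0 and
   increments by 1. */
void saxpy_rerolled(float *a, const float *b, float alpha) {
  for (int i = 0; i < N; ++i)
    a[i] += alpha * b[i];
}

/* Hypothetical check: returns 1 if both forms produce identical arrays.
   Inputs are small integers and alpha is 0.5, so every intermediate value
   is exactly representable in float and the comparison is exact. */
int rerolling_preserves_result(void) {
  static float a1[N], a2[N], b[N];
  for (int i = 0; i < N; ++i) {
    a1[i] = a2[i] = (float)i;
    b[i] = (float)(N - i);
  }
  saxpy_unrolled(a1, b, 0.5f);
  saxpy_rerolled(a2, b, 0.5f);
  for (int i = 0; i < N; ++i)
    if (a1[i] != a2[i])
      return 0;
  return 1;
}
```

Calling rerolling_preserves_result() from a small driver should return 1, since each element sees the same single update in both forms. The equivalence relies on the trip count (3200) being a multiple of the unroll factor (5).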

Thanks again,
Hal

> 
> Thanks again,
> Hal
> 
> > 
> > -Andy
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



