[PATCH] Don't unroll loops in loop vectorization pass when VF is one.

Tue Apr 14 11:50:40 PDT 2015

----- Original Message -----
> From: "Wei Mi" <wmi at google.com>
> To: "James Molloy" <james at jamesmolloy.co.uk>
> Cc: reviews+D9007+public+aa136303c547bc31 at reviews.llvm.org, hfinkel at anl.gov, llvm-commits at cs.uiuc.edu
> Sent: Tuesday, April 14, 2015 1:29:59 PM
> Subject: Re: [PATCH] Don't unroll loops in loop vectorization pass when VF is one.
> 
> On Tue, Apr 14, 2015 at 8:24 AM, James Molloy
> <james at jamesmolloy.co.uk> wrote:
> > Hi Wei,
> >
> > The important difference between loopunroll and the loop vectoriser
> > unrolling is alias checks and interleaving.
> >
> > The normal unroller will concatenate iterations, which without good alias
> > analysis results in a schedule that cannot be reordered. The loop vectoriser
> > will use loopaccessanalysis to plant runtime pointer checks, allowing a much
> > better schedule.
> >
> > So I don't think your patch as-is makes sense, unless loopunrollruntime has
> > grown runtime ptr check support recently.
> >
> > Cheers,
> >
> > James
> 
> Hi James,
> 
> Thank you for letting me know the usage of loopunroll inside loop
> vectorizer when VF==1.
> 
> I think runtime alias check has cost, it can be beneficial for
> scheduling in some cases, but it can also introduce unnecessary
> computations having negative impact on performance in other cases.
> Loop unroll shouldn't do such kind of check blindly. Ideally
> scheduling should decide and do the transformation if it is needed.
> 
> Another point is that if vectorization is turned off, the runtime
> check will be gone. It doesn't make sense to depend on vectorization
> always being turned on.

This was a bug some time ago; I think it has been fixed now. The vectorizer will always consider unrolling, regardless of whether it is allowed to do any actual vectorization.

> 
> I didn't see performance regressions in spec2000 and our internal
> benchmarks after applying this patch on x86, but it is possible that
> is because apps are not performance sensitive to compiler scheduling
> since x86 is out of order. So maybe the patch at least makes sense for
> x86 for now?

Agreed, but you need to be careful here: the vectorizer's unrolling (interleaving) transformation gives much greater speedups on simpler cores with longer pipelines. x86 is much less sensitive to this, at least on the server-level cores (Atom, Silvermont, etc. might be different).

Doing this during scheduling sounds nice in theory, but making the decision in the scheduler might be even harder than it is here. The scheduler does not really know anything about loops, and does not make speculative scheduling decisions. For the scheduler to make a decision about inserting runtime checks, it would need both capabilities, and making speculative schedules to evaluate the need for runtime checks could get very (compile-time) expensive. In addition, you really want other optimizations to fire after the checks are inserted, which is not possible if you insert them very late in the pipeline.
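
To make the runtime checks discussed above concrete, here is a minimal source-level sketch of loop versioning behind a pointer-overlap check. It is only an illustration, not what the vectorizer actually emits: the function names (rangesDisjoint, addOneScalar, addOneVersioned) and the loop body are hypothetical, and the real transformation operates on IR.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper: true when [a, a+n) and [b, b+n) share no element.
// std::less gives a total order over pointers, even into different arrays.
static bool rangesDisjoint(const float *a, const float *b, std::size_t n) {
  std::less<const float *> before;
  return !before(a, b + n) || !before(b, a + n);
}

// Reference loop: dst[i] = src[i] + 1, one iteration at a time.
void addOneScalar(float *dst, const float *src, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    dst[i] = src[i] + 1.0f;
}

// Versioned loop: the runtime check selects the interleaved body only when
// moving loads above stores cannot change the result.
void addOneVersioned(float *dst, const float *src, std::size_t n) {
  if (!rangesDisjoint(dst, src, n)) {
    addOneScalar(dst, src, n); // possible overlap: keep the original order
    return;
  }
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    float s0 = src[i];     // both loads first...
    float s1 = src[i + 1];
    dst[i] = s0 + 1.0f;    // ...then both stores
    dst[i + 1] = s1 + 1.0f;
  }
  for (; i < n; ++i) // remainder iteration when n is odd
    dst[i] = src[i] + 1.0f;
}
```

The check is paid once per loop entry, which is the cost Wei is worried about; in exchange, the code after the check is visible to every later pass, which is the point about not inserting the checks late in the pipeline.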

All of this having been said, the interleaved unrolling should, generally speaking, put less pressure on the reorder buffer(s), and should be preferable to the concatenation unrolling done by the regular unroller. Furthermore, they should both fire if the interleaved unrolling still did not make the loop large enough. Why is this not happening?
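
As a rough source-level picture of the two unrolling styles compared above (a sketch under assumptions, not the output of either pass; the function names and the loop body are made up for illustration), unrolling by two looks like this:

```cpp
#include <cstddef>

// Concatenated unrolling by 2, in the style of the regular unroller:
// iteration i is completed (load, add, store) before iteration i+1 begins.
void unrollConcatenated(float *dst, const float *src, std::size_t n) {
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    dst[i] = src[i] + 1.0f;
    dst[i + 1] = src[i + 1] + 1.0f;
  }
  for (; i < n; ++i) // remainder iteration when n is odd
    dst[i] = src[i] + 1.0f;
}

// Interleaved unrolling by 2, in the style of the vectorizer with VF == 1:
// the loads of both iterations are issued before either store.
// This ordering is only equivalent to the original loop when dst and src
// do not overlap, which is why it needs alias information or a runtime check.
void unrollInterleaved(float *dst, const float *src, std::size_t n) {
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    float s0 = src[i];
    float s1 = src[i + 1];
    dst[i] = s0 + 1.0f;
    dst[i + 1] = s1 + 1.0f;
  }
  for (; i < n; ++i) // remainder iteration when n is odd
    dst[i] = src[i] + 1.0f;
}
```

On disjoint arrays the two produce identical results. If dst is src + 1, the concatenated form still matches the original loop, while the interleaved form reads stale values, so the independent loads that help simpler in-order cores are exactly what the pointer checks pay for.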

 -Hal

> 
> If you have testcase to show me the runtime alias check problem
> affecting scheduling on certain platform, it will be very
> appreciated.
> 
> Thanks,
> Wei.
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory