[PATCH] Don't unroll loops in loop vectorization pass when VF is one.

Tue Apr 14 14:57:15 PDT 2015

----- Original Message -----
> From: "Wei Mi" <wmi at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: reviews+D9007+public+aa136303c547bc31 at reviews.llvm.org, llvm-commits at cs.uiuc.edu, "James Molloy"
> <james at jamesmolloy.co.uk>
> Sent: Tuesday, April 14, 2015 2:21:44 PM
> Subject: Re: [PATCH] Don't unroll loops in loop vectorization pass when VF is one.
> 
> >> Another point is that if vectorization is turned off, the runtime
> >> check will be gone. It doesn't make sense to depend on
> >> vectorization
> >> always being turned on.
> >
> > This was a bug some time ago; I think it has been fixed now. The
> > vectorizer will always potentially unroll regardless of whether it
> > is allowed to do any actual vectorization.
> >
> 
> If VF==1, unrolling will still be tried. But if -fno-vectorize is
> used, neither vectorization nor unrolling will be done in the loop
> vectorizer. I verified this using the testcase in
> https://llvm.org/bugs/show_bug.cgi?id=23217

Interesting. We create the loop vectorization pass in PassManagerBuilder like this:

    MPM.add(createLoopVectorizePass(DisableUnrollLoops, LoopVectorize));

specifically, as I recall, so that this wouldn't happen.

> 
> >>
> >> I didn't see performance regressions in spec2000 and our internal
> >> benchmarks after applying this patch on x86, but it is possible
> >> that
> >> is because apps are not performance sensitive to compiler
> >> scheduling
> >> since x86 is out of order. So maybe the patch at least makes sense
> >> for
> >> x86 for now?
> >
> > Agreed; you need to be careful here, the vectorizer's unrolling
> > (interleaving) transformation gives much greater speedups on
> > simpler cores with longer pipelines. X86 is much less sensitive to
> > this, at least the server-level cores (Atom, Silvermont, etc.
> > might be different).
> >
> > Doing this during scheduling sounds nice in theory, but making the
> > decision in the scheduler might be even harder than it is here.
> > The scheduler does not really know anything about loops, and does
> > not make speculative scheduling decisions. For the scheduler to
> > make a decision about inserting runtime checks, it would need both
> > capabilities, and making speculative schedules to evaluate the
> > need for runtime checks could get very (compile-time) expensive.
> > In addition, you really want other optimizations to fire after the
> > checks are inserted, which is not possible if you insert them very
> > late in the pipeline.
> >
> > All of this having been said, the interleaved unrolling should,
> > generally speaking, put less pressure on the reorder buffer(s),
> > and should be preferable to the concatenation unrolling done by
> > the regular unroller. Furthermore, they should both fire if the
> > interleaved unrolling still did not make the loop large enough.
> > Why is this not happening?
> 
> It is happening (the interleaved unrolling and the regular unroller
> both fired), but the result is not efficient.
> 
> After interleaved unrolling, the original loop becomes:
>     overflow check block + memcheck block + kernel loop unrolled by 2
>     + remainder loop.
> The regular loop unroller then converts it further to:
>     overflow check block + memcheck block + prologue loop for kernel
>     loop + kernel loop unrolled by 4 + prologue loop for remainder loop
>     + remainder loop unrolled by 4.

Thanks for elaborating here; this is clearly not optimal.
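
As a rough illustration of the first shape described above (memcheck block + kernel loop unrolled by 2 + remainder loop), here is a hand-written C++ sketch for VF=1 with an interleave count of 2. It is not actual vectorizer output: the kernel a[i] += b[i] and the names add_scalar and add_interleaved2 are made up for this example, and the overflow check block is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference loop (hypothetical example kernel): a[i] += b[i].
void add_scalar(int *a, const int *b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    a[i] += b[i];
}

// Hand-written picture of the same loop after interleaving by 2 with VF=1.
// Assumption: the trip-count "overflow check block" is omitted here.
void add_interleaved2(int *a, const int *b, std::size_t n) {
  assert(n == 0 || (a != nullptr && b != nullptr));

  // "memcheck block": the byte ranges [a, a+n) and [b, b+n) must not
  // overlap, otherwise hoisting both loads above the stores is unsafe.
  const std::uintptr_t a0 = reinterpret_cast<std::uintptr_t>(a);
  const std::uintptr_t b0 = reinterpret_cast<std::uintptr_t>(b);
  const std::uintptr_t bytes = n * sizeof(int);
  const bool no_overlap = (a0 + bytes <= b0) || (b0 + bytes <= a0);

  std::size_t i = 0;
  if (no_overlap) {
    // Kernel loop unrolled by 2: the two iterations are interleaved, so
    // all loads are issued before any store.
    for (; i + 1 < n; i += 2) {
      const int x0 = a[i], x1 = a[i + 1];
      const int y0 = b[i], y1 = b[i + 1];
      a[i] = x0 + y0;
      a[i + 1] = x1 + y1;
    }
  }

  // Remainder loop; it also serves as the scalar fallback when the
  // memcheck fails (i is still 0 in that case).
  for (; i < n; ++i)
    a[i] += b[i];
}
```

The memcheck is what the regular (concatenation) unroller does not need: it keeps the original iteration order, so it never has to prove that the arrays are disjoint.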

> 
> For x86, since the overflow check block and memcheck block have
> extra cost, I am inclined to remove the unrolling in vectorization on
> x86, and let the regular unroller do all the work. For other
> architectures, it may be better to adjust the unrolling cost model in
> loop vectorization and let it finish the unroll job all at once, to
> remove the extra prologue loop costs. Does that make sense?

It makes sense. How does this compare, performance-wise, to other options? For example, what happens if you force the vectorizer to unroll by 4x? The main difference is the cost of the memory-overlap checking, right? I agree that avoiding the memory checks makes sense when the expected benefit from them is low.
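
One possible way to run the 4x experiment on a single loop, sketched under the assumption that the test is built with clang: the loop pragma below asks for a vector width of 1 and an interleave count of 4. The kernel and the name scale_add are hypothetical, and the pragma is only a hint; other compilers will ignore it (possibly with an unknown-pragma warning).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel used only to illustrate the pragma.
void scale_add(float *out, const float *in, float k, std::size_t n) {
  assert(n == 0 || (out != nullptr && in != nullptr));
  // Request no widening (VF=1) but interleaving by 4 from the loop
  // vectorizer. Whether runtime memchecks are emitted is still up to
  // the compiler.
#pragma clang loop vectorize_width(1) interleave_count(4)
  for (std::size_t i = 0; i < n; ++i)
    out[i] += k * in[i];
}
```

Comparing this against the same loop left to the regular unroller should isolate the cost of the memory-overlap checks from the benefit of interleaving.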

 -Hal

> 
> Thanks,
> Wei.
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory