[PATCH] Don't unroll loops in loop vectorization pass when VF is one.

Tue Apr 14 12:33:48 PDT 2015

On Tue, Apr 14, 2015 at 12:21 PM, Wei Mi <wmi at google.com> wrote:

> >> Another point is that if vectorization is turned off, the runtime
> >> check will be gone. It doesn't make sense to depend on vectorization
> >> always being turned on.
> >
> > This was a bug some time ago; I think it has been fixed now. The
> > vectorizer will always potentially unroll regardless of whether it is
> > allowed to do any actual vectorization.
> >
>
> If VF==1, unrolling will still be tried. But if -fno-vectorize is used,
> the loop vectorizer does neither vectorization nor unrolling. I
> verified this using the testcase in
> https://llvm.org/bugs/show_bug.cgi?id=23217

A side note: longer term, I think alias-based loop versioning should be
done as a separate enabler pass. The interleaving unroller, the vectorizer,
and the instruction scheduler are all passes that it enables or enhances.
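
To make the idea concrete, here is a minimal hand-written sketch of what
alias-based loop versioning amounts to at the source level. It is not LLVM
code: the function names (`no_overlap`, `add_one_versioned`) are hypothetical,
and the overlap test is only a stand-in for the memcheck block discussed in
this thread. The loop is cloned; the runtime check selects the copy that later
passes are free to interleave, vectorize, or reschedule, and otherwise falls
back to the original scalar loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when [a, a+n) and [b, b+n) do not overlap.
// It plays the role of the runtime "memcheck block".
static bool no_overlap(const float *a, const float *b, std::size_t n) {
  auto ua = reinterpret_cast<std::uintptr_t>(a);
  auto ub = reinterpret_cast<std::uintptr_t>(b);
  std::uintptr_t bytes = n * sizeof(float);
  return ua + bytes <= ub || ub + bytes <= ua;
}

// Computes dst[i] = src[i] + 1.0f for i in [0, n), in two versions.
void add_one_versioned(float *dst, const float *src, std::size_t n) {
  assert(n == 0 || (dst != nullptr && src != nullptr));
  if (no_overlap(dst, src, n)) {
    // Version proven free of aliasing: loads may be hoisted above stores.
    // Shown here interleaved by 2 by hand, followed by a remainder loop.
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
      float x0 = src[i];
      float x1 = src[i + 1];
      dst[i] = x0 + 1.0f;
      dst[i + 1] = x1 + 1.0f;
    }
    for (; i < n; ++i)
      dst[i] = src[i] + 1.0f;
  } else {
    // Fallback: the original loop, with the original memory order.
    for (std::size_t i = 0; i < n; ++i)
      dst[i] = src[i] + 1.0f;
  }
}
```

With versioning done once in its own pass, each later transformation could
reuse the no-alias copy instead of inserting its own check.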

 David

>
>
> >>
> >> I didn't see performance regressions in spec2000 and our internal
> >> benchmarks after applying this patch on x86, but that may be because
> >> the apps are not sensitive to compiler scheduling, since x86 is out
> >> of order. So maybe the patch at least makes sense for x86 for now?
> >
> > Agreed; you need to be careful here: the vectorizer's unrolling
> > (interleaving) transformation gives much greater speedups on simpler
> > cores with longer pipelines. X86 is much less sensitive to this, at
> > least the server-level cores (atom, silvermont, etc. might be different).
> >
> > Doing this during scheduling sounds nice in theory, but making the
> > decision in the scheduler might be even harder than it is here. The
> > scheduler does not really know anything about loops, and does not make
> > speculative scheduling decisions. For the scheduler to make a decision
> > about inserting runtime checks, it would need both capabilities, and
> > making speculative schedules to evaluate the need for runtime checks
> > could get very (compile-time) expensive. In addition, you really want
> > other optimizations to fire after the checks are inserted, which is not
> > possible if you insert them very late in the pipeline.
> >
> > All of this having been said, the interleaved unrolling should,
> > generally speaking, put less pressure on the reorder buffer(s), and
> > should be preferable to the concatenation unrolling done by the regular
> > unroller. Furthermore, they should both fire if the interleaved
> > unrolling still did not make the loop large enough. Why is this not
> > happening?
>
> It is happening (the interleaved unrolling and the regular unroller both
> fired), but the result is not efficient in terms of performance.
>
> After interleaved unrolling, the original loop becomes:
>     overflow check block + memcheck block + kernel loop unrolled by 2
>     + remainder loop.
> Then the regular loop unroller further converts it to:
>     overflow check block + memcheck block + prologue loop for kernel
>     loop + kernel loop unrolled by 4 + prologue loop for remainder loop
>     + remainder loop unrolled by 4.
>
> For x86, since the extra overflow check block and memcheck block have
> extra cost, I am inclined to remove the unrolling in vectorization on
> x86 and let the regular unroller do the whole job. For other
> architectures, it may be better to adjust the unrolling cost model in
> loop vectorization and let it finish the unrolling all at once, which
> removes the extra prologue loop costs. Does that make sense?
>
> Thanks,
> Wei.
>
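
For readers unfamiliar with the two unrolling styles contrasted in the quoted
discussion, the following hand-written sketch shows the difference on a simple
integer reduction. It only illustrates the shape of each transformation and is
not output from either LLVM pass; the function names `sum_concat` and
`sum_interleaved` are hypothetical. Integer arithmetic is used so that both
forms return the same value. Whether one form is faster depends on the target.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Concatenation-style unroll by 2: the body is copied back to back, so
// every addition still feeds the single dependence chain through `s`.
std::int64_t sum_concat(const std::int32_t *a, std::size_t n) {
  assert(n == 0 || a != nullptr);
  std::int64_t s = 0;
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    s += a[i];
    s += a[i + 1];
  }
  for (; i < n; ++i) // remainder loop
    s += a[i];
  return s;
}

// Interleaved unroll by 2: two independent accumulators, so the two
// additions in one iteration do not depend on each other.
std::int64_t sum_interleaved(const std::int32_t *a, std::size_t n) {
  assert(n == 0 || a != nullptr);
  std::int64_t s0 = 0, s1 = 0;
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    s0 += a[i];
    s1 += a[i + 1];
  }
  std::int64_t s = s0 + s1; // combine the partial sums
  for (; i < n; ++i)        // remainder loop
    s += a[i];
  return s;
}
```

Both functions keep a remainder loop for trip counts that are not a multiple
of the unroll factor, which corresponds to the "remainder loop" in the layouts
described above.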