[llvm] r189281 - LoopVectorize: Implement partial loop unrolling when vectorization is not profitable.

Wed Aug 28 16:08:27 PDT 2013

On 28 August 2013 21:58, Eric Christopher <echristo at gmail.com> wrote:

> Sure, it seems reasonable to me that this should be hoisted out to
> some analysis and then stuck in the general loop unrolling pass. What
> do you think?
>

Hi Eric,

The difference here, I assume, is that the LoopVectorizer has more
information than the simple loop unrolling pass, and thus can know that a
transformation is profitable.

We had similar discussions before, even in the Polly era: where does the
analysis end and the implementation begin?

There was some consensus that vectorizers should have three (not
necessarily distinct or unique) passes:
 1. The first pass, the annotation phase, where costs would be calculated,
transformations would be validated and metadata would be written to loops,
basic-blocks and, possibly, instructions. The Legalizer and the CostTable
do that job, but don't annotate anything.
 2. The second pass would then do the target-independent transformation,
based on the previous annotation. This is more or less what the current
vectorizers do, trusting that step 1 has established that the
transformation is legal and worthwhile.
 3. A third pass would then do more target-specific changes, with
sub-target information, like this very case, if you know your CPU is
out-of-order (OOO). This is partially done by the cost tables and the TTI,
but not explicitly.
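
To make the three steps concrete, here is a toy model in Python. It is only
a sketch of the idea, not LLVM code: Target, Loop, annotate, transform,
target_tune, the metadata keys and the thresholds are all hypothetical names
and values made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    # Hypothetical stand-in for sub-target information.
    vector_width: int   # lanes per vector register; 1 means no SIMD
    out_of_order: bool


@dataclass
class Loop:
    trip_count: int
    iterations_independent: bool
    metadata: dict = field(default_factory=dict)


def annotate(loop):
    """Step 1: validate the transformation and write the result to the loop."""
    loop.metadata["legal"] = loop.iterations_independent
    # Hypothetical cost rule: very short loops are not worth transforming.
    loop.metadata["worthwhile"] = loop.trip_count >= 4
    return loop


def transform(loop):
    """Step 2: target-independent decision that trusts the step 1 annotation."""
    md = loop.metadata
    md["widen"] = md["legal"] and md["worthwhile"]
    return loop


def target_tune(loop, target):
    """Step 3: refine the decision with sub-target information."""
    md = loop.metadata
    if md["widen"] and target.vector_width > 1:
        md["vector.width"] = target.vector_width
    elif md["widen"] and target.out_of_order:
        # No usable SIMD, but an OOO core can still overlap unrolled iterations.
        md["unroll.count"] = 2  # hypothetical unroll factor
    return loop
```

Because the step 1 results live in loop.metadata rather than inside one
pass, any later pass could read them; in this sketch a call such as
target_tune(transform(annotate(Loop(100, True))), Target(1, True)) ends up
with an unroll count and no vector width.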

Because step 1 is not annotating, that information can't be used outside
the vectorizers, and because the cost tables and the target transform info
are holding target-specific information, you don't (yet) need a
third stage.

But things start to get grey with the example Nadav gave. That seems more
profitable on OOO CPUs, but probably not others, and since non-OOO CPUs are
still being designed today, that might be a target-specific approach in a
target-agnostic area. Also, since there is no annotation, other passes
cannot profit from the information that the vectorizer calculated, throwing
away precious cycles or duplicating code into the vectorizer.

So, I agree that we could do better, but we'll need some joint work on
the vectorizer if we are to make it more generic while still maintaining
its hard-earned performance boost on, at least, x86 and ARM.

On the other hand, maybe the loop-unrolling pass should be merged into the
loop vectorizer...

cheers,
--renato