[llvm] r189281 - LoopVectorize: Implement partial loop unrolling when vectorization is not profitable.

Thu Aug 29 17:57:38 PDT 2013

On Wed, Aug 28, 2013 at 4:08 PM, Renato Golin <renato.golin at linaro.org> wrote:
> On 28 August 2013 21:58, Eric Christopher <echristo at gmail.com> wrote:
>>
>> Sure, it seems reasonable to me that this should be hoisted out to
>> some analysis and then stuck in the general loop unrolling pass. What
>> do you think?
>
>
> Hi Eric,
>
> The difference here, I assume, is that the LoopVectorizer has more
> information than the simple loop unrolling pass, and thus can know that a
> transformation is profitable.

Right. I was wondering if that information was useful to the general
partial unroller.

It seems like I'm the only one asking so... *shrug* :)

-eric

>
> We had similar discussions before, even in the Polly era: where does the
> analysis end and the implementation begin?
>
> There was some consensus that vectorizers should have three (not necessarily
> distinct or unique) passes:
>  1. The first pass, the annotation phase, where costs would be calculated,
> transformations would be validated and metadata would be written to loops,
> basic-blocks and, possibly, instructions. The Legalizer and the CostTable do
> that job, but don't annotate anything.
>  2. The second pass would then do the target-independent transformation,
> based on the previous annotation. This is more or less what the current
> vectorizers do, trusting that step 1 is sure that the transformation is
> legal and worthwhile.
>  3. A third pass would then do more target-specific changes, with sub-target
> information, like this very case, if you know your CPU is OOO. This is
> partially done by the cost tables and the TTI, but not explicitly.
>
> Because step 1 is not annotating, that information can't be used outside the
> vectorizers, and because the cost tables and the target transform info are
> holding target-specific information, you don't (yet) need a third stage.
>
> But things start to get grey with the example Nadav gave. That seems more
> profitable on OOO CPUs, but probably not others, and since non-OOO CPUs are
> still being designed today, that might be a target-specific approach on a
> target-agnostic area. Also, since there is no annotation, other passes
> cannot profit from the information that the vectorizer calculated, throwing
> away precious cycles or duplicating code into the vectorizer.
>
> So, I agree that we could do better, but we'll need some joint work on
> the vectorizer if we are to make it more generic while still maintaining its
> hard-earned performance boost on, at least, x86 and ARM.
>
> On the other hand, maybe the loop-unrolling pass should be merged into the
> loop vectorizer...
>
> cheers,
> --renato
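
For reference, the three-step split described in the quoted message (annotate,
then a target-independent decision, then a target-specific one) might look
roughly like the toy model below. This is only a sketch under stated
assumptions: LoopAnnotation, Target, Decision, annotate,
vectorizationProfitable and decide are hypothetical names invented for
illustration, not LLVM types or APIs, and the cost comparison is a deliberately
simplified stand-in for what the cost tables and TTI actually compute.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical record written by step 1 (not an LLVM type).
struct LoopAnnotation {
  bool legalToVectorize; // result of the legality check
  unsigned scalarCost;   // estimated cost of one scalar iteration
  unsigned vectorCost;   // estimated cost of one vector iteration
  unsigned vectorWidth;  // scalar iterations covered by one vector iteration
};

// Hypothetical sub-target description used only by step 3.
struct Target {
  bool outOfOrder;    // assumption: partial unrolling pays off only on OOO cores
  unsigned maxUnroll; // assumption: largest unroll factor the target wants
};

enum class Action { None, Vectorize, PartialUnroll };

struct Decision {
  Action action;
  unsigned factor; // vector width or unroll factor; 1 means "leave as is"
};

using AnnotationMap = std::unordered_map<std::string, LoopAnnotation>;

// Step 1: record the analysis result so that later passes can reuse it.
void annotate(AnnotationMap &annotations, const std::string &loop,
              LoopAnnotation a) {
  annotations[loop] = a;
}

// Step 2: target-independent check, based only on the stored annotation.
bool vectorizationProfitable(const LoopAnnotation &a) {
  assert(a.vectorWidth >= 1 && "vector width must be at least 1");
  return a.legalToVectorize && a.vectorCost < a.scalarCost * a.vectorWidth;
}

// Step 3: target-specific fallback when vectorization is not chosen.
Decision decide(const AnnotationMap &annotations, const std::string &loop,
                const Target &target) {
  auto it = annotations.find(loop);
  if (it == annotations.end())
    return {Action::None, 1}; // no annotation: nothing is known, do nothing
  if (vectorizationProfitable(it->second))
    return {Action::Vectorize, it->second.vectorWidth};
  if (target.outOfOrder && target.maxUnroll > 1)
    return {Action::PartialUnroll, target.maxUnroll};
  return {Action::None, 1};
}
```

Because decide() reads only the stored annotation, a separate unrolling pass
could in principle consult the same map instead of recomputing the costs, which
is the reuse the thread is asking about.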