[llvm-dev] RFC: Extending LV to vectorize outerloops

Thu Sep 22 16:03:56 PDT 2016

On 21 September 2016 at 18:50, Zaks, Ayal via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> 1.  The resulting vectorized loop is inherently designed to contain a
> *single* basic block. This poses an issue today, as innermost loops may
> benefit from retaining some internal branches when vectorized. For
> outerloops this clearly cannot hold – the resulting vectorized loop will
> contain more than a single basic block as it will contain innerloops.

There's some space for if-conversion, but it is true that the loop
vectoriser gives up too easily on anything but canonical loops. That's
why we even have loop canonicalisation steps.

Though, the alternative would be to expand the solution space beyond
anything we can do in low-polinomial time and space. I also don't want
us having completely different strategies for simple inner loops,
conditionalised inner loops, outerloops with one inner loop, more than
one inner loop, multiply-nested loops, etc. That' would be a
maintenance nightmare.

We haven't focused on outer loops for a reason. Doing one of those
strategies is not that complex, but doing one without penalising the
other (or all others), is. We need to at least have a plan for a few,
even if we implement only at a time.

> 1.  Cost Step tries to predict what the vectorized loop will look like and
> how much it will cost, independently of what the Transform Step will
> eventually do. It’s hard to keep the two in sync, enforcing  the comment
> placed at the beginning of Transform Step:
> // Notice: any optimization or new instruction that go
> // into the code below should be also be implemented in
> // the cost-model.

This is *particularly* bad. Your proposal was put forward during the
first year of the vectorisation and was deemed too complex for our
initial goals. I think now is the time to start thinking again about
this.

> 1.  Legal Step: check if loop can be legally vectorized, encode constraints
> by initiating Vectorization Plan(s) if so.

Not much different than today. We'd need to inspect the evolution of
all inner-loops to be able to infer anything about the outer-loop
access, but that shouldn't be too complex.

> 2.  Plan Step:
> a.  Build initial vectorization plans complying with all legal constraints.
> b.  Optimize vectorization plans.

Are you proposing we at least generate some straw-man vectorised loop
in IR form? Or is this still just an improved cost-model?

> 3.  Cost Step: compute the relative cost of each plan. This step can be
> applied repeatedly by Plan Optimize Step 2.b.

It's cheap to calculate the costs (mostly linear), but it's not cheap
to come up with multiple plans on step 2 to feed a voting system.

> 4.  Transform Step: materialize the best plan. Note that only this step
> modifies the IR, as in the current Loop Vectorizer.

Modifies is a dubious term, I just want to make sure we're on the same page.

You can create as many basic blocks as you want, but unless you
replace the old ones with the new ones, the "IR" isn't changed.

So, if I got ir right, you want to:

1. check legal, pass.
2. create some structure with an actual implementation of the
vectorisation (IR?)
3. min(cost[i])
4. Implement (or IR replace) implementation "i".

If 2 is similar to 4, than this will be very costly. If not, then we
have two *different* representations, and we go back to having a bad
cost-model.

The former can be justified if you pass an option like
"-faggressive-vectorizer". The latter can make our "diverging cost
model" problem much worse.

> The Vectorization Plan is a recipe describing the vectorization candidate,
> including for instance its internal control-flow. It serves for both
> estimating the cost and for performing the translation, and facilitates
> dealing with multiple vectorization candidates. One can compare with LLVM’s
> existing SLP vectorizer, where TSLP [3] adds step 2.b.

We sort of do that already, since we pick the costs of all VFs and
choose the "best". By far not perfect, but I envisioned an improved
cost-model to having more than just VF/UF and trying skew factors
(with information provided by Polly) or fiddling with the induction
sequence, before/after inner-loop hoisting, etc.

Since the cost model is really cheap, we can do many of them each time
and even be very creative around them. But the more creative we are,
the harder it'll be to get them close to the implementation phase.

> This proposal to extend the Loop Vectorizer to
> outer loops supports and raises the importance of explicit vectorization
> beyond the current capabilities of Clang and LLVM. Namely, from currently
> forcing the vectorization of innermost loops according to prescribed width
> and/or interleaving count, to supporting OpenMP’s “#pragma omp simd”
> construct and associated clauses, including our earlier proposal for
> vectorizing across function boundaries [2].

We already have all those pragmas, some Clang-specific, some borrowed
from OpenMP SIMD.

As Hal said, extending the pragmas to indicate recommended trip
counts, guaranteeing no-wrapping, etc. shouldn't be too hard.

cheers,
-renato