[llvm-dev] RFC: Extending LV to vectorize outerloops

Fri Sep 23 16:08:41 PDT 2016

> -----Original Message-----
> From: Renato Golin [mailto:renato.golin at linaro.org]
> Sent: Friday, September 23, 2016 02:04
> To: Zaks, Ayal <ayal.zaks at intel.com>
> Cc: LLVM Dev <llvm-dev at lists.llvm.org>
> Subject: Re: [llvm-dev] RFC: Extending LV to vectorize outerloops
> 
> On 21 September 2016 at 18:50, Zaks, Ayal via llvm-dev <llvm-
> dev at lists.llvm.org> wrote:
> > 1.  The resulting vectorized loop is inherently designed to contain a
> > *single* basic block. This poses an issue today, as innermost loops
> > may benefit from retaining some internal branches when vectorized. For
> > outerloops this clearly cannot hold – the resulting vectorized loop
> > will contain more than a single basic block as it will contain innerloops.
> 
> There's some space for if-conversion, but it is true that the loop vectoriser
> gives up too easily on anything but canonical loops. That's why we even
> have loop canonicalisation steps.
> 
> Though, the alternative would be to expand the solution space beyond
> anything we can do in low-polinomial time and space. I also don't want us
> having completely different strategies for simple inner loops,
> conditionalised inner loops, outerloops with one inner loop, more than one
> inner loop, multiply-nested loops, etc. That' would be a maintenance
> nightmare.
> 
> We haven't focused on outer loops for a reason. Doing one of those
> strategies is not that complex, but doing one without penalising the other
> (or all others), is. We need to at least have a plan for a few, even if we
> implement only at a time.
> 

Our first step is to make LV fully conscious of the control-flow that will result from vectorizing an (innermost) loop, allowing this control-flow to comprise of more than a single basic-block. A next step is to consider retaining branches rather than fully if-converting them. This could indeed entail a large solution space, but simple heuristics are usually employed to prune it effectively. When we reach the ability to handle outerloops the solution space grows further, and we fully recognize that early pruning of candidates will be important. As Hal noted, some heuristics can indeed be employed here too.

> 
> > 1.  Cost Step tries to predict what the vectorized loop will look like
> > and how much it will cost, independently of what the Transform Step
> > will eventually do. It’s hard to keep the two in sync, enforcing  the
> > comment placed at the beginning of Transform Step:
> > // Notice: any optimization or new instruction that go // into the
> > code below should be also be implemented in // the cost-model.
> 
> This is *particularly* bad. Your proposal was put forward during the first
> year of the vectorisation and was deemed too complex for our initial goals. I
> think now is the time to start thinking again about this.
> 
> 
> > 1.  Legal Step: check if loop can be legally vectorized, encode
> > constraints by initiating Vectorization Plan(s) if so.
> 
> Not much different than today. We'd need to inspect the evolution of all
> inner-loops to be able to infer anything about the outer-loop access, but
> that shouldn't be too complex.
> 

Hopefully, based on SCEV's analysis that considers nested loops.

> 
> > 2.  Plan Step:
> > a.  Build initial vectorization plans complying with all legal constraints.
> > b.  Optimize vectorization plans.
> 
> Are you proposing we at least generate some straw-man vectorised loop in
> IR form? Or is this still just an improved cost-model?
> 

This step is designed to generate a straw-man vectorized loop in "shadow" form, whose cost can be directly computed (3), and which facilitates generating the actual vector IR instructions (4).

> 
> > 3.  Cost Step: compute the relative cost of each plan. This step can
> > be applied repeatedly by Plan Optimize Step 2.b.
> 
> It's cheap to calculate the costs (mostly linear), but it's not cheap to come up
> with multiple plans on step 2 to feed a voting system.
> 

("cheap" - in terms of compile-time I take it)
Care must be taken therefore when deciding how many plans to draft. Step 2.b can consult the cost of partial plans being considered, for early plan pruning.

> 
> > 4.  Transform Step: materialize the best plan. Note that only this
> > step modifies the IR, as in the current Loop Vectorizer.
> 
> Modifies is a dubious term, I just want to make sure we're on the same page.
> 
> You can create as many basic blocks as you want, but unless you replace the
> old ones with the new ones, the "IR" isn't changed.
> 

This relates to the "shadow" instructions and basic-blocks discussed recently in http://lists.llvm.org/pipermail/llvm-dev/2016-August/104317.html.

> So, if I got ir right, you want to:
> 
> 1. check legal, pass.
> 2. create some structure with an actual implementation of the vectorisation
> (IR?) 3. min(cost[i]) 4. Implement (or IR replace) implementation "i".
> 
> If 2 is similar to 4, than this will be very costly. If not, then we have two

("very costly" - in terms of compile-time I take it; "cost" is overloaded here)

> *different* representations, and we go back to having a bad cost-model.
> 
> The former can be justified if you pass an option like "-faggressive-
> vectorizer". The latter can make our "diverging cost model" problem much
> worse.
> 

Computing cost before having an explicit vectorization plan is indeed more compile-time efficient than first constructing the plan, but inherently less accurate and harder to keep consistent. Such a "divergent" predictive cost could be used to prune candidates, when considered decisive, to save compile-time. Having both a "-faggressive-vectorizer" path that builds a plan and weighs it before implementing it, while keeping the current path that vectorizes directly w/o a plan, may be quite a maintenance burden. A "-faggressive-vectorizer" option could be used to control early pruning of candidates. Let's first see how much compile-time, and improved accuracy, are actually involved.

> 
> > The Vectorization Plan is a recipe describing the vectorization
> > candidate, including for instance its internal control-flow. It serves
> > for both estimating the cost and for performing the translation, and
> > facilitates dealing with multiple vectorization candidates. One can
> > compare with LLVM’s existing SLP vectorizer, where TSLP [3] adds step 2.b.
> 
> We sort of do that already, since we pick the costs of all VFs and choose the
> "best". By far not perfect, but I envisioned an improved cost-model to having
> more than just VF/UF and trying skew factors (with information provided by
> Polly) or fiddling with the induction sequence, before/after inner-loop
> hoisting, etc.
> 
> Since the cost model is really cheap, we can do many of them each time and
> even be very creative around them. But the more creative we are, the harder
> it'll be to get them close to the implementation phase.
> 

OTOH, computing the "non-divergent" costs of explicit plans could lead to the development of pruning techniques, or creative improvements to the cost-model. That is, when these costs turn out to be very high or very low.

Note that some decisions are currently taken w/o fully consulting cost, e.g.:
// FIXME: The interleaved load group with a huge gap could be even more
// expensive than scalar operations. Then we could ignore such group and
// use scalar operations instead.
and the hardwired NumberOfStoresToPredicate threshold.

> 
> > This proposal to extend the Loop Vectorizer to outer loops supports
> > and raises the importance of explicit vectorization beyond the current
> > capabilities of Clang and LLVM. Namely, from currently forcing the
> > vectorization of innermost loops according to prescribed width and/or
> > interleaving count, to supporting OpenMP’s “#pragma omp simd”
> > construct and associated clauses, including our earlier proposal for
> > vectorizing across function boundaries [2].
> 
> We already have all those pragmas, some Clang-specific, some borrowed
> from OpenMP SIMD.
> 
> As Hal said, extending the pragmas to indicate recommended trip counts,
> guaranteeing no-wrapping, etc. shouldn't be too hard.
> 
> cheers,
> -renato

Thanks!
Ayal.
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.