[llvm-dev] RFC: Extending LV to vectorize outerloops

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Thu Sep 22 15:35:04 PDT 2016


----- Original Message -----

> From: "Ayal via llvm-dev Zaks" <llvm-dev at lists.llvm.org>
> To: "LLVM Dev" <llvm-dev at lists.llvm.org>
> Sent: Wednesday, September 21, 2016 12:50:34 PM
> Subject: [llvm-dev] RFC: Extending LV to vectorize outerloops

> Proposal for extending the Loop Vectorizer to handle Outer Loops
> ================================================================
> Goal:
> -----
> We propose to extend the innermost Loop Vectorizer to also handle
> outerloops (cf.[1]). Our aim is to best leverage the efforts already
> invested in the existing innermost Loop Vectorizer rather than
> introduce a separate pass dedicated to outerloop vectorization.
Good; I think it is important that we aim for incremental progress here, taking advantage of the existing infrastructure where reasonable. 

> This proposal will support explicit vector programming of loops and
> functions [2]. It also facilitates evaluating multiple vectorization
> candidates, including both outer and innermost loops, to choose the
> most profitable ones.

> Background:
> -----------
> The current Loop Vectorizer is confined to handle only innermost
> loops. In order to extend it to also handle outerloops the following
> design decisions need to be reworked:
> 1. The resulting vectorized loop is inherently designed to contain a
> * single * basic block. This poses an issue today, as innermost
> loops may benefit from retaining some internal branches when
> vectorized. For outerloops this clearly cannot hold – the resulting
> vectorized loop will contain more than a single basic block as it
> will contain innerloops.
> 2. There is inherently a single vectorization candidate with a single
> dimension of optimization – namely the Vectorization Factor and/or
> Unrolling Factor of the innermost loop. When dealing with outerloops
> it is important to evaluate multiple vectorization candidates –
> including both outer and inner loops, and their respective VF/UF’s.

> The current Loop Vectorizer works in the following three main steps:
> 1. Legal Step: check if loop can be legally vectorized; encode
> constraints and artifacts if so.
> 2. Cost Step: compute the relative cost of vectorizing it along
> possible vectorization and unroll factors.
> 3. Transform Step: vectorize the loop according to best vectorization
> and unroll factors.
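
To make the current flow concrete, here is a minimal standalone sketch of these three steps. All class and function names below are hypothetical stand-ins for illustration, not the actual LoopVectorize.cpp API:

// Standalone illustration only: hypothetical names, not the real
// LoopVectorize.cpp classes.
#include <cstdio>

struct Loop {};                                   // placeholder for llvm::Loop
struct LegalityResult { bool CanVectorize; };
struct VectorizationChoice { unsigned VF; unsigned UF; };

LegalityResult checkLegal(const Loop &) {
  // 1. Legal Step: decide whether the loop may be vectorized and record
  //    constraints/artifacts (uniforms, interleave groups, ...).
  return {true};
}

VectorizationChoice pickBestFactors(const Loop &) {
  // 2. Cost Step: estimate the relative cost of candidate VF/UF pairs and
  //    return the most profitable combination.
  return {4, 2};
}

void transform(Loop &, VectorizationChoice C) {
  // 3. Transform Step: emit the vectorized loop for the chosen VF/UF; only
  //    this step touches the IR.
  std::printf("vectorizing with VF=%u UF=%u\n", C.VF, C.UF);
}

int main() {
  Loop L;
  if (checkLegal(L).CanVectorize)
    transform(L, pickBestFactors(L));
}

Note that pickBestFactors and transform are written independently of one another here; keeping two such independent pieces of code in agreement is exactly the synchronization burden described next.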

> This design has some implications:
> 1. Cost Step tries to predict what the vectorized loop will look like
> and how much it will cost, independently of what the Transform Step
> will eventually do. It’s hard to keep the two in sync, hence the
> comment placed at the beginning of the Transform Step:
> // Notice: any optimization or new instruction that go
> // into the code below should be also be implemented in
> // the cost-model.
> 2. Legal Step does more than check for vectorizability; e.g., it
> records auxiliary artifacts such as collectLoopUniforms() and
> InterleaveInfo.
> 3. Transform Step first populates the single basic block of the
> vectorized loop and later revisits scalarized instructions to
> predicate them one by one, as needed.

> Proposal: introduce the Vectorization Plan as an explicit model of a
> vectorization candidate and update the overall flow:
> 1. Legal Step: check if loop can be legally vectorized, encode
> constraints by initiating Vectorization Plan(s) if so.
> 2. Plan Step:
> a. Build initial vectorization plans complying with all legal
> constraints.
> b. Optimize vectorization plans.
> 3. Cost Step: compute the relative cost of each plan. This step can
> be applied repeatedly by Plan Optimize Step 2.b.
> 4. Transform Step: materialize the best plan. Note that only this
> step modifies the IR, as in the current Loop Vectorizer.
This is also important: We should not speculatively modify the IR. 
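
As a rough illustration of the proposed flow, the sketch below builds several candidate plans, optimizes and costs each one, and only materializes the winner. The VPlan type and every other name here are illustrative placeholders, not the eventual in-tree representation:

// Hypothetical sketch of the proposed flow; not actual LLVM code.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Loop {};                                // placeholder for llvm::Loop
struct VPlan { unsigned VF; unsigned UF; };    // one vectorization candidate

bool checkLegal(const Loop &) { return true; }           // 1. Legal Step

std::vector<VPlan> buildInitialPlans(const Loop &) {     // 2.a Build plans
  // One candidate per legal VF/UF pair (and, eventually, per loop level).
  return {{2, 1}, {4, 1}, {4, 2}};
}

void optimizePlan(VPlan &) {                             // 2.b Optimize plans
  // Plan-to-plan transformations; may consult the cost step repeatedly.
}

unsigned cost(const VPlan &P) {                          // 3. Cost Step
  return 100 / (P.VF * P.UF);                            // toy cost function
}

void materialize(const VPlan &P) {                       // 4. Transform Step
  // The only step that modifies the IR.
  std::printf("materializing plan with VF=%u UF=%u\n", P.VF, P.UF);
}

int main() {
  Loop L;
  if (!checkLegal(L))
    return 0;
  std::vector<VPlan> Plans = buildInitialPlans(L);
  for (VPlan &P : Plans)
    optimizePlan(P);
  const VPlan &Best = *std::min_element(
      Plans.begin(), Plans.end(),
      [](const VPlan &A, const VPlan &B) { return cost(A) < cost(B); });
  materialize(Best);
}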

> The Vectorization Plan is a recipe describing the vectorization
> candidate, including for instance its internal control-flow. It
> serves for both estimating the cost and for performing the
> translation, and facilitates dealing with multiple vectorization
> candidates. One can compare with LLVM’s existing SLP vectorizer,
> where TSLP [3] adds step 2.b.
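
One plausible shape for such a plan, sketched below purely as an assumption of what a recipe-based model could look like, is a small CFG of blocks whose recipes know both their estimated cost and how to materialize themselves; the same structure then serves both the Cost and Transform steps. The names are again illustrative only:

// Hypothetical data-structure sketch; these names do not describe an
// existing LLVM API.
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Recipe {
  virtual ~Recipe() = default;
  virtual unsigned estimatedCost(unsigned VF) const = 0;  // used by Cost Step
  virtual void execute(unsigned VF) const = 0;            // used by Transform Step
};

struct WidenRecipe : Recipe {
  std::string Opcode;
  explicit WidenRecipe(std::string Op) : Opcode(std::move(Op)) {}
  unsigned estimatedCost(unsigned) const override { return 1; }
  void execute(unsigned) const override { /* emit the widened instruction */ }
};

struct PlanBlock {
  std::vector<std::unique_ptr<Recipe>> Recipes;
  std::vector<PlanBlock *> Successors;   // explicit internal control flow,
                                         // so inner loops can be modeled
};

struct VectorizationPlan {
  std::vector<std::unique_ptr<PlanBlock>> Blocks;
  unsigned VF = 1;

  unsigned estimatedCost() const {       // consumed by the Cost Step
    unsigned C = 0;
    for (const auto &B : Blocks)
      for (const auto &R : B->Recipes)
        C += R->estimatedCost(VF);
    return C;
  }

  void execute() const {                 // consumed by the Transform Step
    for (const auto &B : Blocks)
      for (const auto &R : B->Recipes)
        R->execute(VF);
  }
};

int main() {
  VectorizationPlan Plan;
  Plan.VF = 4;
  auto Block = std::make_unique<PlanBlock>();
  Block->Recipes.push_back(std::make_unique<WidenRecipe>("fmul"));
  Block->Recipes.push_back(std::make_unique<WidenRecipe>("fadd"));
  Plan.Blocks.push_back(std::move(Block));
  return static_cast<int>(Plan.estimatedCost());  // 2 in this toy example
}

Because estimatedCost() and execute() walk the same recipes, the cost model and the generated code cannot drift apart in the way the comment quoted above warns about.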

> As the scope of vectorization grows from innermost to outer loops, so
> do the uncertainty and complexity of each step. One way to mitigate
> the shortcomings of the Legal and Cost steps is to rely on
> programmers to indicate which loops can and/or should be vectorized.
> This is implicit for certain loops in data-parallel languages such
> as OpenCL [4,5] and explicit in others such as OpenMP [6]. This
> proposal to extend the Loop Vectorizer to outer loops supports and
> raises the importance of explicit vectorization beyond the current
> capabilities of Clang and LLVM. Namely, from currently forcing the
> vectorization of innermost loops according to prescribed width
> and/or interleaving count, to supporting OpenMP’s “#pragma omp simd”
> construct and associated clauses, including our earlier proposal for
> vectorizing across function boundaries [2].
Our current metadata scheme for controlling the loop vectorizer should prove easy to extend, if extension is necessary at all, to explicitly provide vectorization factors, etc., for the various loop-nest levels. 
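
For illustration, the source-level hints involved might look as follows. The innermost-loop pragmas and their !llvm.loop metadata (e.g. llvm.loop.vectorize.width, llvm.loop.interleave.count) already exist; honoring the simd construct on an outer loop is precisely what this proposal would add, so the second function is aspirational:

// The outer-loop form below is what the proposal would enable; today's
// vectorizer only honors such hints on innermost loops.
void saxpy2d_inner(float *A, const float *B, float a, int M, int N) {
  for (int i = 0; i < M; ++i) {
    // Existing per-loop hints, lowered to !llvm.loop metadata.
#pragma clang loop vectorize(enable) vectorize_width(4) interleave_count(2)
    for (int j = 0; j < N; ++j)
      A[i * N + j] += a * B[i * N + j];
  }
}

void saxpy2d_outer(float *A, const float *B, float a, int M, int N) {
  // OpenMP simd on the outer loop: vectorize across iterations of i,
  // keeping the j loop inside the vectorized body.
#pragma omp simd
  for (int i = 0; i < M; ++i)
    for (int j = 0; j < N; ++j)
      A[i * N + j] += a * B[i * N + j];
}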

When we have profiling data, or can statically determine the inner-loop trip counts, it should also be possible to make a reasonable decision about outer-loop vectorization. We'll need to put some thought into how to collect inner-loop trip counts from profiling data when the code is outer-loop vectorized. There may also be cases where outer-loop vectorization always seems more profitable, but we obviously can't evaluate an exponential number of potential vectorization plans, so we'll need to design some heuristic (perhaps based on reductions, consecutive memory accesses, etc.) to limit the search space. 
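
A purely hypothetical sketch of such a pruning heuristic, with made-up fields and thresholds, might look like this:

// Hypothetical pruning heuristic; every field and threshold here is an
// assumption for illustration only.
#include <cstdint>
#include <vector>

struct CandidateLoop {
  uint64_t ExpectedTripCount;   // from profile data or SCEV, when known
  bool HasReduction;            // a reduction is carried at this level
  bool HasConsecutiveAccesses;  // unit-stride memory accesses at this level
};

// Keep only the candidates that look plausibly profitable, so the planner
// never enumerates an exponential number of vectorization plans.
std::vector<CandidateLoop>
pruneCandidates(const std::vector<CandidateLoop> &Loops) {
  std::vector<CandidateLoop> Kept;
  for (const CandidateLoop &L : Loops) {
    // Very short trip counts rarely pay for the vectorization overhead.
    if (L.ExpectedTripCount < 4)
      continue;
    // Favor levels where memory accesses stay consecutive or a reduction
    // can be vectorized; drop the rest to bound the search space.
    if (!L.HasConsecutiveAccesses && !L.HasReduction)
      continue;
    Kept.push_back(L);
  }
  return Kept;
}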

This is an important capability; thanks for pursuing it! 

-Hal 

> Comments on this overall approach are welcome. The first patches
> we’re working on are designed to have the innermost Loop Vectorizer
> explicitly model the control flow of its vectorized loop. More will
> be presented in our technical talk at the upcoming LLVM Developers'
> Meeting.

> References:
> -----------
> [1] “Outer-loop vectorization: revisited for short SIMD
> architectures”, Dorit Nuzman and Ayal Zaks, PACT 2008.
> [2] “Proposal for function vectorization and loop vectorization with
> function calls”, Xinmin Tian, [cfe-dev] March 2, 2016 (
> http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html . See
> also https://reviews.llvm.org/D22792 ).
> [3] “Throttling Automatic Vectorization: When Less is More”,
> Vasileios Porpodas and Tim Jones, PACT 2015 and LLVM Developers'
> Meeting 2015.
> [4] “Intel OpenCL SDK Vectorizer”, Nadav Rotem, LLVM Developers'
> Meeting 2011.
> [5] “Automatic SIMD Vectorization of SSA-based Control Flow Graphs”,
> Ralf Karrenberg, Springer 2015. See also “Improving Performance of
> OpenCL on CPUs”, LLVM Developers' Meeting 2012.
> [6] “Compiling C/C++ SIMD Extensions for Function and Loop
> Vectorization on Multicore-SIMD Processors”, Xinmin Tian and Hideki
> Saito et al., IPDPSW 2012.

-- 

Hal Finkel 
Lead, Compiler Technology and Programming Languages 
Leadership Computing Facility 
Argonne National Laboratory 