[PATCH] D99750: [LV, VP] RFC: VP intrinsics support for the Loop Vectorizer (Proof-of-Concept)

Vineet Kumar via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 1 10:32:54 PDT 2021


vkmr created this revision.
vkmr added reviewers: rogfer01, simoll.
Herald added subscribers: luismarques, s.egerton, PkmX, simoncook, bollu, hiraditya, kristof.beyls.
vkmr requested review of this revision.
Herald added subscribers: llvm-commits, jdoerfert.
Herald added a project: LLVM.

Abstract
========

As Vector Predication intrinsics are being introduced in LLVM, we propose
extending the Loop Vectorizer to target these intrinsics. SIMD ISAs such as the
RISC-V V-extension, NEC SX-Aurora, and Power VSX with active vector length
predication support can especially benefit from this, since there is currently
no reasonable way in the IR to model the active vector length in vector
instructions.

ISAs such as AVX512 and ARM SVE, which support masked vector predication,
would benefit by being able to use predicated operations beyond just the
memory operations currently available via the masked
load/store/gather/scatter intrinsics.

This patch shows a proof of concept implementation that demonstrates LV
generating VP intrinsics for simple integer operations on fixed vectors.
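
For example, for a loop adding two i32 arrays, the vectorized loop body would
contain a call like the following (a hand-written sketch of the intended
output with illustrative value names, not verbatim output of this patch;
%mask and %evl are the predicate and explicit-vector-length operands):

  ; <4 x i32> add, predicated by %mask and limited to the first %evl lanes
  %sum = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %va, <4 x i32> %vb,
                                           <4 x i1> %mask, i32 %evl)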

Details and Strategy
====================

Currently the Loop Vectorizer supports vector predication only in a very
limited capacity, via tail-folding and the masked load/store/gather/scatter
intrinsics. This does not let architectures with active vector length
predication support take advantage of their capabilities, and architectures
with general masked predication support can only take advantage of predication
on memory operations. By having a way for the Loop Vectorizer to generate
Vector Predication intrinsics, which (will) provide a target-independent way
to model predicated vector instructions, these architectures can make better
use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing
tail-folding mechanism in the LV: instead of generating masked intrinsics for
the memory operations only, it generates VP intrinsics for the memory
operations as well as for all the arithmetic operations.
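
As an illustration of the difference (hand-written IR with illustrative names;
the vp.load/vp.store signatures follow the proposal in D99355, see the note
further below), the existing tail-folding predicates only the memory accesses
and leaves the arithmetic unpredicated:

  ; Existing tail-folding: masked memory ops, unpredicated arithmetic.
  %va = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %pa, i32 4,
                                                       <4 x i1> %mask, <4 x i32> undef)
  %sum = add <4 x i32> %va, %vb
  call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %sum, <4 x i32>* %pc,
                                             i32 4, <4 x i1> %mask)

With this patch, every operation carries both the mask and the EVL:

  ; This patch: VP intrinsics for memory and arithmetic alike.
  %va.vp = call <4 x i32> @llvm.vp.load.v4i32.p0v4i32(<4 x i32>* %pa,
                                                      <4 x i1> %mask, i32 %evl)
  %sum.vp = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %va.vp, <4 x i32> %vb,
                                              <4 x i1> %mask, i32 %evl)
  call void @llvm.vp.store.v4i32.p0v4i32(<4 x i32> %sum.vp, <4 x i32>* %pc,
                                         <4 x i1> %mask, i32 %evl)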

The other important part of this approach is how the Explicit Vector Length is
computed. (We use "active vector length" and "explicit vector length"
interchangeably; the VP intrinsics define this vector-length parameter as the
Explicit Vector Length (EVL).) We consider the following three ways to compute
the EVL parameter for the VP intrinsics:

- The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as the one computed by the current tail-folding implementation.
- The second way is to insert instructions that compute `min(VF, trip_count - index)` for each vector iteration (sketched below).
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, `set_vector_length`, that can be lowered to architecture-specific instruction(s) to compute the EVL.

For the last two ways, if there is no outer mask, we use an all-true boolean
vector for the mask parameter of the VP intrinsics. (We do not yet support
control flow in the loop body.)
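
For the second option, the emitted IR could look like the following (a minimal
sketch with illustrative names, assuming VF = 4):

  ; EVL = min(VF, trip_count - index), recomputed in every vector iteration
  %remaining = sub i64 %trip.count, %index
  %vf.fits = icmp ule i64 4, %remaining
  %evl.64 = select i1 %vf.fits, i64 4, i64 %remaining
  %evl = trunc i64 %evl.64 to i32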

We have also extended VPlan to add new recipes for `PREDICATED-WIDENING` of
arithmetic operations and memory operations, and a recipe to emit instructions
for computing EVL. Using VPlan in this way will eventually help build and
compare VPlans corresponding to different strategies and alternatives.
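
To illustrate, a VPlan dump of the tail-folded body could then look roughly
like this (a hand-written sketch; the recipe names and operand printing here
are illustrative, not the verbatim output of this patch):

  <x1> vector loop: {
    EMIT vp<%evl> = EVL vp<%index>, ir<%trip.count>
    PREDICATED-WIDEN ir<%va> = load ir<%pa>, vp<%mask>, vp<%evl>
    PREDICATED-WIDEN ir<%sum> = add ir<%va>, ir<%vb>, vp<%mask>, vp<%evl>
    PREDICATED-WIDEN store ir<%pc>, ir<%sum>, vp<%mask>, vp<%evl>
  }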

Alternate vectorization strategies with predication
===================================================

In addition to the tail-folding based vectorization strategy, we are
considering two other vectorization strategies (not implemented yet):

1. Non-predicated body followed by a predicated vectorized tail - This will generate a vector body without any predication (except for control flow), same as the existing approach of a vector body with a scalar tail loop. The tail, however, will be vectorized using the VP intrinsics with `EVL = trip_count % VF` (see the sketch after this list). While this approach will result in larger code size, it might be more efficient than our currently implemented approach: it has straight-line code for the tail, and the vector body is free of all the overhead of using intrinsics.

2. Another strategy could be to use the tail-folding based approach but apply predication only to the memory operations. This might be beneficial for architectures like Power VSX that support vector length predication only for memory operations.
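
A minimal end-to-end sketch of the first strategy for an add-one loop over i32
(hand-written IR, assuming VF = 4, an all-true mask since there is no control
flow, and the vp.load/vp.store signatures proposed in D99355):

  define void @add_one(i32* %a, i64 %n) {
  entry:
    %tc.main = and i64 %n, -4            ; main-loop trip count, a multiple of VF=4
    %has.main = icmp ne i64 %tc.main, 0
    br i1 %has.main, label %vector.body, label %vector.tail

  vector.body:                           ; unpredicated main vector loop
    %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
    %pa = getelementptr i32, i32* %a, i64 %index
    %pa.vec = bitcast i32* %pa to <4 x i32>*
    %va = load <4 x i32>, <4 x i32>* %pa.vec, align 4
    %vr = add <4 x i32> %va, <i32 1, i32 1, i32 1, i32 1>
    store <4 x i32> %vr, <4 x i32>* %pa.vec, align 4
    %index.next = add i64 %index, 4
    %done = icmp eq i64 %index.next, %tc.main
    br i1 %done, label %vector.tail, label %vector.body

  vector.tail:                           ; one predicated iteration, EVL = n % VF
    %rem = urem i64 %n, 4
    %evl = trunc i64 %rem to i32
    %pt = getelementptr i32, i32* %a, i64 %tc.main
    %pt.vec = bitcast i32* %pt to <4 x i32>*
    %vt = call <4 x i32> @llvm.vp.load.v4i32.p0v4i32(<4 x i32>* %pt.vec,
              <4 x i1> <i1 true, i1 true, i1 true, i1 true>, i32 %evl)
    %vt.add = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %vt,
              <4 x i32> <i32 1, i32 1, i32 1, i32 1>,
              <4 x i1> <i1 true, i1 true, i1 true, i1 true>, i32 %evl)
    call void @llvm.vp.store.v4i32.p0v4i32(<4 x i32> %vt.add, <4 x i32>* %pt.vec,
              <4 x i1> <i1 true, i1 true, i1 true, i1 true>, i32 %evl)
    ret void
  }

  declare <4 x i32> @llvm.vp.load.v4i32.p0v4i32(<4 x i32>*, <4 x i1>, i32)
  declare <4 x i32> @llvm.vp.add.v4i32(<4 x i32>, <4 x i32>, <4 x i1>, i32)
  declare void @llvm.vp.store.v4i32.p0v4i32(<4 x i32>, <4 x i32>*, <4 x i1>, i32)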



Caveats / Current limitations / Current status of things
========================================================

This patch is far from complete; it is meant as a proof of concept (to be
broken into smaller, more concrete patches once we have more feedback from the
community) with the aim to:

- demonstrate the feasibility of targeting the VP intrinsics from the Loop Vectorizer.
- start a deeper implementation-backed discussion around vector predication support in LLVM.

That being said, there are several limitations at the moment; some need more
supporting implementation and some need more discussion:

- For the purpose of demonstration, we use a command line switch that forces VP intrinsic support; it needs tail-folding enabled to work.
- VP intrinsic development is going on in parallel; upstream currently supports only the integer arithmetic intrinsics.
- No support for control flow in the loop.
- No support for interleaving.
- We need more discussion around the best approach for computing the EVL parameter. If an intrinsic is used, more thought needs to go into its semantics. Also, the VPlan recipe for the EVL is currently a placeholder, with widening delegated to the vectorizer.
- We do not use the `llvm.get.active.lane.mask` intrinsic yet, but it is something we are considering for the future (see the snippet after this list).
- No support for scalable vectors yet (due to missing tail-folding support for scalable vectors).
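
For reference, that upstream intrinsic computes the tail-folding mask
directly; lane i of the result is true iff %base + i < %n (shown here for
<4 x i1> with i64 arguments):

  %mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %base, i64 %n)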

Note: If you are interested in how this may work end-to-end for scalable
vectors, do take a look at our [downstream implementation for RISC-V][RVV-Impl]
and an end-to-end [demo on Compiler Explorer][Demo].

Note: This patch also includes our implementation of the `vp_load` and
`vp_store` intrinsics. There is currently a more complete [patch][D99355] open
for review, which we will switch to once it lands.
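
For reference, the declarations we use follow the signatures proposed in
D99355 (they may still change with review there):

  declare <4 x i32> @llvm.vp.load.v4i32.p0v4i32(<4 x i32>* %ptr, <4 x i1> %mask, i32 %evl)
  declare void @llvm.vp.store.v4i32.p0v4i32(<4 x i32> %val, <4 x i32>* %ptr, <4 x i1> %mask, i32 %evl)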

Tentative Development Roadmap
=============================

Our plan is to start by integrating the functionality in this patch, with the
changes/enhancements agreed upon by the community. As next steps, we want to:

- Support VP-intrinsic-based vectorization of scalable vectors (starting with enabling tail-folding for scalable vectors, if that is still required by then).
- Support for floating point operations.
- Support for control flow in the loop.
- Support for more complicated loops - reductions, inductions, recurrences, reverse.

[RVV-Impl]: https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi
[Demo]: https://repo.hca.bsc.es/epic/z/9eYRIF
[D99355]: https://reviews.llvm.org/D99355


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D99750

Files:
  llvm/include/llvm/Analysis/TargetTransformInfo.h
  llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
  llvm/include/llvm/IR/IRBuilder.h
  llvm/include/llvm/IR/Intrinsics.td
  llvm/lib/Analysis/TargetTransformInfo.cpp
  llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
  llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
  llvm/lib/Transforms/Vectorize/VPlan.cpp
  llvm/lib/Transforms/Vectorize/VPlan.h
  llvm/lib/Transforms/Vectorize/VPlanValue.h
  llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics.ll
  llvm/test/Transforms/LoopVectorize/vplan-vp-intrinsics.ll
