[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Sat Jul 7 06:46:25 PDT 2018

Hi Graham,

thanks again. The changes and additions all make sense to me, I just
have one minor comment about shufflevector.

On 2 July 2018 at 11:53, Graham Hunter <Graham.Hunter at arm.com> wrote:
> Hi,
>
> I've updated the RFC slightly based on the discussion within the thread, reposted below. Let me know if I've missed anything or if more clarification is needed.
>
> Thanks,
>
> -Graham
>
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
>
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this rfc) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
>
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> ========
> Contents
> ========
>
> The rest of this RFC covers the following topics:
>
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>    have a length which is a runtime-determined multiple of a known base length.
>
> 2. Size Queries - how to reason about the size of types for which the size isn't
>    fully known at compile time.
>
> 3. Representing the runtime multiple of vector length in IR for use in address
>    calculations and induction variable comparisons.
>
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>    number of elements.
>
> 5. An explanation of splitting/concatentating scalable vectors.
>
> 6. A brief note on code generation of these new operations for AArch64.
>
> 7. An example of C code and matching IR using the proposed extensions.
>
> 8. A list of patches demonstrating the changes required to emit SVE instructions
>    for a loop that has already been vectorized using the extensions described
>    in this RFC.
>
> ========
> 1. Types
> ========
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
>
> This mixture of compile-time and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<scalable <n> x <type>>``
>
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
>
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>   elements.
>
> ``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of
>   bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
>
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> ===============
> 2. Size Queries
> ===============
>
> This is a proposal for how to deal with querying the size of scalable types for
> analysis of IR. While it has not been implemented in full, the general approach
> works well for calculating offsets into structures with scalable types in a
> modified version of ComputeValueVTs in our downstream compiler.
>
> For current IR types that have a known size, all query functions return a single
> integer constant. For scalable types a second integer is needed to indicate the
> number of bytes/bits which need to be scaled by the runtime multiple to obtain
> the actual length.
>
> For primitive types, `getPrimitiveSizeInBits()` will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
>
> For derived types, a function `getScalableSizePairInBits()` will be added, which
> returns a pair of integers (one to indicate unscaled bits, the other for bits
> that need to be scaled by the runtime multiple). For backends that do not need
> to deal with scalable types the existing methods will suffice, but a debug-only
> assert will be added to them to ensure they aren't used on scalable types.
>
> Similar functionality will be added to DataLayout.
>
> Comparisons between sizes will use the following methods, assuming that X and
> Y are non-zero integers and the form is of { unscaled, scaled }.
>
> { X, 0 } <cmp> { Y, 0 }: Normal unscaled comparison.
>
> { 0, X } <cmp> { 0, Y }: Normal comparison within a function, or across
>                          functions that inherit vector length. Cannot be
>                          compared across non-inheriting functions.
>
> { X, 0 } > { 0, Y }: Cannot return true.
>
> { X, 0 } = { 0, Y }: Cannot return true.
>
> { X, 0 } < { 0, Y }: Can return true.
>
> { Xu, Xs } <cmp> { Yu, Ys }: Gets complicated, need to subtract common
>                              terms and try the above comparisons; it
>                              may not be possible to get a good answer.
>
> It's worth noting that we don't expect the last case (mixed scaled and
> unscaled sizes) to occur. Richard Sandiford's proposed C extensions
> (http://lists.llvm.org/pipermail/cfe-dev/2018-May/057830.html) explicitly
> prohibits mixing fixed-size types into sizeless struct.
>
> I don't know if we need a 'maybe' or 'unknown' result for cases comparing scaled
> vs. unscaled; I believe the gcc implementation of SVE allows for such
> results, but that supports a generic polynomial length representation.
>
> My current intention is to rely on functions that clone or copy values to
> check whether they are being used to copy scalable vectors across function
> boundaries without the inherit vlen attribute and raise an error there instead
> of requiring passing the Function a type size is from for each comparison. If
> there's a strong preference for moving the check to the size comparison function
> let me know; I will be starting work on patches for this later in the year if
> there's no major problems with the idea.

Seems reasonable.

> Future Work
> -----------
>
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
>
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
>
> Alternatives Considered
> -----------------------
>
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
>
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'requiresSignExtension(Type*,Type*)'.
>
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
>
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
>
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
>
> Fixed-Length Code
> -----------------
>
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
>
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
>
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
>
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
>
> ===========================================
> 5. Splitting and Combining Scalable Vectors
> ===========================================
>
> Splitting and combining scalable vectors in IR is done in the same manner as
> for fixed-length vectors, but with a non-constant mask for the shufflevector.

It makes sense to have runtime-computed shuffle masks for some
architectures, especially those with runtime-variable vector lengths,
but lifting the current restriction that the shufflevector mask is a
constant affects all code that inspects the indices. There's lot such
code and as far as I've seen a fair amount of that code crucially
depends on the mask being constant. I'm not opposed to lifting the
restriction, but I want to call attention to it and double-check
everyone's okay with it because it seems like a big step and, unlike
other IR changes in this RFC, it isn't really necessary (we could also
use an intrinsic for these shuffles).

> The following is an example of splitting a <scalable 4 x double> into two
> separate <scalable 2 x double> values.
>
> ``
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   ;; Stepvector generates the element ids for first subvector
>   %sv1 = call <scalable 2 x i64> @llvm.experimental.vector.stepvector.nxv2i64()
>   ;; Add vscale * 2 to get the starting element for the second subvector
>   %ec = mul i64 %vscale64, 2
>   %ec.ins = insertelement <scalable 2 x i64> undef, i64 %ec, i32 0
>   %ec.splat = shufflevector <scalable 2 x i64> %9, <scalable 2 x i64> undef, <scalable 2 x i32> zeroinitializer
>   %sv2 = add <scalable 2 x i64> %ec.splat, %stepvec64
>   ;; Perform the extracts
>   %res1 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv1
>   %res2 = shufflevector <scalable 4 x double> %in, <scalable 4 x double> undef, <scalable 2 x i64> %sv2
> ``
>
> ==================
> 6. Code Generation
> ==================
>
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
>
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
>
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
>
> GlobalISel
> ----------
>
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
>
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
>
> ==========
> 7. Example
> ==========
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br vector.body
>
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> ==========
> 8. Patches
> ==========
>
> List of patches:
>
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
>