[cfe-dev] [llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Renato Golin via cfe-dev cfe-dev at lists.llvm.org
Mon Jun 5 09:55:28 PDT 2017


Hi Graham,

Just making sure some people who voiced concerns are copied + cfe-dev.

On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Hi,
>
> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>
> -Graham
>
> ===================================================
> Supporting Scalable Vector Architectures in LLVM IR
> ===================================================
>
> ==========
> Background
> ==========
>
> *ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
> is intended to scale with hardware such that the same binary running on a
> processor with longer vector registers can take advantage of the increased
> compute power without recompilation.
>
> As the vector length is no longer a statically known value, the way in which the
> LLVM vectorizer generates code required modifications such that certain values
> are now runtime evaluated expressions. This document describes changes proposed
> to LLVM that allow its vectorizer to better target SVE.
>
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>
> =====
> Types
> =====
>
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class. Most code that deals with vectors doesn't need
> to know the exact length, but does need to know relative lengths -- e.g. get
> a vector with the same number of elements but a different element type, or with
> half or double the number of elements.
>
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
>
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
>
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.

Does that mean that it is guaranteed that a division / multiplication
will only touch the minimum number of elements?

The cases I imagine could happen:

1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
  - does the predicate vector limit the end tail (half-scale)?
  - does it just pack more into the same vector (and update the
predicate computation)?
  - because it would have half the bytes, I assume the former, but in
theory, we could probably merge and get twice the performance, no?
  - or would this insert undefs in between? Say, a 2-way reduction:
    - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
    - In SIMD, this would use a smaller/less registers, but SVE is <n
x 128> right?
    - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
7+8, 9+10, ...>?
    - Or do you insert undefs?
      - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
      - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
  - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
  - can it be target-specific? If not, should it be made illegal?

2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice of the actual vector length.
  - Would this be resolved by the target? If so, how?
  - Non-scalable vectors are just done one after the other
  - But scalable vectors have no known end-tail:
    - Creating two <n x 4 x i32> may interpolate the original data, either:
      - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
      - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
   - or:
      - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
      - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
  - the first would makes more sense, but I'm not sure the scalar
evolution would work that way out of the box

3. Widening / Narrowing should use the existing syntax, I think.


>
> This mixture of static and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
>
> IR Textual Form
> ---------------
>
> The textual form for a scalable vector is:
>
> ``<[n x ]<m> x <type>>``
>
> where `type` is the scalar type of each element, `m` is the minimum number of
> elements, and the string literal `n x` indicates that the total number of
> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
> indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `n x` is omitted, so there is no change in the
> format for existing vectors.
>
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same:
>
> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>
> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
>
> IR Bitcode Form
> ---------------
>
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
>
> Alternatives Considered
> -----------------------
>
> We had two alternatives in mind -- a dedicated target specific type, and a
> subclass inheriting from VectorType.
>
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation.
>
> A subclass would take care of part of this, but would need still checks for
> whether a VectorType was scalable in each place a new VectorType was required.
>
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
>
> =================
> Runtime Constants
> =================
>
> With a scalable vector type defined, we now need a way to generate addresses for
> consecutive vector values in memory and to be able to create basic constant
> vector values.
>
> For address generation, the `vscale` constant is added to represent the runtime
> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
> bytes in `type` gives the total length of a scalable vector, and the backend
> can pattern match to the various load and store instructions in SVE that
> automatically scale with vector length.

How would this work in scalar evolution?

In a <4 x i32> loop, I know the block is 128-bits wide, so I can
predict loop carried dependencies, static ranges, etc.

IIUC, dependencies can still be avoided by predicating access to N-1
elements, but would short static ranges will have to be kept as a
loop, because unrolling is not beneficial if the run-time length is
larger than the unroll factor.

Actually, I can't think how you could possibly unroll anything with
this. I imagine two run-time checks + two SVE operations (+ predicate
fiddling) would be worse than a short sequence of 4 independent NEON
instructions, if the unroll factor is slightly larger than one
run-time vector.

This example is particular to SVE and pathological cases, but I worry
that there may be lots of cases where the new notation will make it
difficult to decide. Still, only a problem to targets that do ask for
SVE, so people are aware of the trade-offs.



> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used as with fixed-length vectors for a constant zero splat. This can
> then be combined with `insertelement` and `shufflevector` to create arbitrary
> value splats in the same manner as fixed-length vectors.
>
> For constants consisting of a sequence of values, the `stepvector` constant is
> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
> change the starting value a splat of the new start can be added, and changing
> the step requires multiplying by a splat.
>
> Alternatives Considered
> -----------------------
>
> We have chosen to represent these as constants to make pattern matching and
> constant folding easy, particularly for the mask argument of the
> `shufflevector` instruction.
>
> Another possibility is to use intrinsics similar to the old instructions, but
> currently these cannot be treated as constants. We wouldn't mind using
> intrinsics here, but would want to discuss how best to support constant folding
> and pattern matching without needing to add a lot more code.
>
> =======
> Example
> =======
>
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
>
> C Code
> ------
>
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
>
> Scalable IR Vector Body
> -----------------------
>
> ``
> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>                                 [ stepvector, %vector.body.preheader ]
>   %1 = add i64 %0, mul (i64 vscale, i64 4)
>
>            ;; vscale splat used to increment identity vector ;;
>   %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>                                                                                         i32 mul (i32 vscale, i32 4), i32 0),
>                                                                                         <n x 4 x i32> undef,
>                                                                                         <n x 4 x i32> zeroinitializer)

This syntax really gets me. Compare this with:

  %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>

It's going to be hard to infer anything from that syntax, which means
the resulting loop will be, for most purposes, as opaque as having an
intrinsic in there.


>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <n x 4 x i32>*
>   store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 vscale, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
>
> =====================
> Questions and Answers
> =====================
>
> Can the vector length change at runtime?
> ----------------------------------------
>
> It is possible to change vector length at runtime, but there is no model defined
> for how to resolve all the issues that arise when doing so. From the compiler's
> point of view, it is expected that vector length can be treated as constant but
> unknown.
>
> Since there's currently no way to indicate to the compiler where a change might
> occur, a programmer deliberately requesting changes to the vector length in the
> middle of a function containing autovectorized code would be considered a bug.

User error. Yes.


> If changing the VL at runtime becomes desireable at some point in the future,
> the types and constants presented in this design would not need to change; we
> would need some way of creating a barrier between sections of IR which might
> have a different vscale, but that would be an addition to the current design
> instead of a change.

No changes to IR, AFAIK. Just on how we use it.


> How do we spill/fill scalable registers on the stack?
> -----------------------------------------------------
>
> SVE registers have a (partially) unknown size at build time and their associated
> fill/spill instructions require an offset that is implicitly scaled by the
> vector length instead of bytes or element size. To accommodate this we
> created the concept of Stack Regions that are areas on the stack associated
> with specific data types or register classes.
>
> MachineFrameInfo has been extended with a list of Stack Regions that each
> contain a list of associated objects. Each Stack Region controls its own
> allocation, deallocation and internal offset calculations. For SVE we create
> separate regions for data and predicate registers, so the exact layout does
> not need to be known ahead of time, just relative offsets within regions.

Can this be mapped into IR address spaces?

If some target make use of changing VL as you describe above, this
will break catastrophically, I imagine.


> Objects not belonging to a Stack Region use the default handling, so other
> existing targets are not affected.
>
> A more complete design for this will be detailed in a dedicated RFC later.

I think this was a contentious enough point that might need a peek
into that RFC.

Targets that don't have SVE will not suffer from changes in
MachineFrameInfo, but AArch64 may, and I'm curious on how that's
solved.

cheers,
--renato



More information about the cfe-dev mailing list