[llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Renato Golin via llvm-dev llvm-dev at lists.llvm.org
Mon Jun 26 07:01:54 PDT 2017


Chris, Chandler, ping. Any comments?

On 7 June 2017 at 13:52, Graham Hunter <Graham.Hunter at arm.com> wrote:
> Hi Renato,
>
> Thanks for taking a look. Answers inline below, let me know if I've missed something out.
>
> -Graham
>
>
>> On 5 Jun 2017, at 17:55, Renato Golin <renato.golin at linaro.org> wrote:
>>
>> Hi Graham,
>>
>> Just making sure some people who voiced concerns are copied + cfe-dev.
>>
>> On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>> Hi,
>>>
>>> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>>>
>>> -Graham
>>>
>>> ===================================================
>>> Supporting Scalable Vector Architectures in LLVM IR
>>> ===================================================
>>>
>>> ==========
>>> Background
>>> ==========
>>>
>>> The *ARMv8-A Scalable Vector Extension* (SVE) is a vector ISA extension for
>>> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
>>> is intended to scale with hardware such that the same binary running on a
>>> processor with longer vector registers can take advantage of the increased
>>> compute power without recompilation.
>>>
>>> Since the vector length is no longer statically known, the way in which the
>>> LLVM vectorizer generates code required modification: certain values that were
>>> previously compile-time constants are now runtime-evaluated expressions. This
>>> document describes the changes proposed to LLVM that allow its vectorizer to
>>> better target SVE.
>>>
>>> Documentation for SVE can be found at
>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>>
>>> =====
>>> Types
>>> =====
>>>
>>> To represent a vector of unknown length, a boolean `Scalable` property has been
>>> added to the `VectorType` class. Most code that deals with vectors doesn't need
>>> to know the exact length, but does need to know relative lengths -- e.g. get
>>> a vector with the same number of elements but a different element type, or with
>>> half or double the number of elements.
>>>
>>> In order to allow code to transparently support scalable vectors, we introduce
>>> an `ElementCount` class with two members:
>>>
>>> - `unsigned Min`: the minimum number of elements.
>>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>>
>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>>> equal to one and thus `Min` represents the exact number of elements in the
>>> vector.
>>>
>>> The intent is for code working with vectors to use convenience methods and
>>> avoid dealing directly with the number of elements. If needed, calling
>>> `getElementCount` on a vector type instead of `getVectorNumElements` yields
>>> the (potentially scalable) number of elements. Overloaded division and
>>> multiplication operators allow an `ElementCount` instance to be used in much
>>> the same manner as an integer in most cases.
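>>>
>>> As a minimal sketch (illustrative only, not the exact definition from our
>>> patches), the class could look like this:
>>>
>>> ``
>>> // Element count of a vector: Min elements, multiplied by an unknown
>>> // hardware-dependent factor (vscale) when Scalable is true.
>>> class ElementCount {
>>> public:
>>>   unsigned Min;  // minimum number of elements
>>>   bool Scalable; // is the count an unknown multiple of Min?
>>>
>>>   ElementCount(unsigned Min, bool Scalable)
>>>     : Min(Min), Scalable(Scalable) {}
>>>
>>>   // Halving or doubling the element count preserves scalability.
>>>   ElementCount operator/(unsigned RHS) const {
>>>     return ElementCount(Min / RHS, Scalable);
>>>   }
>>>   ElementCount operator*(unsigned RHS) const {
>>>     return ElementCount(Min * RHS, Scalable);
>>>   }
>>> };
>>> ``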
>>
>> Does that mean that it is guaranteed that a division / multiplication
>> will only touch the minimum number of elements?
>>
>> The cases I imagine could happen:
>>
>> 1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
>>  - does the predicate vector limit the end tail (half-scale)?
>>  - does it just pack more into the same vector (and update the
>> predicate computation)?
>>  - because it would have half the bytes, I assume the former, but in
>> theory, we could probably merge and get twice the performance, no?
>>  - or would this insert undefs in between? Say, a 2-way reduction:
>>    - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
>>    - In SIMD, this would use smaller/fewer registers, but SVE is
>> <n x 128>, right?
>>    - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
>> 7+8, 9+10, ...>?
>>    - Or do you insert undefs?
>>      - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
>>      - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
>>  - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
>>  - can it be target-specific? If not, should it be made illegal?
>>
>> 2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice the actual vector length.
>>  - Would this be resolved by the target? If so, how?
>>  - Non-scalable vectors are just done one after the other
>>  - But scalable vectors have no known end-tail:
>>    - Creating two <n x 4 x i32> may interleave the original data, either:
>>      - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
>>      - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
>>   - or:
>>      - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
>>      - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
>>  - the first would make more sense, but I'm not sure scalar
>> evolution would work that way out of the box
>>
>> 3. Widening / Narrowing should use the existing syntax, I think.
>
> Yes, it will only operate on the correct number of elements.
>
> So SelectionDAG legalization of target-unsupported IR types happens in
> exactly the same way it does for fixed-length types, e.g.
>  * The <n x 8 x i32> above would be split into two <n x 4 x i32> values.
>  * The <n x 2 x i32> above would be extended to an <n x 2 x i64> value.
>
> When splitting, the linear order of elements would be preserved, so given a
> vector like <1, 2, 3, ... 2n> you would end up with
>  <1, 2, ... n> and <n+1, n+2, ... 2n>.
> If you need a different arrangement, the shufflevector instruction can be used
> with appropriate use of stepvector and vscale to generate masks.
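>
> As a purely hypothetical sketch, a shuffle that de-interleaves the
> even-indexed elements of %v1:%v2 could build a <0, 2, 4, ...> mask from the
> proposed stepvector constant multiplied by a constant splat of 2:
>
> ``
> %even = shufflevector <n x 4 x i32> %v1, <n x 4 x i32> %v2,
>           <n x 4 x i32> mul (<n x 4 x i32> stepvector,
>                              <n x 4 x i32> shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef, i32 2, i32 0),
>                                                           <n x 4 x i32> undef,
>                                                           <n x 4 x i32> zeroinitializer))
> ``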
>
> The IR predicates for a given scalable vector (for select) just match the
> number of elements, e.g. <n x 8 x i1> for the <n x 8 x i32>, and would be
> split along with it during legalization.
>
> As you say, the main legal types for SVE are based around multiples of
> 128b base types, but for some types we've had to use predication to help.
> An <n x 2 x f32> can't be extended to use f64s, but when generating code
> we can create a predicate for 64b element types and use that with 32b float
> instructions, so that the vector interleaves f32 values with 32bit undefs
> and gives you an effective <n x 2 x f32> vector.
>
> For horizontal pair operations, you would use shuffle instructions to align
> the operands in matching lanes for two vectors and then perform the operation.
>>
>>
>>>
>>> This mixture of static and runtime quantities allows us to reason about the
>>> relationship between different scalable vector types without knowing their
>>> exact length.
>>>
>>> IR Textual Form
>>> ---------------
>>>
>>> The textual form for a scalable vector is:
>>>
>>> ``<[n x ]<m> x <type>>``
>>>
>>> where `type` is the scalar type of each element, `m` is the minimum number of
>>> elements, and the string literal `n x` indicates that the total number of
>>> elements is an unknown multiple of `m`. The letter `n` is an arbitrary choice
>>> to indicate that the vector is scalable; another letter could have been used.
>>> For fixed-length vectors, the `n x` is omitted, so there is no change in the
>>> format for existing vectors.
>>>
>>> Scalable vectors with the same `Min` value have the same number of elements;
>>> they also have the same number of bytes if `Min * sizeof(type)` matches:
>>>
>>> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>>>
>>> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
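>>>
>>> As a small illustration of the syntax in context, here is a hypothetical
>>> function operating on scalable vectors:
>>>
>>> ``
>>> define <n x 4 x i32> @vadd(<n x 4 x i32> %a, <n x 4 x i32> %b) {
>>>   %sum = add <n x 4 x i32> %a, %b
>>>   ret <n x 4 x i32> %sum
>>> }
>>> ``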
>>>
>>> IR Bitcode Form
>>> ---------------
>>>
>>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>>> type record. If the field is not present, the type defaults to a fixed-length
>>> vector type, preserving backwards compatibility.
>>>
>>> Alternatives Considered
>>> -----------------------
>>>
>>> We had two alternatives in mind -- a dedicated target specific type, and a
>>> subclass inheriting from VectorType.
>>>
>>> A dedicated target type would either require extending all existing passes
>>> that work with vectors to recognize the new type, or duplicating all that code
>>> in order to get reasonable code generation.
>>>
>>> A subclass would take care of part of this, but checks would still be needed
>>> to determine whether a VectorType was scalable in each place a new VectorType
>>> was required.
>>>
>>> Although our current solution will need to change some of the code that creates
>>> new VectorTypes, much of that code doesn't need to care whether the types are
>>> scalable or not -- it can use preexisting methods like
>>> `getHalfElementsVectorType`. If the code is a little more complex,
>>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>>> the number of elements.
>>>
>>> =================
>>> Runtime Constants
>>> =================
>>>
>>> With a scalable vector type defined, we now need a way to generate addresses for
>>> consecutive vector values in memory and to be able to create basic constant
>>> vector values.
>>>
>>> For address generation, the `vscale` constant is added to represent the runtime
>>> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
>>> bytes in `type` gives the total size in bytes of a scalable vector, and the
>>> backend can pattern-match to the various load and store instructions in SVE that
>>> automatically scale with vector length.
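>>>
>>> For example (a hypothetical snippet, mirroring the constant expression form
>>> used in the full example later in this document), advancing a pointer past
>>> one whole <n x 4 x i32> vector could look like:
>>>
>>> ``
>>> ;; step over vscale * 4 i32 elements, i.e. one full scalable vector ;;
>>> %next.ptr = getelementptr i32, i32* %ptr, i64 mul (i64 vscale, i64 4)
>>> ``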
>>
>> How would this work in scalar evolution?
>>
>> In a <4 x i32> loop, I know the block is 128-bits wide, so I can
>> predict loop carried dependencies, static ranges, etc.
>>
>> IIUC, dependencies can still be avoided by predicating access to N-1
>> elements, but short static ranges would have to be kept as a loop,
>> because unrolling is not beneficial if the run-time length is larger
>> than the unroll factor.
>>
>> Actually, I can't think how you could possibly unroll anything with
>> this. I imagine two run-time checks + two SVE operations (+ predicate
>> fiddling) would be worse than a short sequence of 4 independent NEON
>> instructions, if the unroll factor is slightly larger than one
>> run-time vector.
>>
>> This example is particular to SVE and pathological cases, but I worry
>> that there may be lots of cases where the new notation will make it
>> difficult to decide. Still, it's only a problem for targets that do
>> ask for SVE, so people are aware of the trade-offs.
>
> There are many more cases for SVE where vectors 'may' overlap, since we don't
> have an exact size; for now, comparing size expressions featuring vscale
> will be overly cautious and generate very large ranges. This will cause
> extra checks to be planted for vectorized code that may not be needed,
> but will get us running vectorized code safely.
>
> Eventually, we'd like to upstream our work on controlling loops via
> predication, but that is a separate discussion (and one that will affect
> more people, since there's at least AVX-512 with a similar feature). For
> now we'd just like to get basic SVE support in.
>
> Unrolling and vectorizing do indeed have higher overhead when using VLA
> programming, so the cost model may need to be adjusted to account for that.
> We made sure it works for our downstream compiler but haven't used it for
> performance modelling.
>
>
>>
>>
>>
>>> For constant vector values, we cannot specify all the elements as we can for
>>> fixed-length vectors; fortunately only a small number of easily synthesized
>>> patterns are required for autovectorization. The `zeroinitializer` constant
>>> can be used as with fixed-length vectors for a constant zero splat. This can
>>> then be combined with `insertelement` and `shufflevector` to create arbitrary
>>> value splats in the same manner as fixed-length vectors.
>>>
>>> For constants consisting of a sequence of values, the `stepvector` constant is
>>> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
>>> change the starting value, a splat of the new start can be added; to change
>>> the step, the stepvector can be multiplied by a splat.
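>>>
>>> As a rough sketch (with hypothetical runtime values %start and %step), the
>>> sequence `<start, start+step, start+2*step, ...>` could be built as:
>>>
>>> ``
>>> %step.ins   = insertelement <n x 4 x i32> undef, i32 %step, i32 0
>>> %step.splat = shufflevector <n x 4 x i32> %step.ins, <n x 4 x i32> undef,
>>>                             <n x 4 x i32> zeroinitializer
>>> %scaled     = mul <n x 4 x i32> stepvector, %step.splat
>>> %start.ins   = insertelement <n x 4 x i32> undef, i32 %start, i32 0
>>> %start.splat = shufflevector <n x 4 x i32> %start.ins, <n x 4 x i32> undef,
>>>                              <n x 4 x i32> zeroinitializer
>>> %seq = add <n x 4 x i32> %scaled, %start.splat
>>> ``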
>>>
>>> Alternatives Considered
>>> -----------------------
>>>
>>> We have chosen to represent these as constants to make pattern matching and
>>> constant folding easy, particularly for the mask argument of the
>>> `shufflevector` instruction.
>>>
>>> Another possibility is to use intrinsics to produce these values instead, but
>>> currently intrinsic calls cannot be treated as constants. We wouldn't mind using
>>> intrinsics here, but would want to discuss how best to support constant folding
>>> and pattern matching without needing to add a lot more code.
>>>
>>> =======
>>> Example
>>> =======
>>>
>>> The following example shows a simple C loop which assigns to each array
>>> element the value of its own index. The IR shows how vscale and stepvector
>>> are used to create the needed values and to advance the index variable in the
>>> loop.
>>>
>>> C Code
>>> ------
>>>
>>> ``
>>> void IdentityArrayInit(int *a, int count) {
>>>  for (int i = 0; i < count; ++i)
>>>    a[i] = i;
>>> }
>>> ``
>>>
>>> Scalable IR Vector Body
>>> -----------------------
>>>
>>> ``
>>> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>>>  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>>
>>>           ;; stepvector used for index identity on entry to loop body ;;
>>>  %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>>>                                [ stepvector, %vector.body.preheader ]
>>>  %1 = add i64 %0, mul (i64 vscale, i64 4)
>>>
>>>           ;; vscale splat used to increment identity vector ;;
>>>  %step.add8 = add <n x 4 x i32> %vec.ind7,
>>>                   shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>>>                                                               i32 mul (i32 vscale, i32 4), i32 0),
>>>                                  <n x 4 x i32> undef,
>>>                                  <n x 4 x i32> zeroinitializer)
>>
>> This syntax really gets me. Compare this with:
>>
>>  %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>
>>
>> It's going to be hard to infer anything from that syntax, which means
>> the resulting loop will be, for most purposes, as opaque as having an
>> intrinsic in there.
>
> It looks nice for fixed-length vectors when the element values are immediate
> constants. If instead of '4' the value came from an argument or was loaded from
> memory, the fixed-length version would look just as verbose -- that's how
> splats are represented in IR, and using an
> intrinsic wouldn't change that. Existing optimizations will recognize the
> splat in this form.
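>
> For reference, a sketch of the existing fixed-length splat idiom for a
> runtime value %x (hypothetical names):
>
> ``
> %ins   = insertelement <4 x i32> undef, i32 %x, i32 0
> %splat = shufflevector <4 x i32> %ins, <4 x i32> undef, <4 x i32> zeroinitializer
> %a     = add <4 x i32> %b, %splat
> ``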
>
>>
>>
>>>  %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>>  %3 = bitcast i32* %2 to <n x 4 x i32>*
>>>  store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>>>
>>>           ;; vscale used to increment loop index ;;
>>>  %index.next = add i64 %index, mul (i64 vscale, i64 4)
>>>  %4 = icmp eq i64 %index.next, %n.vec
>>>  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>>> ``
>>>
>>> =====================
>>> Questions and Answers
>>> =====================
>>>
>>> Can the vector length change at runtime?
>>> ----------------------------------------
>>>
>>> It is possible to change vector length at runtime, but there is no model defined
>>> for how to resolve all the issues that arise when doing so. From the compiler's
>>> point of view, it is expected that vector length can be treated as constant but
>>> unknown.
>>>
>>> Since there's currently no way to indicate to the compiler where a change might
>>> occur, a programmer deliberately requesting changes to the vector length in the
>>> middle of a function containing autovectorized code would be considered a bug.
>>
>> User error. Yes.
>>
>>
>>> If changing the VL at runtime becomes desirable at some point in the future,
>>> the types and constants presented in this design would not need to change; we
>>> would need some way of creating a barrier between sections of IR which might
>>> have a different vscale, but that would be an addition to the current design
>>> instead of a change.
>>
>> No changes to IR, AFAIK. Just on how we use it.
>>
>>
>>> How do we spill/fill scalable registers on the stack?
>>> -----------------------------------------------------
>>>
>>> SVE registers have a (partially) unknown size at build time, and their
>>> associated fill/spill instructions require an offset that is implicitly scaled
>>> by the vector length rather than given in bytes or elements. To accommodate
>>> this, we created the concept of Stack Regions: areas on the stack associated
>>> with specific data types or register classes.
>>>
>>> MachineFrameInfo has been extended with a list of Stack Regions that each
>>> contain a list of associated objects. Each Stack Region controls its own
>>> allocation, deallocation and internal offset calculations. For SVE we create
>>> separate regions for data and predicate registers, so the exact layout does
>>> not need to be known ahead of time, just relative offsets within regions.
>>
>> Can this be mapped into IR address spaces?
>
> I don't think these need to map into separate address spaces; we haven't done
> any investigation along those lines as the default single address space works
> fine with this.
>
>>
>> If some target makes use of changing VL as you describe above, this
>> will break catastrophically, I imagine.
>
> Yes; there are a great many problems with changing VL at runtime, which is
> why we consider it a bug right now. I'm not aware of anyone actually asking
> for the functionality to be supported.
>
>>
>>
>>> Objects not belonging to a Stack Region use the default handling, so other
>>> existing targets are not affected.
>>>
>>> A more complete design for this will be detailed in a dedicated RFC later.
>>
>> I think this was contentious enough that we might need a peek
>> into that RFC.
>>
>> Targets that don't have SVE will not suffer from changes in
>> MachineFrameInfo, but AArch64 may, and I'm curious about how that's
>> solved.
>
> Noted. We'll try to schedule some time to write it up soonish.
>

