[llvm-dev] [RFC][SVE] Supporting Scalable Vector Architectures in LLVM IR (take 2)

Wed Jun 28 09:48:30 PDT 2017

I’m sorry for the delay Renato.  I’ll take a look and get back to you in the next couple of days

-Chris

> On Jun 26, 2017, at 7:01 AM, Renato Golin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Chris, Chandler, ping. Any comments?
> 
> On 7 June 2017 at 13:52, Graham Hunter <Graham.Hunter at arm.com> wrote:
>> Hi Renato,
>> 
>> Thanks for taking a look. Answers inline below, let me know if I've missed something out.
>> 
>> -Graham
>> 
>> 
>>> On 5 Jun 2017, at 17:55, Renato Golin <renato.golin at linaro.org> wrote:
>>> 
>>> Hi Graham,
>>> 
>>> Just making sure some people who voiced concerns are copied + cfe-dev.
>>> 
>>> On 1 June 2017 at 15:22, Graham Hunter via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> Hi,
>>>> 
>>>> Here's the updated RFC for representing scalable vector types and associated constants in IR. I added a section to address questions that came up on the recent patch review.
>>>> 
>>>> -Graham
>>>> 
>>>> ===================================================
>>>> Supporting Scalable Vector Architectures in LLVM IR
>>>> ===================================================
>>>> 
>>>> ==========
>>>> Background
>>>> ==========
>>>> 
>>>> *ARMv8-A Scalable Vector Extensions* (SVE) is a vector ISA extension for
>>>> AArch64, featuring lane predication, scatter-gather, and speculative loads. It
>>>> is intended to scale with hardware such that the same binary running on a
>>>> processor with longer vector registers can take advantage of the increased
>>>> compute power without recompilation.
>>>> 
>>>> As the vector length is no longer a statically known value, the way in which the
>>>> LLVM vectorizer generates code required modifications such that certain values
>>>> are now runtime evaluated expressions. This document describes changes proposed
>>>> to LLVM that allow its vectorizer to better target SVE.
>>>> 
>>>> Documentation for SVE can be found at
>>>> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
>>>> 
>>>> =====
>>>> Types
>>>> =====
>>>> 
>>>> To represent a vector of unknown length a boolean `Scalable` property has been
>>>> added to the `VectorType` class. Most code that deals with vectors doesn't need
>>>> to know the exact length, but does need to know relative lengths -- e.g. get
>>>> a vector with the same number of elements but a different element type, or with
>>>> half or double the number of elements.
>>>> 
>>>> In order to allow code to transparently support scalable vectors, we introduce
>>>> an `ElementCount` class with two members:
>>>> 
>>>> - `unsigned Min`: the minimum number of elements.
>>>> - `bool Scalable`: is the element count an unknown multiple of `Min`?
>>>> 
>>>> For non-scalable vectors (``Scalable=false``) the scale is considered to be
>>>> equal to one and thus `Min` represents the exact number of elements in the
>>>> vector.
>>>> 
>>>> The intent for code working with vectors is to use convenience methods and avoid
>>>> directly dealing with the number of elements. If needed, calling
>>>> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
>>>> to obtain the (potentially scalable) number of elements. Overloaded division and
>>>> multiplication operators allow an ElementCount instance to be used in much the
>>>> same manner as an integer for most cases.
>>> 
>>> Does that mean that it is guaranteed that a division / multiplication
>>> will only touch the minimum number of elements?
>>> 
>>> The cases I imagine could happen:
>>> 
>>> 1. <n x 4 x i32> / 2 = <n x 2 x i32>: This is half of the actual vector length.
>>> - does the predicate vector limit the end tail (half-scale)?
>>> - does it just pack more into the same vector (and update the
>>> predicate computation)?
>>> - because it would have half the bytes, I assume the former, but in
>>> theory, we could probably merge and get twice the performance, no?
>>> - or would this insert undefs in between? Say, a 2-way reduction:
>>>   - <v1, v2, v3, v4, v5, v6, v7, v8> -> <v1+v2, v3+v4, v5+v6, v7+v8>
>>>   - In SIMD, this would use a smaller/less registers, but SVE is <n
>>> x 128> right?
>>>   - Do you pack the next vector into the previous <1+2, 3+4, 5+6,
>>> 7+8, 9+10, ...>?
>>>   - Or do you insert undefs?
>>>     - <1+2, 3+4, 5+6, 7+8, undef, undef, undef, undef>
>>>     - <1+2, undef, 3+4, undef, 5+6, undef, 7+8, undef>
>>> - does it matter? AFAIK, SVE only has reductions to scalar, but this is IR.
>>> - can it be target-specific? If not, should it be made illegal?
>>> 
>>> 2. <n x 4 x i32> * 2 = <n x 8 x i32>: This is twice of the actual vector length.
>>> - Would this be resolved by the target? If so, how?
>>> - Non-scalable vectors are just done one after the other
>>> - But scalable vectors have no known end-tail:
>>>   - Creating two <n x 4 x i32> may interpolate the original data, either:
>>>     - <v1, v2, v3, v4> <v5, v6, v7, v8>, ...
>>>     - <vn+1, vn+2, vn+3, vn+4> <vn+5, vn+6, vn+7, vn+8>...
>>>  - or:
>>>     - <v1, v2, v3, v4> <vn+1, vn+2, vn+3, vn+4> ...
>>>     - <v5, v6, v7, v8> <vn+5, vn+6, vn+7, vn+8>...
>>> - the first would makes more sense, but I'm not sure the scalar
>>> evolution would work that way out of the box
>>> 
>>> 3. Widening / Narrowing should use the existing syntax, I think.
>> 
>> Yes, it will only operate on the correct number of elements.
>> 
>> So SelectionDAG legalization of target-unsupported IR types happens in
>> exactly the same way it does for fixed-length types, e.g.
>> * The <n x 8 x i32> above would be split into two <n x 4 x i32> values.
>> * The <n x 2 x i32> above would be extended to an <n x 2 x i64> value.
>> 
>> When splitting, the linear order of elements would be preserved, so given a
>> vector like <1, 2, 3, ... 2n> you would end up with
>> <1, 2, ... n> and <n+1, n+2, ... 2n>.
>> If you need a different arrangement, the shufflevector instruction can be used
>> with appropriate use of stepvector and vscale to generate masks.
>> 
>> The IR predicates for a given scalable vector (for select) just match the
>> number of elements, e.g. <n x 8 x i1> for the <n x 8 x i32>, and would be
>> split along with it during legalization.
>> 
>> As you say, the main legal types for SVE are based around multiples of
>> 128b base types, but for some types we've had to use predication to help.
>> An <n x 2 x f32> can't be extended to use f64s, but when generating code
>> we can create a predicate for 64b element types and use that with 32b float
>> instructions, so that the vector interleaves f32 values with 32bit undefs
>> and gives you an effective <n x 2 x f32> vector.
>> 
>> For horizontal pair operations, you would use shuffle instructions to align
>> the operands in matching lanes for two vectors and then perform the operation.
>>> 
>>> 
>>>> 
>>>> This mixture of static and runtime quantities allow us to reason about the
>>>> relationship between different scalable vector types without knowing their
>>>> exact length.
>>>> 
>>>> IR Textual Form
>>>> ---------------
>>>> 
>>>> The textual form for a scalable vector is:
>>>> 
>>>> ``<[n x ]<m> x <type>>``
>>>> 
>>>> where `type` is the scalar type of each element, `m` is the minimum number of
>>>> elements, and the string literal `n x` indicates that the total number of
>>>> elements is an unknown multiple of `m`; `n` is just an arbitrary choice for
>>>> indicating that the vector is scalable, and could be substituted by another.
>>>> For fixed-length vectors, the `n x` is omitted, so there is no change in the
>>>> format for existing vectors.
>>>> 
>>>> Scalable vectors with the same `Min` value have the same number of elements, and
>>>> the same number of bytes if `Min * sizeof(type)` is the same:
>>>> 
>>>> ``<n x 4 x i32>`` and ``<n x 4 x i8>`` have the same number of elements.
>>>> 
>>>> ``<n x 4 x i32>`` and ``<n x 8 x i16>`` have the same number of bytes.
>>>> 
>>>> IR Bitcode Form
>>>> ---------------
>>>> 
>>>> To serialize scalable vectors to bitcode, a new boolean field is added to the
>>>> type record. If the field is not present the type will default to a fixed-length
>>>> vector type, preserving backwards compatibility.
>>>> 
>>>> Alternatives Considered
>>>> -----------------------
>>>> 
>>>> We had two alternatives in mind -- a dedicated target specific type, and a
>>>> subclass inheriting from VectorType.
>>>> 
>>>> A dedicated target type would either need to extend all existing passes that
>>>> work with vectors to recognize the new type, or to duplicate all that code
>>>> in order to get reasonable code generation.
>>>> 
>>>> A subclass would take care of part of this, but would need still checks for
>>>> whether a VectorType was scalable in each place a new VectorType was required.
>>>> 
>>>> Although our current solution will need to change some of the code that creates
>>>> new VectorTypes, much of that code doesn't need to care about whether the types
>>>> are scalable or not -- they can use preexisting methods like
>>>> `getHalfElementsVectorType`. If the code is a little more complex,
>>>> `ElementCount` structs can be used instead of an `unsigned` value to represent
>>>> the number of elements.
>>>> 
>>>> =================
>>>> Runtime Constants
>>>> =================
>>>> 
>>>> With a scalable vector type defined, we now need a way to generate addresses for
>>>> consecutive vector values in memory and to be able to create basic constant
>>>> vector values.
>>>> 
>>>> For address generation, the `vscale` constant is added to represent the runtime
>>>> value of `n` in `<n x m x type>`. Multiplying `vscale` by `m` and the number of
>>>> bytes in `type` gives the total length of a scalable vector, and the backend
>>>> can pattern match to the various load and store instructions in SVE that
>>>> automatically scale with vector length.
>>> 
>>> How would this work in scalar evolution?
>>> 
>>> In a <4 x i32> loop, I know the block is 128-bits wide, so I can
>>> predict loop carried dependencies, static ranges, etc.
>>> 
>>> IIUC, dependencies can still be avoided by predicating access to N-1
>>> elements, but would short static ranges will have to be kept as a
>>> loop, because unrolling is not beneficial if the run-time length is
>>> larger than the unroll factor.
>>> 
>>> Actually, I can't think how you could possibly unroll anything with
>>> this. I imagine two run-time checks + two SVE operations (+ predicate
>>> fiddling) would be worse than a short sequence of 4 independent NEON
>>> instructions, if the unroll factor is slightly larger than one
>>> run-time vector.
>>> 
>>> This example is particular to SVE and pathological cases, but I worry
>>> that there may be lots of cases where the new notation will make it
>>> difficult to decide. Still, only a problem to targets that do ask for
>>> SVE, so people are aware of the trade-offs.
>> 
>> There's many more cases for SVE where vectors 'may' overlap, since we don't
>> have an exact size; for now, comparing size expressions featuring vscale
>> will be overly cautious and generate very large ranges. This will cause
>> extra checks to be planted for vectorized code that may not be needed,
>> but will get us running vectorized code safely.
>> 
>> Eventually, we'd like to upstream our work on controlling loops via
>> predication, but that is a separate discussion (and one that will affect
>> more people, since there's at least AVX-512 with a similar feature). For
>> now we'd just like to get basic SVE support in.
>> 
>> Unrolling and vectorizing does indeed have higher overhead when using VLA
>> programming, so the cost model may need to be adjusted to account for that.
>> We made sure it works for our downstream compiler but haven't used it for
>> performance modelling.
>> 
>> 
>>> 
>>> 
>>> 
>>>> For constant vector values, we cannot specify all the elements as we can for
>>>> fixed-length vectors; fortunately only a small number of easily synthesized
>>>> patterns are required for autovectorization. The `zeroinitializer` constant
>>>> can be used as with fixed-length vectors for a constant zero splat. This can
>>>> then be combined with `insertelement` and `shufflevector` to create arbitrary
>>>> value splats in the same manner as fixed-length vectors.
>>>> 
>>>> For constants consisting of a sequence of values, the `stepvector` constant is
>>>> added to represent a simple constant of the form `<0, 1, 2... num_elems-1>`. To
>>>> change the starting value a splat of the new start can be added, and changing
>>>> the step requires multiplying by a splat.
>>>> 
>>>> Alternatives Considered
>>>> -----------------------
>>>> 
>>>> We have chosen to represent these as constants to make pattern matching and
>>>> constant folding easy, particularly for the mask argument of the
>>>> `shufflevector` instruction.
>>>> 
>>>> Another possibility is to use intrinsics similar to the old instructions, but
>>>> currently these cannot be treated as constants. We wouldn't mind using
>>>> intrinsics here, but would want to discuss how best to support constant folding
>>>> and pattern matching without needing to add a lot more code.
>>>> 
>>>> =======
>>>> Example
>>>> =======
>>>> 
>>>> The following example shows a simple C loop which assigns the array index to
>>>> the array elements matching that index. The IR shows how vscale and stepvector
>>>> are used to create the needed values and to advance the index variable in the
>>>> loop.
>>>> 
>>>> C Code
>>>> ------
>>>> 
>>>> ``
>>>> void IdentityArrayInit(int *a, int count) {
>>>> for (int i = 0; i < count; ++i)
>>>>   a[i] = i;
>>>> }
>>>> ``
>>>> 
>>>> Scalable IR Vector Body
>>>> -----------------------
>>>> 
>>>> ``
>>>> vector.body:                                      ; preds = %vector.body.preheader, %vector.body
>>>> %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>>>> %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
>>>> 
>>>>          ;; stepvector used for index identity on entry to loop body ;;
>>>> %vec.ind7 = phi <n x 4 x i32> [ %step.add8, %vector.body ],
>>>>                               [ stepvector, %vector.body.preheader ]
>>>> %1 = add i64 %0, mul (i64 vscale, i64 4)
>>>> 
>>>>          ;; vscale splat used to increment identity vector ;;
>>>> %step.add8 = add <n x 4 x i32> %vec.ind7, shufflevector (<n x 4 x i32> insertelement (<n x 4 x i32> undef,
>>>>                                                                                       i32 mul (i32 vscale, i32 4), i32 0),
>>>>                                                                                       <n x 4 x i32> undef,
>>>>                                                                                       <n x 4 x i32> zeroinitializer)
>>> 
>>> This syntax really gets me. Compare this with:
>>> 
>>> %a = add <4 x i32> %b, <i32 4, i32 4, i32 4, i32 4>
>>> 
>>> It's going to be hard to infer anything from that syntax, which means
>>> the resulting loop will be, for most purposes, as opaque as having an
>>> intrinsic in there.
>> 
>> It looks nice for fixed-length when the element values are immediate constants.
>> If instead of '4' it was a value from an argument or loaded from memory, then it
>> would look similar -- that's how splats are represented in IR, and using an
>> intrinsic wouldn't change that. Existing optimizations will recognize the
>> splat in this form.
>> 
>>> 
>>> 
>>>> %2 = getelementptr inbounds i32, i32* %a, i64 %0
>>>> %3 = bitcast i32* %2 to <n x 4 x i32>*
>>>> store <n x 4 x i32> %vec.ind7, <n x 4 x i32>* %3, align 4, !tbaa !1
>>>> 
>>>>          ;; vscale used to increment loop index
>>>> %index.next = add i64 %index, mul (i64 vscale, i64 4)
>>>> %4 = icmp eq i64 %index.next, %n.vec
>>>> br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
>>>> ``
>>>> 
>>>> =====================
>>>> Questions and Answers
>>>> =====================
>>>> 
>>>> Can the vector length change at runtime?
>>>> ----------------------------------------
>>>> 
>>>> It is possible to change vector length at runtime, but there is no model defined
>>>> for how to resolve all the issues that arise when doing so. From the compiler's
>>>> point of view, it is expected that vector length can be treated as constant but
>>>> unknown.
>>>> 
>>>> Since there's currently no way to indicate to the compiler where a change might
>>>> occur, a programmer deliberately requesting changes to the vector length in the
>>>> middle of a function containing autovectorized code would be considered a bug.
>>> 
>>> User error. Yes.
>>> 
>>> 
>>>> If changing the VL at runtime becomes desireable at some point in the future,
>>>> the types and constants presented in this design would not need to change; we
>>>> would need some way of creating a barrier between sections of IR which might
>>>> have a different vscale, but that would be an addition to the current design
>>>> instead of a change.
>>> 
>>> No changes to IR, AFAIK. Just on how we use it.
>>> 
>>> 
>>>> How do we spill/fill scalable registers on the stack?
>>>> -----------------------------------------------------
>>>> 
>>>> SVE registers have a (partially) unknown size at build time and their associated
>>>> fill/spill instructions require an offset that is implicitly scaled by the
>>>> vector length instead of bytes or element size. To accommodate this we
>>>> created the concept of Stack Regions that are areas on the stack associated
>>>> with specific data types or register classes.
>>>> 
>>>> MachineFrameInfo has been extended with a list of Stack Regions that each
>>>> contain a list of associated objects. Each Stack Region controls its own
>>>> allocation, deallocation and internal offset calculations. For SVE we create
>>>> separate regions for data and predicate registers, so the exact layout does
>>>> not need to be known ahead of time, just relative offsets within regions.
>>> 
>>> Can this be mapped into IR address spaces?
>> 
>> I don't think these need to map into separate address spaces; we haven't done
>> any investigation along those lines as the default single address space works
>> fine with this.
>> 
>>> 
>>> If some target make use of changing VL as you describe above, this
>>> will break catastrophically, I imagine.
>> 
>> Yes; there's a great many problems with changing VL at runtime, which is
>> why we consider it a bug right now. I'm not aware of anyone actually asking
>> for the functionality to be supported.
>> 
>>> 
>>> 
>>>> Objects not belonging to a Stack Region use the default handling, so other
>>>> existing targets are not affected.
>>>> 
>>>> A more complete design for this will be detailed in a dedicated RFC later.
>>> 
>>> I think this was a contentious enough point that might need a peek
>>> into that RFC.
>>> 
>>> Targets that don't have SVE will not suffer from changes in
>>> MachineFrameInfo, but AArch64 may, and I'm curious on how that's
>>> solved.
>> 
>> Noted. We'll try and schedule some time to write it up soonish.
>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev