[llvm-dev] [EXT] Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Tue May 28 02:46:04 PDT 2019

Hi JinGu,

> Above vectorized loop does not need tail loop. I guess we could map the %mask.vec to predicate register as native register class on ISelLowering level. The conditional branch could also be mapped to 'whilexx' and 'b.xxx on MIR level. In order to get vector type, we could calculate cost model for target as llvm's vectorizers. If SVE focuses on loop vectorization mainly, I am not sure why the scalarable vector type is needed... From my personal opinion, the VLA programming model could add ambiquity and complexity to compiler because it is not concrete type at compile time... I am not expert for SVE and VLA. I could miss something important. If I missed something, please let me know.

SVE doesn't have a prescribed vector length -- the size of a vector register is hardware dependent, and a single compiled VLA loop will execute correctly on hardware with different vector lengths. So if you ran the same code on a cpu with 128b SVE vectors and another with 256b vectors, you have the potential to double the work done per cycle on the second cpu without changing the code (there's lots of factors that could prevent performance from scaling nicely, but that's not directly related to the vector length). SVE's current maximum defined size is 2048b, though I suspect it'll be a quite a while before we see vectors of that size in commodity hardware. Fujitsu's A64FX will use 512b vectors.

We used predication in the example to show a loop without a scalar tail, but it's not necessary to use the ISA in that manner.

The RISC-V V extension is similar, though has a few extra bits to worry about. You'll need to ask Robin Kruppe if you want more details on that.

As far as your question on memory layout is concerned, we don't expect many base 'load' or 'store' instructions to be used for scalable vector types, and for an extremely conservative approach you could consider such a memory operation to potentially alias all memory in a given address space. Instead, we expect to always use masked load and store intrinsics (or gather/scatter intrinsics), and we will need to improve parts of AA to take the masks into account.

-Graham

> 
> Thanks,
> JinGu Kang
> 
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of JinGu Kang via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 24 May 2019 21:47
> To: Joel Jones; Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker; Graham Hunter
> Cc: nd
> Subject: Re: [llvm-dev] [EXT] Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>  
> Hi Joel,
> 
> Thanks for your kind guide. 
> 
> > https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
> 
> I will have a look the post.
> 
> > One key restriction is that certain storage classes can’t contain scalable vector types, like statically allocated globals for example.
> 
> It was one of what I want to know.
> 
> Thanks,
> JinGu Kang
> From: Joel Jones <joelj at marvell.com>
> Sent: 24 May 2019 19:40
> To: JinGu Kang; Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker; Graham Hunter
> Cc: nd
> Subject: Re: [EXT] Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>  
> JinGu:
>  
> I’m not Graham, but you might find the following link a good starting point.
>  
> https://community.arm.com/developer/tools-software/hpc/b/hpc-blog/posts/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
>  
> The question you ask doesn’t have a short answer. The compiler and the instruction set design work together to allow programs to be compiled without knowing until run-time what the vector width is (within limits of min and max possible widths). One key restriction is that certain storage classes can’t contain scalable vector types, like statically allocated globals for example.
>  
> Joel Jones
>  
> From: JinGu Kang <jingu at codeplay.com>
> Date: Friday, May 24, 2019 at 11:28 AM
> To: Chris Lattner <clattner at nondot.org>, Hal Finkel <hfinkel at anl.gov>, "Jones, Joel" <Joel.Jones at cavium.com>, "dag at cray.com" <dag at cray.com>, Renato Golin <renato.golin at linaro.org>, Kristof Beyls <Kristof.Beyls at arm.com>, Amara Emerson <aemerson at apple.com>, Florian Hahn <Florian.Hahn at arm.com>, Sander De Smalen <Sander.DeSmalen at arm.com>, Robin Kruppe <robin.kruppe at gmail.com>, "llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, "mkuper at google.com" <mkuper at google.com>, Sjoerd Meijer <Sjoerd.Meijer at arm.com>, Sam Parker <Sam.Parker at arm.com>, Graham Hunter <Graham.Hunter at arm.com>
> Cc: nd <nd at arm.com>
> Subject: [EXT] Re: [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>  
> External Email
> Hi Graham,
>  
> I am working on a custom target and it is considering scalable vector type representation in programming language. While I am collecting the information about it, I have met your RFC. I have a question. I think the one of fundamental issues is that we do not know the memory layout of the type at compile time. I am not sure whether the RFC covers this issue or not. Conservatively, I imagined the memory layout of biggest type which the scalable vector type can support. I could miss some discussions about it. If I missed something, please let me know.
>  
> Thanks,
> JinGu Kang
>  
> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> on behalf of Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org>
> Sent: 05 June 2018 14:15
> To: Chris Lattner; Hal Finkel; Jones, Joel; dag at cray.com; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; llvm-dev at lists.llvm.org; mkuper at google.com; Sjoerd Meijer; Sam Parker
> Cc: nd
> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths
>  
> Hi,
> 
> Now that Sander has committed enough MC support for SVE, here's an updated
> RFC for variable length vector support with a set of 14 patches (listed at the end)
> to demonstrate code generation for SVE using the extensions proposed in the RFC.
> 
> I have some ideas about how to support RISC-V's upcoming extension alongside
> SVE; I'll send an email with some additional comments on Robin's RFC later.
> 
> Feedback and questions welcome.
> 
> -Graham
> 
> =============================================================
> Supporting SIMD instruction sets with variable vector lengths
> =============================================================
> 
> In this RFC we propose extending LLVM IR to support code-generation for variable
> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our
> approach is backwards compatible and should be as non-intrusive as possible; the
> only change needed in other backends is how size is queried on vector types, and
> it only requires a change in which function is called. We have created a set of
> proof-of-concept patches to represent a simple vectorized loop in IR and
> generate SVE instructions from that IR. These patches (listed in section 7 of
> this rfc) can be found on Phabricator and are intended to illustrate the scope
> of changes required by the general approach described in this RFC.
> 
> ==========
> Background
> ==========
> 
> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for
> AArch64 which is intended to scale with hardware such that the same binary
> running on a processor with longer vector registers can take advantage of the
> increased compute power without recompilation.
> 
> As the vector length is no longer a compile-time known value, the way in which
> the LLVM vectorizer generates code requires modifications such that certain
> values are now runtime evaluated expressions instead of compile-time constants.
> 
> Documentation for SVE can be found at
> https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a
> 
> ========
> Contents
> ========
> 
> The rest of this RFC covers the following topics:
> 
> 1. Types -- a proposal to extend VectorType to be able to represent vectors that
>    have a length which is a runtime-determined multiple of a known base length.
> 
> 2. Size Queries - how to reason about the size of types for which the size isn't
>    fully known at compile time.
> 
> 3. Representing the runtime multiple of vector length in IR for use in address
>    calculations and induction variable comparisons.
> 
> 4. Generating 'constant' values in IR for vectors with a runtime-determined
>    number of elements.
> 
> 5. A brief note on code generation of these new operations for AArch64.
> 
> 6. An example of C code and matching IR using the proposed extensions.
> 
> 7. A list of patches demonstrating the changes required to emit SVE instructions
>    for a loop that has already been vectorized using the extensions described
>    in this RFC.
> 
> ========
> 1. Types
> ========
> 
> To represent a vector of unknown length a boolean `Scalable` property has been
> added to the `VectorType` class, which indicates that the number of elements in
> the vector is a runtime-determined integer multiple of the `NumElements` field.
> Most code that deals with vectors doesn't need to know the exact length, but
> does need to know relative lengths -- e.g. get a vector with the same number of
> elements but a different element type, or with half or double the number of
> elements.
> 
> In order to allow code to transparently support scalable vectors, we introduce
> an `ElementCount` class with two members:
> 
> - `unsigned Min`: the minimum number of elements.
> - `bool Scalable`: is the element count an unknown multiple of `Min`?
> 
> For non-scalable vectors (``Scalable=false``) the scale is considered to be
> equal to one and thus `Min` represents the exact number of elements in the
> vector.
> 
> The intent for code working with vectors is to use convenience methods and avoid
> directly dealing with the number of elements. If needed, calling
> `getElementCount` on a vector type instead of `getVectorNumElements` can be used
> to obtain the (potentially scalable) number of elements. Overloaded division and
> multiplication operators allow an ElementCount instance to be used in much the
> same manner as an integer for most cases.
> 
> This mixture of compile-time and runtime quantities allow us to reason about the
> relationship between different scalable vector types without knowing their
> exact length.
> 
> The runtime multiple is not expected to change during program execution for SVE,
> but it is possible. The model of scalable vectors presented in this RFC assumes
> that the multiple will be constant within a function but not necessarily across
> functions. As suggested in the recent RISC-V rfc, a new function attribute to
> inherit the multiple across function calls will allow for function calls with
> vector arguments/return values and inlining/outlining optimizations.
> 
> IR Textual Form
> ---------------
> 
> The textual form for a scalable vector is:
> 
> ``<scalable <n> x <type>>``
> 
> where `type` is the scalar type of each element, `n` is the minimum number of
> elements, and the string literal `scalable` indicates that the total number of
> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice
> for indicating that the vector is scalable, and could be substituted by another.
> For fixed-length vectors, the `scalable` is omitted, so there is no change in
> the format for existing vectors.
> 
> Scalable vectors with the same `Min` value have the same number of elements, and
> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are
> used within the same function):
> 
> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of
>   elements.
> 
> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of
>   bytes.
> 
> IR Bitcode Form
> ---------------
> 
> To serialize scalable vectors to bitcode, a new boolean field is added to the
> type record. If the field is not present the type will default to a fixed-length
> vector type, preserving backwards compatibility.
> 
> Alternatives Considered
> -----------------------
> 
> We did consider one main alternative -- a dedicated target type, like the
> x86_mmx type.
> 
> A dedicated target type would either need to extend all existing passes that
> work with vectors to recognize the new type, or to duplicate all that code
> in order to get reasonable code generation and autovectorization.
> 
> This hasn't been done for the x86_mmx type, and so it is only capable of
> providing support for C-level intrinsics instead of being used and recognized by
> passes inside llvm.
> 
> Although our current solution will need to change some of the code that creates
> new VectorTypes, much of that code doesn't need to care about whether the types
> are scalable or not -- they can use preexisting methods like
> `getHalfElementsVectorType`. If the code is a little more complex,
> `ElementCount` structs can be used instead of an `unsigned` value to represent
> the number of elements.
> 
> ===============
> 2. Size Queries
> ===============
> 
> This is a proposal for how to deal with querying the size of scalable types.
> While it has not been implemented in full, the general approach works well
> for calculating offsets into structures with scalable types in a modified
> version of ComputeValueVTs in our downstream compiler.
> 
> Current IR types that have a known size all return a single integer constant.
> For scalable types a second integer is needed to indicate the number of bytes
> which need to be scaled by the runtime multiple to obtain the actual length.
> 
> For primitive types, getPrimitiveSizeInBits will function as it does today,
> except that it will no longer return a size for vector types (it will return 0,
> as it does for other derived types). The majority of calls to this function are
> already for scalar rather than vector types.
> 
> For derived types, a function (getSizeExpressionInBits) to return a pair of
> integers (one to indicate unscaled bits, the other for bits that need to be
> scaled by the runtime multiple) will be added. For backends that do not need to
> deal with scalable types, another function (getFixedSizeExpressionInBits) that
> only returns unscaled bits will be provided, with a debug assert that the type
> isn't scalable.
> 
> Similar functionality will be added to DataLayout.
> 
> Comparing two of these sizes together is straightforward if only unscaled sizes
> are used. Comparisons between scaled sizes is also simple when comparing sizes
> within a function (or across functions with the inherit flag mentioned in the
> changes to the type), but cannot be compared otherwise. If a mix is present,
> then any number of unscaled bits will not be considered to have a greater size
> than a smaller number of scaled bits, but a smaller number of unscaled bits
> will be considered to have a smaller size than a greater number of scaled bits
> (since the runtime multiple is at least one).
> 
> Future Work
> -----------
> 
> Since we cannot determine the exact size of a scalable vector, the
> existing logic for alias detection won't work when multiple accesses
> share a common base pointer with different offsets.
> 
> However, SVE's predication will mean that a dynamic 'safe' vector length
> can be determined at runtime, so after initial support has been added we
> can work on vectorizing loops using runtime predication to avoid aliasing
> problems.
> 
> Alternatives Considered
> -----------------------
> 
> Marking scalable vectors as unsized doesn't work well, as many parts of
> llvm dealing with loads and stores assert that 'isSized()' returns true
> and make use of the size when calculating offsets.
> 
> We have considered introducing multiple helper functions instead of
> using direct size queries, but that doesn't cover all cases. It may
> still be a good idea to introduce them to make the purpose in a given
> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.
> 
> ========================================
> 3. Representing Vector Length at Runtime
> ========================================
> 
> With a scalable vector type defined, we now need a way to represent the runtime
> length in IR in order to generate addresses for consecutive vectors in memory
> and determine how many elements have been processed in an iteration of a loop.
> 
> We have added an experimental `vscale` intrinsic to represent the runtime
> multiple. Multiplying the result of this intrinsic by the minimum number of
> elements in a vector gives the total number of elements in a scalable vector.
> 
> Fixed-Length Code
> -----------------
> 
> Assuming a vector type of <4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %index.next = add i64 %index, 4
>   ;; <check and branch>
> ``
> Scalable Equivalent
> -------------------
> 
> Assuming a vector type of <scalable 4 x <ty>>
> ``
> vector.body:
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   ;; <loop body>
>   ;; Increment induction var
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   ;; <check and branch>
> ``
> ===========================
> 4. Generating Vector Values
> ===========================
> For constant vector values, we cannot specify all the elements as we can for
> fixed-length vectors; fortunately only a small number of easily synthesized
> patterns are required for autovectorization. The `zeroinitializer` constant
> can be used in the same manner as fixed-length vectors for a constant zero
> splat. This can then be combined with `insertelement` and `shufflevector`
> to create arbitrary value splats in the same manner as fixed-length vectors.
> 
> For constants consisting of a sequence of values, an experimental `stepvector`
> intrinsic has been added to represent a simple constant of the form
> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new
> start can be added, and changing the step requires multiplying by a splat.
> 
> Fixed-Length Code
> -----------------
> ``
>   ;; Splat a value
>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
>   ;; Add a constant sequence
>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8>
> ``
> Scalable Equivalent
> -------------------
> ``
>   ;; Splat a value
>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0
>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Splat offset + stride (the same in this case)
>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0
>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   ;; Create sequence for scalable vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off
>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off
>   ;; Add the runtime-generated sequence
>   %add = add <scalable 4 x i32> %splat, %addoffset
> ``
> Future Work
> -----------
> 
> Intrinsics cannot currently be used for constant folding. Our downstream
> compiler (using Constants instead of intrinsics) relies quite heavily on this
> for good code generation, so we will need to find new ways to recognize and
> fold these values.
> 
> ==================
> 5. Code Generation
> ==================
> 
> IR splats will be converted to an experimental splatvector intrinsic in
> SelectionDAGBuilder.
> 
> All three intrinsics are custom lowered and legalized in the AArch64 backend.
> 
> Two new AArch64ISD nodes have been added to represent the same concepts
> at the SelectionDAG level, while splatvector maps onto the existing
> AArch64ISD::DUP.
> 
> GlobalISel
> ----------
> 
> Since GlobalISel was enabled by default on AArch64, it was necessary to add
> scalable vector support to the LowLevelType implementation. A single bit was
> added to the raw_data representation for vectors and vectors of pointers.
> 
> In addition, types that only exist in destination patterns are planted in
> the enumeration of available types for generated code. While this may not be
> necessary in future, generating an all-true 'ptrue' value was necessary to
> convert a predicated instruction into an unpredicated one.
> 
> ==========
> 6. Example
> ==========
> 
> The following example shows a simple C loop which assigns the array index to
> the array elements matching that index. The IR shows how vscale and stepvector
> are used to create the needed values and to advance the index variable in the
> loop.
> 
> C Code
> ------
> 
> ``
> void IdentityArrayInit(int *a, int count) {
>   for (int i = 0; i < count; ++i)
>     a[i] = i;
> }
> ``
> 
> Scalable IR Vector Body
> -----------------------
> 
> ``
> vector.body.preheader:
>   ;; Other setup
>   ;; Stepvector used to create initial identity vector
>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
>   br vector.body
> 
> vector.body
>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]
>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]
> 
>            ;; stepvector used for index identity on entry to loop body ;;
>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],
>                                      [ %stepvector, %vector.body.preheader ]
>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()
>   %vscale32 = trunc i64 %vscale64 to i32
>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)
> 
>            ;; vscale splat used to increment identity vector ;;
>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0
>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer
>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat
>   %2 = getelementptr inbounds i32, i32* %a, i64 %0
>   %3 = bitcast i32* %2 to <scalable 4 x i32>*
>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4
> 
>            ;; vscale used to increment loop index
>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)
>   %4 = icmp eq i64 %index.next, %n.vec
>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5
> ``
> 
> ==========
> 7. Patches
> ==========
> 
> List of patches:
> 
> 1. Extend VectorType: https://reviews.llvm.org/D32530
> 2. Vector element type Tablegen constraint: https://reviews.llvm.org/D47768
> 3. LLT support for scalable vectors: https://reviews.llvm.org/D47769
> 4. EVT strings and Type mapping: https://reviews.llvm.org/D47770
> 5. SVE Calling Convention: https://reviews.llvm.org/D47771
> 6. Intrinsic lowering cleanup: https://reviews.llvm.org/D47772
> 7. Add VScale intrinsic: https://reviews.llvm.org/D47773
> 8. Add StepVector intrinsic: https://reviews.llvm.org/D47774
> 9. Add SplatVector intrinsic: https://reviews.llvm.org/D47775
> 10. Initial store patterns: https://reviews.llvm.org/D47776
> 11. Initial addition patterns: https://reviews.llvm.org/D47777
> 12. Initial left-shift patterns: https://reviews.llvm.org/D47778
> 13. Implement copy logic for Z regs: https://reviews.llvm.org/D47779
> 14. Prevectorized loop unit test: https://reviews.llvm.org/D47780
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev