<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>

</head>

<body dir="ltr">

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Hi Bruce,<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

Thanks for your comment.<br>

</div>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">

<br>

> Generic code can enquire the size, dynamically allocate space, and<br>

transparently save and restore the contents of a vector register or<br>

<div>registers.</div>

<div><br>

</div>

<div>I am not sure what the Generic code is. It seems it uses similar way as the implementation of variable length array. If so, I think it could be one way to support it.</div>

<div><br>

</div>

<div>Thanks,</div>

<div>JinGu Kang</div>

<br>

</div>

<div>

<hr tabindex="-1" style="display:inline-block; width:98%">

<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Bruce Hoult <brucehoult@sifive.com><br>

<b>Sent:</b> 24 May 2019 20:12<br>

<b>To:</b> JinGu Kang<br>

<b>Cc:</b> Chris Lattner; Hal Finkel; Jones, Joel; dag@cray.com; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; llvm-dev@lists.llvm.org; mkuper@google.com; Sjoerd Meijer; Sam Parker; Graham Hunter; nd<br>

<b>Subject:</b> Re: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths</font>

<div> </div>

</div>

<div class="BodyFragment"><font size="2"><span style="font-size:11pt">

<div class="PlainText">In the RISC-V V extension, there is no upper limit to the size vector<br>

registers can be in a future CPU. (Formally, the upper limit is at<br>

least 2^31 bytes)<br>

<br>

Generic code can enquire the size, dynamically allocate space, and<br>

transparently save and restore the contents of a vector register or<br>

registers.<br>

<br>

On Fri, May 24, 2019 at 11:28 AM JinGu Kang via llvm-dev<br>

<llvm-dev@lists.llvm.org> wrote:<br>

><br>

> Hi Graham,<br>

><br>

> I am working on a custom target and it is considering scalable vector type representation in programming language. While I am collecting the information about it, I have met your RFC. I have a question. I think the one of fundamental issues is that we do

 not know the memory layout of the type at compile time. I am not sure whether the RFC covers this issue or not. Conservatively, I imagined the memory layout of biggest type which the scalable vector type can support. I could miss some discussions about it.

 If I missed something, please let me know.<br>

><br>

> Thanks,<br>

> JinGu Kang<br>

><br>

> ________________________________<br>

> From: llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of Graham Hunter via llvm-dev <llvm-dev@lists.llvm.org><br>

> Sent: 05 June 2018 14:15<br>

> To: Chris Lattner; Hal Finkel; Jones, Joel; dag@cray.com; Renato Golin; Kristof Beyls; Amara Emerson; Florian Hahn; Sander De Smalen; Robin Kruppe; llvm-dev@lists.llvm.org; mkuper@google.com; Sjoerd Meijer; Sam Parker<br>

> Cc: nd<br>

> Subject: [llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths<br>

><br>

> Hi,<br>

><br>

> Now that Sander has committed enough MC support for SVE, here's an updated<br>

> RFC for variable length vector support with a set of 14 patches (listed at the end)<br>

> to demonstrate code generation for SVE using the extensions proposed in the RFC.<br>

><br>

> I have some ideas about how to support RISC-V's upcoming extension alongside<br>

> SVE; I'll send an email with some additional comments on Robin's RFC later.<br>

><br>

> Feedback and questions welcome.<br>

><br>

> -Graham<br>

><br>

> =============================================================<br>

> Supporting SIMD instruction sets with variable vector lengths<br>

> =============================================================<br>

><br>

> In this RFC we propose extending LLVM IR to support code-generation for variable<br>

> length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our<br>

> approach is backwards compatible and should be as non-intrusive as possible; the<br>

> only change needed in other backends is how size is queried on vector types, and<br>

> it only requires a change in which function is called. We have created a set of<br>

> proof-of-concept patches to represent a simple vectorized loop in IR and<br>

> generate SVE instructions from that IR. These patches (listed in section 7 of<br>

> this rfc) can be found on Phabricator and are intended to illustrate the scope<br>

> of changes required by the general approach described in this RFC.<br>

><br>

> ==========<br>

> Background<br>

> ==========<br>

><br>

> *ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for<br>

> AArch64 which is intended to scale with hardware such that the same binary<br>

> running on a processor with longer vector registers can take advantage of the<br>

> increased compute power without recompilation.<br>

><br>

> As the vector length is no longer a compile-time known value, the way in which<br>

> the LLVM vectorizer generates code requires modifications such that certain<br>

> values are now runtime evaluated expressions instead of compile-time constants.<br>

><br>

> Documentation for SVE can be found at<br>

> <a href="https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a">

https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a</a><br>

><br>

> ========<br>

> Contents<br>

> ========<br>

><br>

> The rest of this RFC covers the following topics:<br>

><br>

> 1. Types -- a proposal to extend VectorType to be able to represent vectors that<br>

>    have a length which is a runtime-determined multiple of a known base length.<br>

><br>

> 2. Size Queries - how to reason about the size of types for which the size isn't<br>

>    fully known at compile time.<br>

><br>

> 3. Representing the runtime multiple of vector length in IR for use in address<br>

>    calculations and induction variable comparisons.<br>

><br>

> 4. Generating 'constant' values in IR for vectors with a runtime-determined<br>

>    number of elements.<br>

><br>

> 5. A brief note on code generation of these new operations for AArch64.<br>

><br>

> 6. An example of C code and matching IR using the proposed extensions.<br>

><br>

> 7. A list of patches demonstrating the changes required to emit SVE instructions<br>

>    for a loop that has already been vectorized using the extensions described<br>

>    in this RFC.<br>

><br>

> ========<br>

> 1. Types<br>

> ========<br>

><br>

> To represent a vector of unknown length a boolean `Scalable` property has been<br>

> added to the `VectorType` class, which indicates that the number of elements in<br>

> the vector is a runtime-determined integer multiple of the `NumElements` field.<br>

> Most code that deals with vectors doesn't need to know the exact length, but<br>

> does need to know relative lengths -- e.g. get a vector with the same number of<br>

> elements but a different element type, or with half or double the number of<br>

> elements.<br>

><br>

> In order to allow code to transparently support scalable vectors, we introduce<br>

> an `ElementCount` class with two members:<br>

><br>

> - `unsigned Min`: the minimum number of elements.<br>

> - `bool Scalable`: is the element count an unknown multiple of `Min`?<br>

><br>

> For non-scalable vectors (``Scalable=false``) the scale is considered to be<br>

> equal to one and thus `Min` represents the exact number of elements in the<br>

> vector.<br>

><br>

> The intent for code working with vectors is to use convenience methods and avoid<br>

> directly dealing with the number of elements. If needed, calling<br>

> `getElementCount` on a vector type instead of `getVectorNumElements` can be used<br>

> to obtain the (potentially scalable) number of elements. Overloaded division and<br>

> multiplication operators allow an ElementCount instance to be used in much the<br>

> same manner as an integer for most cases.<br>

><br>

> This mixture of compile-time and runtime quantities allow us to reason about the<br>

> relationship between different scalable vector types without knowing their<br>

> exact length.<br>

><br>

> The runtime multiple is not expected to change during program execution for SVE,<br>

> but it is possible. The model of scalable vectors presented in this RFC assumes<br>

> that the multiple will be constant within a function but not necessarily across<br>

> functions. As suggested in the recent RISC-V rfc, a new function attribute to<br>

> inherit the multiple across function calls will allow for function calls with<br>

> vector arguments/return values and inlining/outlining optimizations.<br>

><br>

> IR Textual Form<br>

> ---------------<br>

><br>

> The textual form for a scalable vector is:<br>

><br>

> ``<scalable <n> x <type>>``<br>

><br>

> where `type` is the scalar type of each element, `n` is the minimum number of<br>

> elements, and the string literal `scalable` indicates that the total number of<br>

> elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice<br>

> for indicating that the vector is scalable, and could be substituted by another.<br>

> For fixed-length vectors, the `scalable` is omitted, so there is no change in<br>

> the format for existing vectors.<br>

><br>

> Scalable vectors with the same `Min` value have the same number of elements, and<br>

> the same number of bytes if `Min * sizeof(type)` is the same (assuming they are<br>

> used within the same function):<br>

><br>

> ``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of<br>

>   elements.<br>

><br>

> ``<scalable x 4 x i32>`` and ``<scalable x 8 x i16>`` have the same number of<br>

>   bytes.<br>

><br>

> IR Bitcode Form<br>

> ---------------<br>

><br>

> To serialize scalable vectors to bitcode, a new boolean field is added to the<br>

> type record. If the field is not present the type will default to a fixed-length<br>

> vector type, preserving backwards compatibility.<br>

><br>

> Alternatives Considered<br>

> -----------------------<br>

><br>

> We did consider one main alternative -- a dedicated target type, like the<br>

> x86_mmx type.<br>

><br>

> A dedicated target type would either need to extend all existing passes that<br>

> work with vectors to recognize the new type, or to duplicate all that code<br>

> in order to get reasonable code generation and autovectorization.<br>

><br>

> This hasn't been done for the x86_mmx type, and so it is only capable of<br>

> providing support for C-level intrinsics instead of being used and recognized by<br>

> passes inside llvm.<br>

><br>

> Although our current solution will need to change some of the code that creates<br>

> new VectorTypes, much of that code doesn't need to care about whether the types<br>

> are scalable or not -- they can use preexisting methods like<br>

> `getHalfElementsVectorType`. If the code is a little more complex,<br>

> `ElementCount` structs can be used instead of an `unsigned` value to represent<br>

> the number of elements.<br>

><br>

> ===============<br>

> 2. Size Queries<br>

> ===============<br>

><br>

> This is a proposal for how to deal with querying the size of scalable types.<br>

> While it has not been implemented in full, the general approach works well<br>

> for calculating offsets into structures with scalable types in a modified<br>

> version of ComputeValueVTs in our downstream compiler.<br>

><br>

> Current IR types that have a known size all return a single integer constant.<br>

> For scalable types a second integer is needed to indicate the number of bytes<br>

> which need to be scaled by the runtime multiple to obtain the actual length.<br>

><br>

> For primitive types, getPrimitiveSizeInBits will function as it does today,<br>

> except that it will no longer return a size for vector types (it will return 0,<br>

> as it does for other derived types). The majority of calls to this function are<br>

> already for scalar rather than vector types.<br>

><br>

> For derived types, a function (getSizeExpressionInBits) to return a pair of<br>

> integers (one to indicate unscaled bits, the other for bits that need to be<br>

> scaled by the runtime multiple) will be added. For backends that do not need to<br>

> deal with scalable types, another function (getFixedSizeExpressionInBits) that<br>

> only returns unscaled bits will be provided, with a debug assert that the type<br>

> isn't scalable.<br>

><br>

> Similar functionality will be added to DataLayout.<br>

><br>

> Comparing two of these sizes together is straightforward if only unscaled sizes<br>

> are used. Comparisons between scaled sizes is also simple when comparing sizes<br>

> within a function (or across functions with the inherit flag mentioned in the<br>

> changes to the type), but cannot be compared otherwise. If a mix is present,<br>

> then any number of unscaled bits will not be considered to have a greater size<br>

> than a smaller number of scaled bits, but a smaller number of unscaled bits<br>

> will be considered to have a smaller size than a greater number of scaled bits<br>

> (since the runtime multiple is at least one).<br>

><br>

> Future Work<br>

> -----------<br>

><br>

> Since we cannot determine the exact size of a scalable vector, the<br>

> existing logic for alias detection won't work when multiple accesses<br>

> share a common base pointer with different offsets.<br>

><br>

> However, SVE's predication will mean that a dynamic 'safe' vector length<br>

> can be determined at runtime, so after initial support has been added we<br>

> can work on vectorizing loops using runtime predication to avoid aliasing<br>

> problems.<br>

><br>

> Alternatives Considered<br>

> -----------------------<br>

><br>

> Marking scalable vectors as unsized doesn't work well, as many parts of<br>

> llvm dealing with loads and stores assert that 'isSized()' returns true<br>

> and make use of the size when calculating offsets.<br>

><br>

> We have considered introducing multiple helper functions instead of<br>

> using direct size queries, but that doesn't cover all cases. It may<br>

> still be a good idea to introduce them to make the purpose in a given<br>

> case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'.<br>

><br>

> ========================================<br>

> 3. Representing Vector Length at Runtime<br>

> ========================================<br>

><br>

> With a scalable vector type defined, we now need a way to represent the runtime<br>

> length in IR in order to generate addresses for consecutive vectors in memory<br>

> and determine how many elements have been processed in an iteration of a loop.<br>

><br>

> We have added an experimental `vscale` intrinsic to represent the runtime<br>

> multiple. Multiplying the result of this intrinsic by the minimum number of<br>

> elements in a vector gives the total number of elements in a scalable vector.<br>

><br>

> Fixed-Length Code<br>

> -----------------<br>

><br>

> Assuming a vector type of <4 x <ty>><br>

> ``<br>

> vector.body:<br>

>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br>

>   ;; <loop body><br>

>   ;; Increment induction var<br>

>   %index.next = add i64 %index, 4<br>

>   ;; <check and branch><br>

> ``<br>

> Scalable Equivalent<br>

> -------------------<br>

><br>

> Assuming a vector type of <scalable 4 x <ty>><br>

> ``<br>

> vector.body:<br>

>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br>

>   ;; <loop body><br>

>   ;; Increment induction var<br>

>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()<br>

>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)<br>

>   ;; <check and branch><br>

> ``<br>

> ===========================<br>

> 4. Generating Vector Values<br>

> ===========================<br>

> For constant vector values, we cannot specify all the elements as we can for<br>

> fixed-length vectors; fortunately only a small number of easily synthesized<br>

> patterns are required for autovectorization. The `zeroinitializer` constant<br>

> can be used in the same manner as fixed-length vectors for a constant zero<br>

> splat. This can then be combined with `insertelement` and `shufflevector`<br>

> to create arbitrary value splats in the same manner as fixed-length vectors.<br>

><br>

> For constants consisting of a sequence of values, an experimental `stepvector`<br>

> intrinsic has been added to represent a simple constant of the form<br>

> `<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new<br>

> start can be added, and changing the step requires multiplying by a splat.<br>

><br>

> Fixed-Length Code<br>

> -----------------<br>

> ``<br>

>   ;; Splat a value<br>

>   %insert = insertelement <4 x i32> undef, i32 %value, i32 0<br>

>   %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer<br>

>   ;; Add a constant sequence<br>

>   %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8><br>

> ``<br>

> Scalable Equivalent<br>

> -------------------<br>

> ``<br>

>   ;; Splat a value<br>

>   %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0<br>

>   %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer<br>

>   ;; Splat offset + stride (the same in this case)<br>

>   %insert2 = insertelement <scalable 4 x i32> under, i32 2, i32 0<br>

>   %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer<br>

>   ;; Create sequence for scalable vector<br>

>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()<br>

>   %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off<br>

>   %addoffset = add <scalable 4 x i32> %mulbystride, %str_off<br>

>   ;; Add the runtime-generated sequence<br>

>   %add = add <scalable 4 x i32> %splat, %addoffset<br>

> ``<br>

> Future Work<br>

> -----------<br>

><br>

> Intrinsics cannot currently be used for constant folding. Our downstream<br>

> compiler (using Constants instead of intrinsics) relies quite heavily on this<br>

> for good code generation, so we will need to find new ways to recognize and<br>

> fold these values.<br>

><br>

> ==================<br>

> 5. Code Generation<br>

> ==================<br>

><br>

> IR splats will be converted to an experimental splatvector intrinsic in<br>

> SelectionDAGBuilder.<br>

><br>

> All three intrinsics are custom lowered and legalized in the AArch64 backend.<br>

><br>

> Two new AArch64ISD nodes have been added to represent the same concepts<br>

> at the SelectionDAG level, while splatvector maps onto the existing<br>

> AArch64ISD::DUP.<br>

><br>

> GlobalISel<br>

> ----------<br>

><br>

> Since GlobalISel was enabled by default on AArch64, it was necessary to add<br>

> scalable vector support to the LowLevelType implementation. A single bit was<br>

> added to the raw_data representation for vectors and vectors of pointers.<br>

><br>

> In addition, types that only exist in destination patterns are planted in<br>

> the enumeration of available types for generated code. While this may not be<br>

> necessary in future, generating an all-true 'ptrue' value was necessary to<br>

> convert a predicated instruction into an unpredicated one.<br>

><br>

> ==========<br>

> 6. Example<br>

> ==========<br>

><br>

> The following example shows a simple C loop which assigns the array index to<br>

> the array elements matching that index. The IR shows how vscale and stepvector<br>

> are used to create the needed values and to advance the index variable in the<br>

> loop.<br>

><br>

> C Code<br>

> ------<br>

><br>

> ``<br>

> void IdentityArrayInit(int *a, int count) {<br>

>   for (int i = 0; i < count; ++i)<br>

>     a[i] = i;<br>

> }<br>

> ``<br>

><br>

> Scalable IR Vector Body<br>

> -----------------------<br>

><br>

> ``<br>

> vector.body.preheader:<br>

>   ;; Other setup<br>

>   ;; Stepvector used to create initial identity vector<br>

>   %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()<br>

>   br vector.body<br>

><br>

> vector.body<br>

>   %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br>

>   %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]<br>

><br>

>            ;; stepvector used for index identity on entry to loop body ;;<br>

>   %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],<br>

>                                      [ %stepvector, %vector.body.preheader ]<br>

>   %vscale64 = call i64 @llvm.experimental.vector.vscale.64()<br>

>   %vscale32 = trunc i64 %vscale64 to i32<br>

>   %1 = add i64 %0, mul (i64 %vscale64, i64 4)<br>

><br>

>            ;; vscale splat used to increment identity vector ;;<br>

>   %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0<br>

>   %splat shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer<br>

>   %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat<br>

>   %2 = getelementptr inbounds i32, i32* %a, i64 %0<br>

>   %3 = bitcast i32* %2 to <scalable 4 x i32>*<br>

>   store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4<br>

><br>

>            ;; vscale used to increment loop index<br>

>   %index.next = add i64 %index, mul (i64 %vscale64, i64 4)<br>

>   %4 = icmp eq i64 %index.next, %n.vec<br>

>   br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5<br>

> ``<br>

><br>

> ==========<br>

> 7. Patches<br>

> ==========<br>

><br>

> List of patches:<br>

><br>

> 1. Extend VectorType: <a href="https://reviews.llvm.org/D32530">https://reviews.llvm.org/D32530</a><br>

> 2. Vector element type Tablegen constraint: <a href="https://reviews.llvm.org/D47768">

https://reviews.llvm.org/D47768</a><br>

> 3. LLT support for scalable vectors: <a href="https://reviews.llvm.org/D47769">

https://reviews.llvm.org/D47769</a><br>

> 4. EVT strings and Type mapping: <a href="https://reviews.llvm.org/D47770">https://reviews.llvm.org/D47770</a><br>

> 5. SVE Calling Convention: <a href="https://reviews.llvm.org/D47771">https://reviews.llvm.org/D47771</a><br>

> 6. Intrinsic lowering cleanup: <a href="https://reviews.llvm.org/D47772">https://reviews.llvm.org/D47772</a><br>

> 7. Add VScale intrinsic: <a href="https://reviews.llvm.org/D47773">https://reviews.llvm.org/D47773</a><br>

> 8. Add StepVector intrinsic: <a href="https://reviews.llvm.org/D47774">https://reviews.llvm.org/D47774</a><br>

> 9. Add SplatVector intrinsic: <a href="https://reviews.llvm.org/D47775">https://reviews.llvm.org/D47775</a><br>

> 10. Initial store patterns: <a href="https://reviews.llvm.org/D47776">https://reviews.llvm.org/D47776</a><br>

> 11. Initial addition patterns: <a href="https://reviews.llvm.org/D47777">https://reviews.llvm.org/D47777</a><br>

> 12. Initial left-shift patterns: <a href="https://reviews.llvm.org/D47778">https://reviews.llvm.org/D47778</a><br>

> 13. Implement copy logic for Z regs: <a href="https://reviews.llvm.org/D47779">

https://reviews.llvm.org/D47779</a><br>

> 14. Prevectorized loop unit test: <a href="https://reviews.llvm.org/D47780">https://reviews.llvm.org/D47780</a><br>

><br>

> _______________________________________________<br>

> LLVM Developers mailing list<br>

> llvm-dev@lists.llvm.org<br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

> _______________________________________________<br>

> LLVM Developers mailing list<br>

> llvm-dev@lists.llvm.org<br>

> <a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

</div>

</span></font></div>

</div>

</body>

</html>