<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

Hi Robin,

<div class=""><br class="">

</div>

<div class="">Thanks for the comments; replies inline (except for the stack regions question;</div>

<div class="">Sander is handling that side of things).</div>

<div class=""><br class="">

</div>

<div class="">-Graham</div>

<div class="">

<div><br class="">

<blockquote type="cite" class="">

<div class="">On 8 Jun 2018, at 16:24, Robin Kruppe <<a href="mailto:robin.kruppe@gmail.com" class="">robin.kruppe@gmail.com</a>> wrote:</div>

<br class="Apple-interchange-newline">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="">Hi Graham,</div>

<div class=""><br class="">

</div>

First of all, thanks a lot for updating the RFC and also for putting up the<br class="">

patches, they are quite interesting and illuminate some details I was curious<br class="">

about. I have some minor questions and comments inline below but overall I<br class="">

believe this is both a reasonably small extension of LLVM IR and powerful<br class="">

enough to support SVE, RVV, and hopefully future ISAs with variable length<br class="">

<div class="">vectors. Details may change as we gather more experience, but it's a very<br class="">

</div>

<div class="">good starting point.<br class="">

</div>

<div class=""><br class="">

</div>

<div class="">


<div class="">One thing I am missing is a discussion of how stack frame layout will be<br class="">

handled. Earlier RFCs mentioned a concept called "Stack Regions" but (IIRC)<br class="">

gave few details and it was a long time ago anyway. What are your current<br class="">

plans here?<br class="">

</div>

</div>

<div class=""><br class="">

</div>

I'll reply separately to the sub-thread about RISC-V codegen.<br class="">

<div class=""><br class="">

</div>

<div class="">Cheers,</div>

<div class="">Robin<br class="">

</div>

<div class="gmail_extra"><br class="">

<div class="gmail_quote">On 5 June 2018 at 15:15, Graham Hunter<span class="Apple-converted-space"> </span><span dir="ltr" class=""><<a href="mailto:Graham.Hunter@arm.com" target="_blank" class="">Graham.Hunter@arm.com</a>></span><span class="Apple-converted-space"> </span>wrote:<br class="">

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

Hi,<br class="">

<br class="">

Now that Sander has committed enough MC support for SVE, here's an updated<br class="">

RFC for variable length vector support with a set of 14 patches (listed at the end)<br class="">

to demonstrate code generation for SVE using the extensions proposed in the RFC.<br class="">

<br class="">

I have some ideas about how to support RISC-V's upcoming extension alongside<br class="">

SVE; I'll send an email with some additional comments on Robin's RFC later.<br class="">

<br class="">

Feedback and questions welcome.<br class="">

<br class="">

-Graham<br class="">

<br class="">

==============================<wbr class="">==============================<wbr class="">=<br class="">

Supporting SIMD instruction sets with variable vector lengths<br class="">

==============================<wbr class="">==============================<wbr class="">=<br class="">

<br class="">

In this RFC we propose extending LLVM IR to support code-generation for variable<br class="">

length vector architectures like Arm's SVE or RISC-V's 'V' extension. Our<br class="">

approach is backwards compatible and should be as non-intrusive as possible; the<br class="">

only change needed in other backends is how size is queried on vector types, and<br class="">

it only requires a change in which function is called. We have created a set of<br class="">

proof-of-concept patches to represent a simple vectorized loop in IR and<br class="">

generate SVE instructions from that IR. These patches (listed in section 7 of<br class="">

this RFC) can be found on Phabricator and are intended to illustrate the scope<br class="">

of changes required by the general approach described in this RFC.<br class="">

<br class="">

==========<br class="">

Background<br class="">

==========<br class="">

<br class="">

*ARMv8-A Scalable Vector Extensions* (SVE) is a new vector ISA extension for<br class="">

AArch64 which is intended to scale with hardware such that the same binary<br class="">

running on a processor with longer vector registers can take advantage of the<br class="">

increased compute power without recompilation.<br class="">

<br class="">

As the vector length is no longer a compile-time known value, the way in which<br class="">

the LLVM vectorizer generates code requires modifications such that certain<br class="">

values are now runtime evaluated expressions instead of compile-time constants.<br class="">

<br class="">

Documentation for SVE can be found at<br class="">

<a href="https://developer.arm.com/docs/ddi0584/latest/arm-architecture-reference-manual-supplement-the-scalable-vector-extension-sve-for-armv8-a" rel="noreferrer" target="_blank" class="">https://developer.arm.com/docs<wbr class="">/ddi0584/latest/arm-architectu<wbr class="">re-reference-manual-supplement<wbr class="">-the-scalable-vector-<wbr class="">extension-sve-for-armv8-a</a><br class="">

<br class="">

========<br class="">

Contents<br class="">

========<br class="">

<br class="">

The rest of this RFC covers the following topics:<br class="">

<br class="">

1. Types -- a proposal to extend VectorType to be able to represent vectors that<br class="">

   have a length which is a runtime-determined multiple of a known base length.<br class="">

<br class="">

2. Size Queries -- how to reason about the size of types for which the size isn't<br class="">

   fully known at compile time.<br class="">

<br class="">

3. Representing the runtime multiple of vector length in IR for use in address<br class="">

   calculations and induction variable comparisons.<br class="">

<br class="">

4. Generating 'constant' values in IR for vectors with a runtime-determined<br class="">

   number of elements.<br class="">

<br class="">

5. A brief note on code generation of these new operations for AArch64.<br class="">

<br class="">

6. An example of C code and matching IR using the proposed extensions.<br class="">

<br class="">

7. A list of patches demonstrating the changes required to emit SVE instructions<br class="">

   for a loop that has already been vectorized using the extensions described<br class="">

   in this RFC.<br class="">

<br class="">

========<br class="">

1. Types<br class="">

========<br class="">

<br class="">

To represent a vector of unknown length a boolean `Scalable` property has been<br class="">

added to the `VectorType` class, which indicates that the number of elements in<br class="">

the vector is a runtime-determined integer multiple of the `NumElements` field.<br class="">

Most code that deals with vectors doesn't need to know the exact length, but<br class="">

does need to know relative lengths -- e.g. get a vector with the same number of<br class="">

elements but a different element type, or with half or double the number of<br class="">

elements.<br class="">

<br class="">

In order to allow code to transparently support scalable vectors, we introduce<br class="">

an `ElementCount` class with two members:<br class="">

<br class="">

- `unsigned Min`: the minimum number of elements.<br class="">

- `bool Scalable`: is the element count an unknown multiple of `Min`?<br class="">

<br class="">

For non-scalable vectors (``Scalable=false``) the scale is considered to be<br class="">

equal to one and thus `Min` represents the exact number of elements in the<br class="">

vector.<br class="">

<br class="">

The intent for code working with vectors is to use convenience methods and avoid<br class="">

directly dealing with the number of elements. If needed, calling<br class="">

`getElementCount` on a vector type instead of `getVectorNumElements` can be used<br class="">

to obtain the (potentially scalable) number of elements. Overloaded division and<br class="">

multiplication operators allow an ElementCount instance to be used in much the<br class="">

same manner as an integer for most cases.<br class="">

<br class="">

This mixture of compile-time and runtime quantities allows us to reason about the<br class="">

relationship between different scalable vector types without knowing their<br class="">

exact length.<br class="">
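
<br class="">
As a rough sketch of the intent (this is not the code from the patches), an<br class="">
`ElementCount`-like struct could look like the following. The exact operator<br class="">
set, the exact-division assertion and the `halfElements` helper are assumptions<br class="">
made purely for illustration.<br class="">
<br class="">
<pre class="">
```cpp
#include <cassert>

// Hypothetical stand-in for the proposed ElementCount class; the real
// definition lives in the patches and may differ.
struct ElementCount {
  unsigned Min;   // minimum number of elements
  bool Scalable;  // true if the real count is an unknown multiple of Min

  // Scaling the count keeps the Scalable flag, so relative lengths can be
  // computed without knowing the runtime multiple.
  ElementCount operator*(unsigned RHS) const {
    return ElementCount{Min * RHS, Scalable};
  }
  ElementCount operator/(unsigned RHS) const {
    assert(RHS != 0 && Min % RHS == 0 && "division is assumed to be exact");
    return ElementCount{Min / RHS, Scalable};
  }
  bool operator==(const ElementCount &RHS) const {
    return Min == RHS.Min && Scalable == RHS.Scalable;
  }
};

// Illustrative helper: the element count of a vector with half as many
// elements, whether or not the input is scalable.
ElementCount halfElements(ElementCount EC) { return EC / 2; }
```
</pre>
With this shape, halving the count of a `scalable 4` vector gives a `scalable 2`<br class="">
count, and a scalable count never compares equal to a fixed one.<br class="">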

<br class="">

The runtime multiple is not expected to change during program execution for SVE,<br class="">

but it is possible. The model of scalable vectors presented in this RFC assumes<br class="">

that the multiple will be constant within a function but not necessarily across<br class="">

functions. As suggested in the recent RISC-V RFC, a new function attribute to<br class="">

inherit the multiple across function calls will allow for function calls with<br class="">

vector arguments/return values and inlining/outlining optimizations.<br class="">

<br class="">

IR Textual Form<br class="">

---------------<br class="">

<br class="">

The textual form for a scalable vector is:<br class="">

<br class="">

``<scalable <n> x <type>>``<br class="">

<br class="">

where `type` is the scalar type of each element, `n` is the minimum number of<br class="">

elements, and the string literal `scalable` indicates that the total number of<br class="">

elements is an unknown multiple of `n`; `scalable` is just an arbitrary choice<br class="">

for indicating that the vector is scalable, and could be substituted by another.<br class="">

For fixed-length vectors, the `scalable` is omitted, so there is no change in<br class="">

the format for existing vectors.<br class="">

<br class="">

Scalable vectors with the same `Min` value have the same number of elements, and<br class="">

the same number of bytes if `Min * sizeof(type)` is the same (assuming they are<br class="">

used within the same function):<br class="">

<br class="">

``<scalable 4 x i32>`` and ``<scalable 4 x i8>`` have the same number of<br class="">

  elements.<br class="">

<br class="">

``<scalable 4 x i32>`` and ``<scalable 8 x i16>`` have the same number of<br class="">

  bytes.<br class="">
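
<br class="">
The two relationships above can be checked mechanically. The sketch below uses<br class="">
a hypothetical `ScalableVecDesc` record (not a type from the patches) and<br class="">
assumes both vectors are used within the same function, so they share the same<br class="">
runtime multiple.<br class="">
<br class="">
<pre class="">
```cpp
#include <cassert>

// Hypothetical description of a scalable vector type: Min elements, each
// EltBits wide. The runtime multiple is deliberately not stored.
struct ScalableVecDesc {
  unsigned Min;
  unsigned EltBits;
};

// Same number of elements: only Min matters, since the multiple is shared.
bool sameElementCount(ScalableVecDesc A, ScalableVecDesc B) {
  return A.Min == B.Min;
}

// Same number of bytes: the minimum sizes must match; the shared multiple
// scales both sides equally.
bool sameByteSize(ScalableVecDesc A, ScalableVecDesc B) {
  return A.Min * A.EltBits == B.Min * B.EltBits;
}
```
</pre>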

<br class="">

IR Bitcode Form<br class="">

---------------<br class="">

<br class="">

To serialize scalable vectors to bitcode, a new boolean field is added to the<br class="">

type record. If the field is not present the type will default to a fixed-length<br class="">

vector type, preserving backwards compatibility.<br class="">

<br class="">

Alternatives Considered<br class="">

-----------------------<br class="">

<br class="">

We did consider one main alternative -- a dedicated target type, like the<br class="">

x86_mmx type.<br class="">

<br class="">

A dedicated target type would either need to extend all existing passes that<br class="">

work with vectors to recognize the new type, or to duplicate all that code<br class="">

in order to get reasonable code generation and autovectorization.<br class="">

<br class="">

This hasn't been done for the x86_mmx type, and so it is only capable of<br class="">

providing support for C-level intrinsics instead of being used and recognized by<br class="">

passes inside llvm.<br class="">

<br class="">

Although our current solution will need to change some of the code that creates<br class="">

new VectorTypes, much of that code doesn't need to care about whether the types<br class="">

are scalable or not -- they can use preexisting methods like<br class="">

`getHalfElementsVectorType`. If the code is a little more complex,<br class="">

`ElementCount` structs can be used instead of an `unsigned` value to represent<br class="">

the number of elements.<br class="">

<br class="">

===============<br class="">

2. Size Queries<br class="">

===============<br class="">

<br class="">

This is a proposal for how to deal with querying the size of scalable types.<br class="">

While it has not been implemented in full, the general approach works well<br class="">

for calculating offsets into structures with scalable types in a modified<br class="">

version of ComputeValueVTs in our downstream compiler.<br class="">

<br class="">

Current IR types that have a known size all return a single integer constant.<br class="">

For scalable types a second integer is needed to indicate the number of bytes<br class="">

which need to be scaled by the runtime multiple to obtain the actual length.<br class="">

<br class="">

For primitive types, getPrimitiveSizeInBits will function as it does today,<br class="">

except that it will no longer return a size for vector types (it will return 0,<br class="">

as it does for other derived types). The majority of calls to this function are<br class="">

already for scalar rather than vector types.<br class="">

<br class="">

For derived types, a function (getSizeExpressionInBits) to return a pair of<br class="">

integers (one to indicate unscaled bits, the other for bits that need to be<br class="">

scaled by the runtime multiple) will be added. For backends that do not need to<br class="">

deal with scalable types, another function (getFixedSizeExpressionInBits) that<br class="">

only returns unscaled bits will be provided, with a debug assert that the type<br class="">

isn't scalable.<br class="">

<br class="">

Similar functionality will be added to DataLayout.<br class="">

<br class="">

Comparing two of these sizes together is straightforward if only unscaled sizes<br class="">

are used. Comparisons between scaled sizes are also simple when comparing sizes<br class="">

within a function (or across functions with the inherit flag mentioned in the<br class="">

changes to the type), but scaled sizes cannot be compared otherwise. If a mix is present,<br class="">

then any number of unscaled bits will not be considered to have a greater size<br class="">

than a smaller number of scaled bits, but a smaller number of unscaled bits<br class="">

will be considered to have a smaller size than a greater number of scaled bits<br class="">

(since the runtime multiple is at least one).<br class="">
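
<br class="">
These rules could be sketched roughly as follows, assuming both sizes are<br class="">
queried within the same function so that they share one runtime multiple.<br class="">
`SizeExpr`, `SizeOrder` and `compareSizes` are hypothetical names, not part of<br class="">
the patches; `Unknown` covers the cases where the answer depends on the<br class="">
runtime multiple.<br class="">
<br class="">
<pre class="">
```cpp
#include <cassert>
#include <cstdint>

// Hypothetical size pair: total bits = UnscaledBits + ScaledBits * multiple.
struct SizeExpr {
  uint64_t UnscaledBits;
  uint64_t ScaledBits;
};

enum class SizeOrder { Less, Equal, Greater, Unknown };

// Compare A and B for every possible runtime multiple >= 1. Sizes are
// assumed small enough that the signed differences below cannot overflow.
SizeOrder compareSizes(SizeExpr A, SizeExpr B) {
  if (A.UnscaledBits == B.UnscaledBits && A.ScaledBits == B.ScaledBits)
    return SizeOrder::Equal;
  // Difference as a function of the multiple v: D(v) = D0 + D1 * v.
  int64_t D0 = static_cast<int64_t>(A.UnscaledBits) -
               static_cast<int64_t>(B.UnscaledBits);
  int64_t D1 = static_cast<int64_t>(A.ScaledBits) -
               static_cast<int64_t>(B.ScaledBits);
  // Negative at v == 1 and never increasing: A is smaller for all v >= 1.
  if (D0 + D1 < 0 && D1 <= 0)
    return SizeOrder::Less;
  // Positive at v == 1 and never decreasing: A is larger for all v >= 1.
  if (D0 + D1 > 0 && D1 >= 0)
    return SizeOrder::Greater;
  return SizeOrder::Unknown;
}
```
</pre>
For example, 64 unscaled bits compare as smaller than 128 scaled bits, while<br class="">
256 unscaled bits against 128 scaled bits gives `Unknown` rather than greater.<br class="">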

</blockquote>

<div class=""><br class="">

</div>

<div class="">You mention it's not fully implemented yet, but perhaps you have some thoughts<br class="">

on what the APIs for this should look like?<br class="">

<br class="">

For size comparisons it's concerning that it could silently give misleading<br class="">

results when operating across function boundaries. One solution could be a<br class="">

function-specific API like `compareSizesIn(Type *, Type *, Function *)`, but<br class="">

that extra parameter may turn out to be very viral. OTOH, maybe that's<br class="">

unavoidable complexity from having type "sizes" vary between functions.<br class="">

</div>

<div class=""><br class="">

</div>

<div class="">Alternatively, general size queries could return "incomparable" and "only if<br class="">

in the same function" results in addition to smaller/larger/equal. This might<br class="">

nudge code toward handling all these possibilities as well as it can.<br class="">

</div>

</div>

</div>

</div>

</div>

</blockquote>

<div><br class="">

</div>

<div>I agree that it would be nice to catch invalid comparisons; I've considered a</div>

<div>function that would take two 'Value*'s instead of types so that the parent</div>

<div>could be determined, but I don't think that works in all cases.</div>

<div><br class="">

</div>

<div>Using an optional 'Function*' argument in a size comparison function will</div>

<div>work; I think many of the existing size queries are on scalar values in code</div>

<div>that has already explicitly checked that it's not working on aggregate types.</div>

<div><br class="">

</div>

<div>It may be a better idea to catch misuses when trying to clone instructions</div>

<div>into a different function.</div>

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="gmail_extra">

<div class="gmail_quote">

<div class=""><br class="">

</div>

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

Future Work<br class="">

-----------<br class="">

<br class="">

Since we cannot determine the exact size of a scalable vector, the<br class="">

existing logic for alias detection won't work when multiple accesses<br class="">

share a common base pointer with different offsets.<br class="">

<br class="">

However, SVE's predication will mean that a dynamic 'safe' vector length<br class="">

can be determined at runtime, so after initial support has been added we<br class="">

can work on vectorizing loops using runtime predication to avoid aliasing<br class="">

problems.<br class="">

<br class="">

Alternatives Considered<br class="">

-----------------------<br class="">

<br class="">

Marking scalable vectors as unsized doesn't work well, as many parts of<br class="">

llvm dealing with loads and stores assert that 'isSized()' returns true<br class="">

and make use of the size when calculating offsets.<br class="">

</blockquote>

<div class=""><br class="">

</div>

<div class="">Seconded, I encountered those as well.<br class="">

</div>

<div class=""><br class="">

</div>

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

We have considered introducing multiple helper functions instead of<br class="">

using direct size queries, but that doesn't cover all cases. It may<br class="">

still be a good idea to introduce them to make the purpose in a given<br class="">

case more obvious, e.g. 'isBitCastableTo(Type*,Type*)'<wbr class="">.<br class="">

</blockquote>

<div class=""> </div>

+1 for clear helpers, but this sentiment is somewhat independent of the<br class="">

changes to type sizes. For example, `isBitCastableTo` seems appealing to me<br class="">

not just because it clarifies the intent of the size comparison but also<br class="">

because it can encapsulate all the cases where types have the same size but<br class="">

bitcasts still aren't allowed (ptr<->int, aggregates). With a good API for the<br class="">

{scaled, unscaled} pairs, the size comparison should not be much more complex<br class="">

than today.<br class="">

</div>

</div>

</div>

</div>

</blockquote>

<div><br class="">

</div>

<div>Agreed, it's orthogonal to the scalable vector part.</div>

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="gmail_extra">

<div class="gmail_quote">

<div class=""><br class="">

Aside: I was curious, so I grepped and found that this specific predicate<br class="">

already exists under the name CastInst::isBitCastable.<br class="">

</div>

</div>

</div>

</div>

</div>

</blockquote>

<div><br class="">

</div>

<div>So it does; there are a few more cases around the codebase that don't use that function</div>

<div>though. I may create a cleanup patch for them.</div>

<div><br class="">

</div>

<div>Other possibilities might be 'requires[Sign|Zero]Extension', 'requiresTrunc', 'getLargestType',</div>

<div>etc.</div>

<div><br class="">

</div>

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="gmail_extra">

<div class="gmail_quote">

<div class=""> </div>

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

==============================<wbr class="">==========<br class="">

3. Representing Vector Length at Runtime<br class="">

==============================<wbr class="">==========<br class="">

<br class="">

With a scalable vector type defined, we now need a way to represent the runtime<br class="">

length in IR in order to generate addresses for consecutive vectors in memory<br class="">

and determine how many elements have been processed in an iteration of a loop.<br class="">

<br class="">

We have added an experimental `vscale` intrinsic to represent the runtime<br class="">

multiple. Multiplying the result of this intrinsic by the minimum number of<br class="">

elements in a vector gives the total number of elements in a scalable vector.<br class="">

<br class="">

Fixed-Length Code<br class="">

-----------------<br class="">

<br class="">

Assuming a vector type of <4 x <ty>><br class="">

``<br class="">

vector.body:<br class="">

  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br class="">

  ;; <loop body><br class="">

  ;; Increment induction var<br class="">

  %index.next = add i64 %index, 4<br class="">

  ;; <check and branch><br class="">

``<br class="">

Scalable Equivalent<br class="">

-------------------<br class="">

<br class="">

Assuming a vector type of <scalable 4 x <ty>><br class="">

``<br class="">

vector.body:<br class="">

  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br class="">

  ;; <loop body><br class="">

  ;; Increment induction var<br class="">

  %vscale64 = call i64 @llvm.experimental.vector.vsca<wbr class="">le.64()<br class="">

</blockquote>

<div class=""><br class="">

I didn't see anything about this in the text (apologies if I missed<br class="">

something), but it appears this intrinsic is overloaded to be able to return<br class="">

any integer width. It's not a big deal either way, but is there a particular<br class="">

reason for doing that, rather than picking one sufficiently large integer type<br class="">

and combining it with trunc/zext as appropriate?<br class="">

</div>

</div>

</div>

</div>

</div>

</blockquote>

<div><br class="">

</div>

<div>It's a leftover from me converting from the constant, which could be of any</div>

<div>integer type.</div>

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="gmail_extra">

<div class="gmail_quote">

<div class=""> <span class="Apple-converted-space"> </span><br class="">

</div>

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)<br class="">

</blockquote>

<div class=""> </div>

<div class="">Just to check, is the nesting `add i64 %index, mul (i64 %vscale64, i64 4)` a<br class="">

pseudo-IR shorthand or an artifact from when vscale was proposed as a constant<br class="">

expression or something? I would have expected:<br class="">

<br class="">

```<br class="">

%vscale64 = call i64 @llvm.experimental.vector.<wbr class="">vscale.64()<br class="">

%vscale64.x4 = mul i64 %vscale64, 4<br class="">

%index.next = add i64 %index, %vscale64.x4</div>

</div>

</div>

</div>

</div>

</blockquote>

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="gmail_extra">

<div class="gmail_quote">

<div class="">```<br class="">

</div>

</div>

</div>

</div>

</div>

</blockquote>

<div><br class="">

</div>

<div>The latter, though in this case I updated the example IR by hand so didn't catch</div>

<div>that case ;)</div>

<div><br class="">

</div>

<div>The IR in the example patch was generated by the compiler, and does split out the</div>

<div>multiply into a separate instruction (which is then strength-reduced to a shift).</div>

<br class="">

<blockquote type="cite" class="">

<div class="">

<div dir="ltr" style="font-family: Helvetica; font-size: 16px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class="">

<div class="gmail_extra">

<div class="gmail_quote">

<div class=""><br class="">

</div>

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

  ;; <check and branch><br class="">

``<br class="">

===========================<br class="">

4. Generating Vector Values<br class="">

===========================<br class="">

For constant vector values, we cannot specify all the elements as we can for<br class="">

fixed-length vectors; fortunately only a small number of easily synthesized<br class="">

patterns are required for autovectorization. The `zeroinitializer` constant<br class="">

can be used in the same manner as fixed-length vectors for a constant zero<br class="">

splat. This can then be combined with `insertelement` and `shufflevector`<br class="">

to create arbitrary value splats in the same manner as fixed-length vectors.<br class="">

<br class="">

For constants consisting of a sequence of values, an experimental `stepvector`<br class="">

intrinsic has been added to represent a simple constant of the form<br class="">

`<0, 1, 2... num_elems-1>`. To change the starting value a splat of the new<br class="">

start can be added, and changing the step requires multiplying by a splat.<br class="">

</blockquote>

<div class=""><br class="">

</div>

<div class="">+1 for making this intrinsic minimal and having canonical IR instruction<br class="">

sequences for things like stride and starting offset.<br class="">

</div>

<div class=""> </div>

<blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;">

Fixed-Length Code<br class="">

-----------------<br class="">

``<br class="">

  ;; Splat a value<br class="">

  %insert = insertelement <4 x i32> undef, i32 %value, i32 0<br class="">

  %splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer<br class="">

  ;; Add a constant sequence<br class="">

  %add = add <4 x i32> %splat, <i32 2, i32 4, i32 6, i32 8><br class="">

``<br class="">

Scalable Equivalent<br class="">

-------------------<br class="">

``<br class="">

  ;; Splat a value<br class="">

  %insert = insertelement <scalable 4 x i32> undef, i32 %value, i32 0<br class="">

  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer<br class="">

  ;; Splat offset + stride (the same in this case)<br class="">

  %insert2 = insertelement <scalable 4 x i32> undef, i32 2, i32 0<br class="">

  %str_off = shufflevector <scalable 4 x i32> %insert2, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer<br class="">

  ;; Create sequence for scalable vector<br class="">

  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.step<wbr class="">vector.nxv4i32()<br class="">

  %mulbystride = mul <scalable 4 x i32> %stepvector, %str_off<br class="">

  %addoffset = add <scalable 4 x i32> %mulbystride, %str_off<br class="">

  ;; Add the runtime-generated sequence<br class="">

  %add = add <scalable 4 x i32> %splat, %addoffset<br class="">

``<br class="">
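<br class="">
To make the arithmetic concrete, here is a scalar C++ emulation of the scalable<br class="">
sequence above. `stepVector`, `makeSequence` and the explicit `VScale` parameter<br class="">
are stand-ins invented for this sketch; in IR the multiple is only known at<br class="">
runtime. With a multiple of one and an offset and stride of 2 it reproduces the<br class="">
fixed-length constant 2, 4, 6, 8.<br class="">
<br class="">
<pre class="">
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Emulates the proposed stepvector intrinsic: 0, 1, 2, ... NumElts-1.
std::vector<int32_t> stepVector(unsigned NumElts) {
  std::vector<int32_t> V(NumElts);
  for (unsigned I = 0; I < NumElts; ++I)
    V[I] = static_cast<int32_t>(I);
  return V;
}

// Builds Start + Stride * stepvector for a vector of MinElts * VScale
// elements, mirroring the mul-by-splat then add-splat sequence in the IR.
std::vector<int32_t> makeSequence(int32_t Start, int32_t Stride,
                                  unsigned MinElts, unsigned VScale) {
  assert(VScale >= 1 && "the runtime multiple is at least one");
  std::vector<int32_t> Seq = stepVector(MinElts * VScale);
  for (int32_t &E : Seq)
    E = E * Stride + Start;
  return Seq;
}
```
</pre>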

Future Work<br class="">

-----------<br class="">

<br class="">

Intrinsics cannot currently be used for constant folding. Our downstream<br class="">

compiler (using Constants instead of intrinsics) relies quite heavily on this<br class="">

for good code generation, so we will need to find new ways to recognize and<br class="">

fold these values.<br class="">

<br class="">

==================<br class="">

5. Code Generation<br class="">

==================<br class="">

<br class="">

IR splats will be converted to an experimental splatvector intrinsic in<br class="">

SelectionDAGBuilder.<br class="">

<br class="">

All three intrinsics are custom lowered and legalized in the AArch64 backend.<br class="">

<br class="">

Two new AArch64ISD nodes have been added to represent the same concepts<br class="">

at the SelectionDAG level, while splatvector maps onto the existing<br class="">

AArch64ISD::DUP.<br class="">

<br class="">

GlobalISel<br class="">

----------<br class="">

<br class="">

Since GlobalISel was enabled by default on AArch64, it was necessary to add<br class="">

scalable vector support to the LowLevelType implementation. A single bit was<br class="">

added to the raw_data representation for vectors and vectors of pointers.<br class="">

<br class="">

In addition, types that only exist in destination patterns are planted in<br class="">

the enumeration of available types for generated code. While this may not be<br class="">

necessary in future, generating an all-true 'ptrue' value was necessary to<br class="">

convert a predicated instruction into an unpredicated one.<br class="">

<br class="">

==========<br class="">

6. Example<br class="">

==========<br class="">

<br class="">

The following example shows a simple C loop which assigns the array index to<br class="">

the array elements matching that index. The IR shows how vscale and stepvector<br class="">

are used to create the needed values and to advance the index variable in the<br class="">

loop.<br class="">

<br class="">

C Code<br class="">

------<br class="">

<br class="">

``<br class="">

void IdentityArrayInit(int *a, int count) {<br class="">

  for (int i = 0; i < count; ++i)<br class="">

    a[i] = i;<br class="">

}<br class="">

``<br class="">

<br class="">

Scalable IR Vector Body<br class="">

-----------------------<br class="">

<br class="">

``<br class="">

vector.body.preheader:<br class="">

  ;; Other setup<br class="">

  ;; Stepvector used to create initial identity vector<br class="">

  %stepvector = call <scalable 4 x i32> @llvm.experimental.vector.step<wbr class="">vector.nxv4i32()<br class="">

  br label %vector.body<br class="">

<br class="">

vector.body:<br class="">

  %index = phi i64 [ %index.next, %vector.body ], [ 0, %vector.body.preheader ]<br class="">

  %0 = phi i64 [ %1, %vector.body ], [ 0, %vector.body.preheader ]<br class="">

<br class="">

           ;; stepvector used for index identity on entry to loop body ;;<br class="">

  %vec.ind7 = phi <scalable 4 x i32> [ %step.add8, %vector.body ],<br class="">

                                     [ %stepvector, %vector.body.preheader ]<br class="">

  %vscale64 = call i64 @llvm.experimental.vector.vscale.64()<br class="">

  %vscale32 = trunc i64 %vscale64 to i32<br class="">

  %1 = add i64 %0, mul (i64 %vscale64, i64 4)<br class="">

<br class="">

           ;; vscale splat used to increment identity vector ;;<br class="">

  %insert = insertelement <scalable 4 x i32> undef, i32 mul (i32 %vscale32, i32 4), i32 0<br class="">

  %splat = shufflevector <scalable 4 x i32> %insert, <scalable 4 x i32> undef, <scalable 4 x i32> zeroinitializer<br class="">

  %step.add8 = add <scalable 4 x i32> %vec.ind7, %splat<br class="">

  %2 = getelementptr inbounds i32, i32* %a, i64 %0<br class="">

  %3 = bitcast i32* %2 to <scalable 4 x i32>*<br class="">

  store <scalable 4 x i32> %vec.ind7, <scalable 4 x i32>* %3, align 4<br class="">

<br class="">

           ;; vscale used to increment loop index<br class="">

  %index.next = add i64 %index, mul (i64 %vscale64, i64 4)<br class="">

  %4 = icmp eq i64 %index.next, %n.vec<br class="">

  br i1 %4, label %middle.block, label %vector.body, !llvm.loop !5<br class="">

``<br class="">
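<br class="">

For readers who want to trace the vector body by hand, the following C model<br class="">
mimics it one lane at a time. It is only a sketch: vscale is passed in as a<br class="">
parameter, %n.vec is assumed to be count rounded down to a multiple of<br class="">
4 * vscale (the IR above does not show how it is computed), and the scalar<br class="">
remainder loop and the MAX_LANES cap are additions made for this example.<br class="">
<br class="">

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LANES 64 /* arbitrary cap for this model */

/* Scalar model of the scalable vector body. 'vscale' is a parameter here
 * because plain C has no equivalent of the vscale intrinsic. */
static void identity_array_init_model(int *a, int count, unsigned vscale) {
  const unsigned vf = 4 * vscale; /* lanes in <scalable 4 x i32> */
  assert(vscale > 0 && vf <= MAX_LANES);
  if (count <= 0)
    return;

  /* %stepvector: the initial identity vector 0, 1, 2, ... */
  int32_t vec_ind[MAX_LANES];
  for (unsigned lane = 0; lane < vf; ++lane)
    vec_ind[lane] = (int32_t)lane;

  /* Assumed definition of %n.vec: count rounded down to a multiple of vf. */
  const int64_t n_vec = count - (count % (int)vf);

  for (int64_t index = 0; index != n_vec; index += vf) {
    /* store <scalable 4 x i32> %vec.ind7 at a + index */
    for (unsigned lane = 0; lane < vf; ++lane)
      a[index + lane] = vec_ind[lane];
    /* %step.add8 = %vec.ind7 + splat(vscale * 4) */
    for (unsigned lane = 0; lane < vf; ++lane)
      vec_ind[lane] += (int32_t)vf;
  }

  /* Scalar remainder; not part of the IR shown above. */
  for (int64_t i = n_vec; i < count; ++i)
    a[i] = (int)i;
}
```

<br class="">
Calling the model with different vscale values on the same input should give<br class="">
identical arrays, which is the property the scalable IR relies on.<br class="">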

<br class="">

==========<br class="">

7. Patches<br class="">

==========<br class="">

<br class="">

List of patches:<br class="">

<br class="">

1. Extend VectorType:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D32530" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D32530</a><br class="">

2. Vector element type Tablegen constraint:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47768" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47768</a><br class="">

3. LLT support for scalable vectors:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47769" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47769</a><br class="">

4. EVT strings and Type mapping:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47770" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47770</a><br class="">

5. SVE Calling Convention:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47771" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47771</a><br class="">

6. Intrinsic lowering cleanup:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47772" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47772</a><br class="">

7. Add VScale intrinsic:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47773" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47773</a><br class="">

8. Add StepVector intrinsic:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47774" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47774</a><br class="">

9. Add SplatVector intrinsic:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47775" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47775</a><br class="">

10. Initial store patterns:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47776" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47776</a><br class="">

11. Initial addition patterns:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47777" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47777</a><br class="">

12. Initial left-shift patterns:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47778" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47778</a><br class="">

13. Implement copy logic for Z regs:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47779" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47779</a><br class="">

14. Prevectorized loop unit test:<span class="Apple-converted-space"> </span><a href="https://reviews.llvm.org/D47780" rel="noreferrer" target="_blank" class="">https://reviews.llvm.org/D47780</a></blockquote>

</div>

</div>

</div>

</div>

</blockquote>

</div>

<br class="">

</div>

</body>

</html>