[llvm-dev] [RFC] Supporting ARM's SVE in LLVM

Sat Nov 26 03:49:10 PST 2016

Hi Renato,

There are more comments inline below, but I wanted to highlight a couple of things.  Firstly, we are not in a position to release the SVE specification at this time; we'll send a message when it's publicly available.

Related to this, I want to push this and related conversations in a different direction.  From the outset, our approach to adding SVE support to LLVM IR has been about solving the generic problem of vectorising for an unknown vector length and then extending this to support predication.  With this in mind, I would rather the problem and its solution be discussed at the IR's level of abstraction rather than getting into the guts of SVE.

I suggest this because trying to understand the nuances of mapping IR types to SVE registers, first-faulting loads and predicate partition instructions without access to the full specification is going to be painful, will lead to confusion, and is ultimately unnecessary.

Your example is potentially more complex than what we'll be working towards in the short to medium term.  So I apologise if some of my responses seem a little dismissive but I'm keen to keep us on point whilst we work towards getting SVE support for the simple vectorisation cases upstream.

Paul!!!

> On 25/11/2016, 13:39, "Renato Golin" <renato.golin at linaro.org> wrote:
>
> Hi Graham,
> 
> I'll look into the patches next, but first some questions after
> reading the available white papers on the net.
> 
> On 24 November 2016 at 15:39, Graham Hunter <Graham.Hunter at arm.com> wrote:
> > This complex constant represents the runtime value of `n` for any scalable type
> > `<n x m x ty>`. This is primarily used to increment induction variables and
> > generate offsets.
> 
> What do you mean by "complex constant"? Surely not Complex, but this
> is not really a constant either.

"complex constant" is the term used within the LangRef.  Although its value can be different across certain interfaces this does not need to be modelled within the IR and thus for all intents and purposes we can safely consider it to be constant.

> From what I read around (and this is why releasing the spec is
> important, because I'm basing my reviews on guess work), is that the
> length of a vector is not constant, even on the same machine.
>
> In theory, according to a post in the ARM forums (which now I forget),
> the kernel could choose the vector length per process, meaning this is
> not known even at link time.
> 
> But that's ok, because the SVE instructions completely (I'm guessing,
> again) bypass the need for that "constant" to be constant at all, ie,
> the use of `incw/incp`. Since you can fail half-way through, the width
> that you need to increment to the induction variable is not even known
> at run time! Meaning, that's not a constant at all!
>
> Example: a[i] = b[ c[i] ];
>   ld1w  z0.s, p0/z, [ c, i, lsl 2 ]
>   ld1w  z1.s, p0/z, [ b, z0.s, sxtw 2 ]

This is not how speculation is handled within SVE.  This is not the context to dig into this subject so perhaps we can start a separate thread.  I ask this because speculation within the vectoriser is independent of scalable vectors.

> Now, z0.s load may have failed with seg fault somewhere, and it's up
> to the FFR to tell brka/brkb how to deal with this.
>
> Each iteration will have:
>   * The same vector length *per process* for accessing c[]
>   * A potentially *different* vector length, *per iteration*, for accessing b[]
> 
> So, while <n x m x i32> could be constant on some vectors, even at
> compile time (if we have a flag that forces certain length), it could
> be unknown *per iteration* at run time.

I am not sure what point you are trying to make.  I agree that when doing speculative loads the induction variable update is potentially different per iteration, being based on the result of the speculative load.

"vscale" is not trying to represent the result of such speculation. It's purely a constant runtime vector length multiplier.  Such a value is required by LoopVectorize to update induction variables as described below, plus simple interactions like extracting the last element of a scalable vector.
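
To illustrate the "last element" case, here is a minimal Python model (not LLVM code; `last_element`, `min_lanes` and the use of a plain list as a "vector" are assumptions made up for this sketch).  The only point is that the index depends on vscale, so the IR needs some way to name that value.

```python
def last_element(vec, vscale, min_lanes=4):
    """Return the last lane of a modelled <n x 4 x ty> vector.

    `vscale` stands in for the runtime constant "n"; `min_lanes` is the
    compile-time 4.  Both names are hypothetical, for illustration only.
    """
    lanes = vscale * min_lanes      # total lane count, fixed for the whole run
    assert len(vec) == lanes        # the modelled vector must match that length
    return vec[lanes - 1]           # last lane sits at index vscale * 4 - 1
```

For vscale = 2 the modelled vector has 8 lanes and the last one is at index 7.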

On a related note, don't directly link "<n x m x ty>" types to SVE registers.  Although some will map directly, we adopt the same approach as for non-scalable vectors, in that within IR you can represent scalable vectors that are larger/smaller than those directly supported by the target.

> > ```llvm
> >   %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> > ```
> 
> Right, this would be translated to:
>   incw   x2
> 
> Now, the question is, why do we need "mul (i64 vscale, i64 4)" in the IR?

The answer is because that is how LoopVectorize updates its induction values.
For non-scalable vectors you would see:

    %index.next = add nuw nsw i64 %index, 4

for a VF of 4.  Why wouldn't you want to see:

    %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)

for a VF of "n*4" (remembering that vscale is the "n" in "<n x 4 x Ty>")?
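
As a rough sketch of the induction update above, here is a small Python model of a vectorised loop (again not LLVM code; `vectorised_sum` and `min_lanes` are hypothetical names, and list slices stand in for vector loads).  Because vscale is fixed for the whole run, the step `vscale * 4` is loop-invariant, just as a step of 4 is for a fixed-width vector.

```python
def vectorised_sum(data, vscale, min_lanes=4):
    """Sum `data` in chunks of vscale * min_lanes elements.

    Models a loop vectorised with <n x 4 x i32>: `vscale` plays the role
    of the runtime constant "n", `min_lanes` the compile-time 4.
    """
    step = vscale * min_lanes            # like mul (i64 vscale, i64 4), loop-invariant
    total = 0
    index = 0
    # Vector body: runs while a full vector's worth of elements remains.
    while index + step <= len(data):
        total += sum(data[index:index + step])
        index += step                    # like %index.next = add %index, step
    # Scalar remainder loop for any leftover tail elements.
    while index < len(data):
        total += data[index]
        index += 1
    return total
```

The result is the same for any vscale; only the number of trips through the vector body changes.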

> There is no semantic analysis you can do on a value that can change on
> every iteration of the loop. You can't elide, hoist, combine or const
> fold.
> 
> If I got it right (from random documents on the web), `incX` relates
> to a number of "increment induction" functionality. `incw` is probably
> "increment W", ie. 32-bits, while `incp` is "increment predicate", ie.
> whatever the size of the predicate you use:
>
> Examples:
>   incw  x2          # increments x2 to 4*(FFR successful lanes)
>   incp  x2, p0.b  # increments x2 to 1*(FFR successful lanes)
> 
> So, this IR semantics is valid for the second case, but irrelevant for
> the second. Also, I'm worried that we'll end up ignoring the
> multiplier altogether, if we change the vector types (from byte to
> word, for example), or make the process of doing so more complex.

As mentioned above I'd rather not describe the details of SVE instructions at this time because it'll only distract from the generic IR representation we are aiming for.

> > The following shows the construction of a scalable vector of the form
> > <start, start-2, start-4, ...>:
> >
> > ```llvm
> >   %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0
> >   %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
> >   %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0
> >   %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
> >   %stridevec = mul <n x 4 x i32> stepvector, %widestep
> >   %finalvec = add <n x 4 x i32> %widestart, %stridevec
> > ```
> 
> This is really fragile and confusing, and I agree with James, an
> intrinsic here would be *much* better.
> 
> Something like
> 
> %const_vec = <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)

This intrinsic matches the seriesvector instruction we originally proposed.  However, on reflection we didn't like how it allowed multiple representations for the same constant.  Instead we prefer "stepvector" to better allow a single canonical form for scalable vectors.

I know this doesn't preclude the use of an intrinsic, I just wanted to highlight that doing so doesn't automatically change the surrounding IR.

We wonder if this canonical form is worth explicitly using across all vector types to maintain a single code path (e.g. GEP-related IR matching for strided access patterns) and to allow a prettier textual IR (e.g. a non-scalable 1024-bit vector of bytes means a 128-entry constant vector to type).
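
For reference, the canonical form can be modelled lane by lane in Python as follows (a sketch only; `splat`, `stepvector` and `series` are hypothetical helpers standing in for the shufflevector splat, the stepvector constant and the mul/add pair in the quoted IR).

```python
def splat(value, lanes):
    """Model insertelement + shufflevector with a zeroinitializer mask."""
    return [value] * lanes

def stepvector(lanes):
    """Model the stepvector constant <0, 1, 2, ...>."""
    return list(range(lanes))

def series(start, step, vscale, min_lanes=4):
    """Build <start, start+step, start+2*step, ...> for <n x 4 x i32>."""
    lanes = vscale * min_lanes
    widestart = splat(start, lanes)
    widestep = splat(step, lanes)
    # %stridevec = mul stepvector, %widestep
    stridevec = [s * w for s, w in zip(stepvector(lanes), widestep)]
    # %finalvec = add %widestart, %stridevec
    return [a + b for a, b in zip(widestart, stridevec)]
```

With start = 10 and step = -2 this yields <10, 8, 6, 4> for vscale = 1, and the same sequence simply continues for larger vscale.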

> cheers,
> --renato