[llvm-dev] [RFC] Supporting ARM's SVE in LLVM

Fri Nov 25 05:39:08 PST 2016

Hi Graham,

I'll look into the patches next, but first some questions after
reading the available white papers on the net.

On 24 November 2016 at 15:39, Graham Hunter <Graham.Hunter at arm.com> wrote:
> This complex constant represents the runtime value of `n` for any scalable type
> `<n x m x ty>`. This is primarily used to increment induction variables and
> generate offsets.

What do you mean by "complex constant"? Surely not Complex, but this
is not really a constant either.

From what I have read around (and this is why releasing the spec is
important, because I'm basing my reviews on guesswork), the
length of a vector is not constant, even on the same machine.

In theory, according to a post in the ARM forums (which now I forget),
the kernel could choose the vector length per process, meaning this is
not known even at link time.

But that's ok, because the SVE instructions (I'm guessing, again)
completely bypass the need for that "constant" to be constant at all,
i.e. through the use of `incw/incp`. Since a load can fail half-way
through, the amount by which you need to increment the induction
variable is not even known ahead of time at run time! Meaning, that's
not a constant at all!

Example: a[i] = b[ c[i] ];
  ld1w  z0.s, p0/z, [ c, i, lsl 2 ]
  ld1w  z1.s, p0/z, [ b, z0.s, sxtw 2 ]

Now, the z0.s load may have hit a seg fault somewhere, and it's up
to the FFR to tell brka/brkb how to deal with this.

Each iteration will have:
  * The same vector length *per process* for accessing c[]
  * A potentially *different* vector length, *per iteration*, for accessing b[]

So, while <n x m x i32> could be constant on some vectors, even at
compile time (if we have a flag that forces a certain length), it could
be unknown *per iteration* at run time.
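
A minimal scalar model of this concern, sketched in Python. It follows the
guess made above rather than the SVE specification: `VL_LANES`,
`gather_first_fault`, `copy_indirect` and the `resident` set are all
hypothetical names, and the model assumes that a fault on the first lane is
taken and serviced, while a fault on any later lane simply ends the load
early.

```python
VL_LANES = 4  # assumed lanes per vector; the real value is only known at run time


def gather_first_fault(b, idx, resident):
    """Gather b[i] for each index, stopping early at a non-resident element.

    Assumption: the first lane always completes (the fault is taken and the
    element becomes resident); any later lane that would fault is suppressed.
    """
    loaded = []
    for lane, i in enumerate(idx):
        if i not in resident:
            if lane != 0:
                break          # later lane: suppressed, report a short load
            resident.add(i)    # first lane: fault is serviced, then it loads
        loaded.append(b[i])
    return loaded


def copy_indirect(b, c, resident):
    """Model a[i] = b[c[i]], recording how far each iteration advanced."""
    a, steps = [], []
    i = 0
    while i < len(c):
        idx = c[i:i + VL_LANES]        # contiguous load of c[]: full width
        vals = gather_first_fault(b, idx, resident)
        a.extend(vals)
        steps.append(len(vals))        # per-iteration step, not a constant
        i += len(vals)
    return a, steps
```

With `b` of eight elements, `c = [0, 1, 5, 2, 3, 4, 6, 7]` and element 5
initially not resident, the loop in this model advances by 2, then 4, then 2
lanes, which is the per-iteration variation described above.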

> ```llvm
>   %index.next = add nuw nsw i64 %index, mul (i64 vscale, i64 4)
> ```

Right, this would be translated to:
  incw   x2

Now, the question is, why do we need "mul (i64 vscale, i64 4)" in the IR?

There is no semantic analysis you can do on a value that can change on
every iteration of the loop. You can't elide, hoist, combine or const
fold.

If I got it right (from random documents on the web), `incX` refers
to a family of "increment induction" instructions. `incw` is probably
"increment W", i.e. 32 bits, while `incp` is "increment predicate", i.e.
whatever the size of the predicate you use:

Examples:
  incw  x2          # increments x2 by 4*(FFR successful lanes)
  incp  x2, p0.b    # increments x2 by 1*(FFR successful lanes)

So, this IR semantics is valid for the first case, but irrelevant for
the second. Also, I'm worried that we'll end up ignoring the
multiplier altogether if we change the vector types (from byte to
word, for example), or that we'll make the process of doing so more
complex.

> The following shows the construction of a scalable vector of the form
> <start, start-2, start-4, ...>:
>
> ```llvm
>   %elt = insertelement <n x 4 x i32> undef, i32 %start, i32 0
>   %widestart = shufflevector <n x 4 x i32> %elt, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
>   %step = insertelement <n x 4 x i32> undef, i32 -2, i32 0
>   %widestep = shufflevector <n x 4 x i32> %step, <n x 4 x i32> undef, <n x 4 x i32> zeroinitializer
>   %stridevec = mul <n x 4 x i32> stepvector, %widestep
>   %finalvec = add <n x 4 x i32> %widestart, %stridevec
> ```

This is really fragile and confusing, and I agree with James: an
intrinsic here would be *much* better.

Something like

%const_vec = call <n x 4 x i32> @llvm.sve.constant_vector(i32 %start, i32 %step)
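
For reference, the lane values that either form is meant to produce can be
modelled in scalar code. This is only a sketch in Python: `series_vector` is
a hypothetical name, `vscale` is passed in explicitly because it is only
known at run time, and i32 wrap-around is not modelled.

```python
def series_vector(start, step, vscale, min_lanes=4):
    """Lane values of <start, start+step, start+2*step, ...> for <n x 4 x i32>."""
    lanes = vscale * min_lanes            # total lane count for this vscale
    stepvector = list(range(lanes))       # <0, 1, 2, ...>
    widestart = [start] * lanes           # splat of %start
    widestep = [step] * lanes             # splat of the step
    # Same two operations as the quoted IR: mul by the splatted step, then add.
    stridevec = [s * w for s, w in zip(stepvector, widestep)]
    return [a + b for a, b in zip(widestart, stridevec)]
```

For example, `series_vector(10, -2, vscale=2)` yields eight lanes counting
down from 10 in steps of 2, matching the `<start, start-2, start-4, ...>`
pattern in the quoted IR.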

cheers,
--renato