[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

Fri Jun 15 13:51:50 PDT 2018

Graham Hunter via llvm-dev <llvm-dev at lists.llvm.org> writes:

> To split a <scalable 2 x double> in half, you'd use a shufflevector in much the
> same way you would for fixed-length vector types.
>
> e.g.
> ```
> %sv = call <scalable 1 x i32> @llvm.experimental.vector.stepvector.nxv1i32()
> %halfvec = shufflevector <scalable 2 x double> %fullvec, <scalable 2 x double> undef, <scalable 1 x i32> %sv
> ```
>
> You can't split it any further than a <scalable 1 x <ty>>, since there may only be
> one element in the actual hardware vector at runtime. The same restriction applies to
> a <1 x <ty>>. This is why we have a minimum number of lanes in addition to the
> scalable flag so that we can concatenate and split vectors, since SVE registers have
> the same number of bytes and will therefore decrease the number of elements per
> register as the element type increases in size.

Right.  So let's say the hardware width is 1024.  If I have a
<scalable 2 x double> it is 1024 bits.  If I split it, it's a
<scalable 1 x double> (right?) with 512 bits.  There is no
way to create a 256-bit vector.
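
To make that arithmetic concrete, here is a small Python model (not LLVM IR and not any LLVM API). It assumes, as on SVE, that vscale counts 128-bit granules; the helper names are made up for illustration.

```python
# Toy model of scalable vector sizes. Assumption: vscale is the number of
# 128-bit granules in the hardware vector, as on SVE.
SVE_GRANULE_BITS = 128


def vscale_for(hardware_bits):
    """vscale implied by a given hardware vector width (hypothetical helper)."""
    return hardware_bits // SVE_GRANULE_BITS


def scalable_bits(min_elts, elt_bits, vscale):
    """Runtime size in bits of <scalable min_elts x elt>."""
    return min_elts * vscale * elt_bits


vscale = vscale_for(1024)
full_bits = scalable_bits(2, 64, vscale)  # <scalable 2 x double>
half_bits = scalable_bits(1, 64, vscale)  # <scalable 1 x double>
print(vscale, full_bits, half_bits)       # prints: 8 1024 512
```

Since the minimum lane count cannot drop below 1, 512 bits is the smallest double vector this model can express at that hardware width.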

It's probably the case that for pure VL-agnostic code, this is ok.  Our
experience with the X1/X2, which also supported VL-agnostic code, was
that at times compiler awareness of the hardware MAXVL allowed us to
generate better code, better enough that we "cheated" regularly.  The
hardware guys loved us.  :)

I'm not at all saying that's a good idea for SVE, just recounting
experience and wondering what the implications would be for SVE and more
generally, LLVM IR.  Would the RISC-V people be interested in a
non-VL-agnostic compilation mode?

> If you want to extract something other than the first part of a vector, you need to add
> offsets based on a calculation from vscale (e.g. adding vscale * (min_elts/2) allows you
> to reach the high half of a larger register).

Sure, that makes sense.
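
For my own understanding, a rough Python sketch of the offset calculation described above. It models vectors as plain lists; `stepvector`, `half_mask` and `shuffle` are illustrative stand-ins, not the real intrinsics or instructions.

```python
def stepvector(min_elts, vscale):
    """Model of a stepvector: 0, 1, 2, ... for min_elts * vscale lanes."""
    return list(range(min_elts * vscale))


def half_mask(src_min_elts, vscale, high):
    """Shuffle mask selecting the low or high half of the source vector."""
    half_min = src_min_elts // 2
    # The high half starts vscale * (min_elts / 2) lanes into the source.
    offset = vscale * half_min if high else 0
    return [i + offset for i in stepvector(half_min, vscale)]


def shuffle(vec, mask):
    """Single-source shuffle: lane i of the result is vec[mask[i]]."""
    return [vec[i] for i in mask]


vscale = 2                                 # assumed runtime value
full = list(range(100, 100 + 4 * vscale))  # models <scalable 4 x i32>
lo = shuffle(full, half_mask(4, vscale, high=False))
hi = shuffle(full, half_mask(4, vscale, high=True))
print(lo, hi)  # prints: [100, 101, 102, 103] [104, 105, 106, 107]
```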

> For floating point types, we do use predication to allow the use of otherwise illegal
> types like <scalable 1 x double>, but that's limited to the AArch64 backend and does
> not need to be represented in IR.

This is something done during or after isel?

>  The split question comes into play for backward compatibility. How
>  would one take a scalable vector and pass it into a NEON library? It is
>  likely that some math functions, for example, will not have SVE versions
>  available.
>
> I don't believe we intend to support this, but instead provide libraries with
> SVE versions of functions instead. The problem is that you don't know how
> many NEON-size subvectors exist within an SVE vector at compile time.
> While you could create a loop with 'vscale' number of iterations and try to
> extract those subvectors, I suspect the IR would end up being quite messy
> and potentially hard to recognize and optimize.

Yes, that was my concern.  The vscale loop is what I came up with as
well.  It is technically feasible, but ugly.  I'm a little concerned
about what vendors will do with this.  Not everyone is going to have the
resources to convert all of their NEON libraries, certainly not all at
once.

Just something to think about.
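
The vscale loop I have in mind looks roughly like the Python model below. It is a sketch only: it assumes 128-bit NEON vectors, models the scalable vector as a list, and `neon_sqrt` is a hypothetical stand-in for a fixed-width library routine.

```python
import math

NEON_BITS = 128  # fixed NEON vector width


def neon_sqrt(chunk):
    """Hypothetical stand-in for a NEON-only library routine."""
    return [math.sqrt(x) for x in chunk]


def call_neon_per_chunk(vec, elt_bits, neon_fn):
    """Apply a fixed-width routine to each NEON-size piece of a longer vector."""
    lanes = NEON_BITS // elt_bits
    out = []
    calls = 0
    # One iteration per 128-bit piece, i.e. vscale iterations in total.
    for start in range(0, len(vec), lanes):
        out.extend(neon_fn(vec[start:start + lanes]))
        calls += 1
    return out, calls


vscale = 4                                       # assumed runtime value
vec = [float(i * i) for i in range(2 * vscale)]  # models <scalable 2 x double>
result, calls = call_neon_per_chunk(vec, 64, neon_sqrt)
print(calls, result)
```

The loop trip count is only known at runtime, which is exactly what makes the IR form awkward to recognize and optimize.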

> The other problem with calling non-SVE functions is that any live SVE
> registers must be spilled to the stack and filled after the call, which is
> likely to be quite expensive.

Understood.

>  Is there a way to represent "double width" vectors? In mixed-data-size
>  loops it is sometimes convenient to reason about double-width vectors
>  rather than having to split them (to legalize for the target
>  architecture) and keep track of their parts early on. I guess the more
>  fundamental question is about how such loops should be handled.
>
> For SVE, it's fine to generate IR with types that are 'double size' or larger,
> and just leave it to legalization at SelectionDAG level to split into multiple
> legal size registers.

Ok, great.  If something is larger than "double size," how can it be
split, given the "split once" restriction above?

>  What do insertelement and extractelement mean for scalable vectors?
>  Your examples showed insertelement at index zero. How would I, say,
>  insertelement into the upper half of the vector? Or any arbitrary
>  place? Does insertelement at index 10 of a <scalable 2 x double> work,
>  assuming vscale is large enough? It is sometimes useful to constitute a
>  vector out of various scalar pieces and insertelement is a convenient
>  way to do it.
>
> So you can insert or extract any element known to exist (in other words, it's
> within the minimum number of elements). Using a constant index outside
> that range will fail, as we won't know whether the element actually exists
> until we're running on a cpu.

In that case to "insert" into the higher elements one would insert into
the known range and then shufflevector, I suppose.  Ok.
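
Something like the following Python model is what I am picturing. It is only my guess at the approach: vectors are lists, `shuffle2` mimics a two-source shuffle, and `None` stands in for undef lanes.

```python
def shuffle2(a, b, mask):
    """Two-source shuffle: indices 0..n-1 pick from a, n..2n-1 pick from b."""
    both = a + b
    return [both[i] for i in mask]


def insert_via_shuffle(vec, scalar, pos):
    """Place scalar at runtime lane pos using an index-0 insert plus a shuffle."""
    n = len(vec)
    if not 0 <= pos < n:
        raise IndexError("lane does not exist at this vscale")
    tmp = [None] * n  # models an undef vector
    tmp[0] = scalar   # insert at index 0, which is always a known lane
    # Identity mask, except lane pos takes lane 0 of the second source.
    mask = [n if i == pos else i for i in range(n)]
    return shuffle2(vec, tmp, mask)


vscale = 8                                 # assumed runtime value
vec = [float(i) for i in range(2 * vscale)]  # models <scalable 2 x double>
updated = insert_via_shuffle(vec, 99.0, 10)
print(updated[10])  # prints: 99.0
```

In real IR the mask would have to be computed from a stepvector and a compare, since the lane count is not a compile-time constant.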

> Our downstream compiler supports inserting and extracting arbitrary elements
> from calculated offsets as part of our experiment on search loop vectorization,
> but that generates the offsets based on a count of true bits within partitioned
> predicates. I was planning on proposing new intrinsics to improve predicate use
> within llvm at a later date.

Ok, I look forward to seeing them!

> We have been able to implement various types of known shuffles (like the high/low
> half extract, zip, concatenation, etc) with vscale, stepvector, and the existing IR
> instructions.

Yes, I can certainly see how all of those would be implemented.  The
main case I'm thinking about is something that is "scalarized" within a
vector loop context.  I'm wondering about the best way to reconstitute a
vector from scalar pieces (or vice-versa).

Thanks for the explanations!

                          -David