[llvm-dev] [RFC][SVE] Supporting SIMD instruction sets with variable vector lengths

David A. Greene via llvm-dev llvm-dev at lists.llvm.org
Mon Jul 30 19:53:14 PDT 2018


Renato Golin <renato.golin at linaro.org> writes:

> On Mon, 30 Jul 2018 at 20:57, David A. Greene via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> I'm not sure exactly how the SVE proposal would address this kind of
>> operation.
>
> SVE uses predication. The physical number of lanes doesn't have to
> change to have the same effect (alignment, tails).

Right.  My wording was poor.  The current proposal doesn't directly
support a target with a more dynamic vscale, but I believe it could be
extended to do so in a straightforward way.

>> I think it would be unlikely for anyone to need to change the vector
>> length during evaluation of an in-register expression.
>
> The worry here is not within each instruction but across instructions.
> SVE (and I think RISC-V) allow register size to be dynamically set.

I wasn't talking about within an instruction but rather across
instructions in the same expression tree.  Something like this would be
weird:

A = load with VL
B = load with VL
C = A + B           # VL implicit
VL = <something>
D = ~C              # VL implicit
store D

Here and beyond, read "VL" as "vscale with minimum element count 1."
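For concreteness, here's roughly how the A/B/C/D tree above would look
with the proposal's scalable types, minus the VL change that the
current proposal cannot express.  This is only a sketch, assuming a
<vscale x 4 x i32> spelling for "vscale x 4 lanes of i32"; the loads
and store are elided and ~C is written as an xor with an all-ones
splat:

  define <vscale x 4 x i32> @expr(<vscale x 4 x i32> %a,
                                  <vscale x 4 x i32> %b) {
    ; C = A + B -- the vector length is implicit in the type; nothing
    ; in the expression re-reads it between operations.
    %c = add <vscale x 4 x i32> %a, %b
    ; D = ~C, written as xor with a splat of -1
    %ones.ins = insertelement <vscale x 4 x i32> undef, i32 -1, i32 0
    %ones = shufflevector <vscale x 4 x i32> %ones.ins,
                          <vscale x 4 x i32> undef,
                          <vscale x 4 x i32> zeroinitializer
    %d = xor <vscale x 4 x i32> %c, %ones
    ret <vscale x 4 x i32> %d
  }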

The points where VL would be changed are limited and I think would
require limited, straightforward additions on top of this proposal.

> For example, on the same machine, it may be 256 for one process and
> 512 for another (for example, to save power).

Sure.

> But the change is via a system register, so in theory, anyone can
> write an inline asm in the beginning of a function and change the
> vector length to whatever they want.
>
> Worse still, people can do that inside loops, or in a tail loop,
> thinking it's a good idea (or this is a Cray machine :).
>
> AFAIK, the interface for changing the register length will not be
> exposed programmatically, so in theory, we should not worry about it.
> Any inline asm hack can be considered out of scope / user error.

That's right.  This proposal doesn't expose a way to change vscale, but
I don't think it precludes a later addition to do so.

> However, Hal's concern seems to be that, in the event of anyone
> planning to add it to their APIs, we need to make sure the proposed
> semantics can cope with it (do we need to update the predicates again?
> what will vscale mean, then and when?).

I don't see why predicate values would be affected at all.  If a machine
with a variable vector length also has predicates, then the operation
would typically execute under the bitwise AND of the explicit predicate
and a conceptual all-ones predicate of length VL.
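To illustrate (a sketch only; @vl.allones below is a hypothetical
helper standing in for that conceptual all-ones predicate of length VL,
not something this proposal defines):

  declare <vscale x 4 x i1> @vl.allones()

  define <vscale x 4 x i1> @effective.pred(<vscale x 4 x i1> %p) {
    ; The operation behaves as if run under %p AND'ed with the
    ; "first VL lanes enabled" predicate.
    %vl.mask = call <vscale x 4 x i1> @vl.allones()
    %eff = and <vscale x 4 x i1> %p, %vl.mask
    ret <vscale x 4 x i1> %eff
  }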

As I understand it, vscale is the runtime multiple of some minimal,
guaranteed vector length.  For SVE, that minimum is whatever element
count gives a bit width of 128.  My guess is that for a machine with a
more dynamic vector length, the minimum would be 1.  vscale would then
be the vector length itself and would change whenever the vector length
is changed.
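A quick worked example under those assumptions (the 512-bit width and
the VL of 37 are made up for illustration):

  ; <vscale x 4 x i32> holds vscale * 4 lanes of i32, i.e. vscale * 128 bits.
  ; SVE, 512-bit implementation:  vscale = 512 / 128 = 4  ->  16 lanes.
  ; Machine with a 1-element minimum and a current VL of 37:
  ;   vscale = 37, i.e. vscale simply *is* the vector length.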

Changing vscale would be no different than changing any other value in
the program.  The dataflow determines its possible values at various
program points.  vscale is an extra (implicit) operand to all vector
operations with scalable type.

> If not, we may have to enforce that this will not come to pass in its
> current form.

Why?  If a user does asm or some other such trick to change what vscale
means, that's on the user.  If a machine has a VL that changes
iteration-to-iteration, typically the compiler would be responsible for
controlling it.

If the vendor provides some target intrinsics to let the user write
low-level vector code that changes vscale in a high-level language, then
the vendor would be responsible for adding the necessary bits to the
frontend and LLVM.  I would not recommend a vendor try to do this.  :)
It wouldn't necessarily be hard to do, but it would be wasted work IMO
because it would be better to improve the vectorizer that already
exists.
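Purely as an illustration of what such a vendor addition might look
like (hypothetical; neither the intrinsic below nor its semantics are
part of this proposal):

  ; Hypothetical vendor intrinsic: request a new vector length and get
  ; back the length actually granted.  Every scalable value live across
  ; this call would have to be modeled as changing its lane count here.
  declare i64 @myvendor.setvl(i64)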

> In this case, changing it later will require *a lot* more effort than
> doing it now.

I don't see why.  Anyone adding the ability to change vscale would need
to add intrinsics and specify their semantics.  That shouldn't change
anything about this proposal, nor should such additions be hampered by
it.

Another way to think of vscale/vector length is as a different kind of
predicate.  Right now LLVM uses select to track predicate application.
It uses a "top-down" approach in that the root of an expression tree (a
select) applies the predicate and presumably everything under it
operates under that predicate.  It also uses intrinsics for certain
operations (loads, stores, etc.) that absolutely must be predicated no
matter what for safety reasons.  So it's sort of a hybrid approach, with
predicate application at the root, at certain leaves, and maybe even at
interior nodes (FP operations come to mind).
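To make that hybrid concrete, here's roughly how it looks today with
ordinary fixed-width vectors (a sketch; the exact name mangling of the
masked-load intrinsic depends on the LLVM version and pointer
representation):

  declare <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>*, i32,
                                                    <4 x i1>, <4 x i32>)

  define <4 x i32> @hybrid(<4 x i32>* %pa, <4 x i32> %b, <4 x i1> %m) {
    ; Leaf: the load goes through an intrinsic because faulting lanes
    ; must be suppressed no matter what.
    %a = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %pa,
             i32 4, <4 x i1> %m, <4 x i32> undef)
    ; Interior: the add itself carries no predicate...
    %sum = add <4 x i32> %a, %b
    ; Root: ...and the predicate is applied afterwards with a select.
    %r = select <4 x i1> %m, <4 x i32> %sum, <4 x i32> %b
    ret <4 x i32> %r
  }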

To my knowledge, there's nothing in LLVM that checks that these
predicate applications are consistent with one another.  Someone could
do a load with predicate 0011 and then a divide under a select with
predicate 1111, likely resulting in a runtime fault, but nothing in LLVM
would assert on the predicate mismatch.
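Reusing the masked-load declaration from the sketch above, that
mismatch might look like this.  The zero passthru makes the hazard
concrete: the masked-off lanes come back as zero, and the full-width
divide then divides by zero in those lanes.

  define <4 x i32> @mismatch(<4 x i32>* %p, <4 x i32> %x) {
    ; Predicate 0011 at the leaf...
    %q = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %p,
             i32 4, <4 x i1> <i1 false, i1 false, i1 true, i1 true>,
             <4 x i32> zeroinitializer)
    ; ...predicate 1111 at the root.  The IR verifier has nothing to
    ; say about the inconsistency.
    %div = sdiv <4 x i32> %x, %q
    %r = select <4 x i1> <i1 true, i1 true, i1 true, i1 true>,
                <4 x i32> %div, <4 x i32> %x
    ret <4 x i32> %r
  }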

Predicates could also be applied only at the leaves and propagated up
the tree.  IIRC, Dan Gohman proposed something like this years back when
the topic of predication came up.  He called it "applymask" but
unfortunately the Google is failing to find it.  
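Sketching the leaf-applied flavor with what IR offers today (this is
not Dan's actual applymask proposal, just the general shape of the
idea):

  define <4 x i32> @leaf.pred(<4 x i32> %a, <4 x i32> %b, <4 x i1> %m) {
    ; The mask is applied where values enter the tree...
    %a.m = select <4 x i1> %m, <4 x i32> %a, <4 x i32> zeroinitializer
    %b.m = select <4 x i1> %m, <4 x i32> %b, <4 x i32> zeroinitializer
    ; ...and an applymask-style scheme would make interior operations
    ; inherit that mask as part of the semantics rather than by
    ; convention.
    %sum = add <4 x i32> %a.m, %b.m
    ret <4 x i32> %sum
  }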

I *could* imagine using select to also convey application of vector
length but that seems odd and unnecessarily complex.

If vector length were applied at the leaves, it would take a bit of work
to get it through instruction selection.  Target opcodes would be one
way to do it.  I think it would be straightforward to walk the DAG and
change generic opcodes to target opcodes when necessary.

I don't think we should worry about taking IR with dynamic changes to VL
and trying to generate good code for any random target from it.  Such IR
is very clearly tied to a specific kind of target and we shouldn't
bother pretending otherwise.  The vectorizer should be aware of the
target's capabilities and generate code accordingly.

                        -David

