[llvm-dev] [RFC] Extending shufflevector for vscale vectors (SVE etc.)

Wed Jan 29 16:48:02 PST 2020

Currently, for scalable vectors, only splat shuffles are allowed; we're considering allowing more kinds of shuffles.  The issue is, essentially, that a shuffle mask is a simple list of integers, and that isn't enough to express a scalable operation.  For example, concatenating two fixed-length vectors currently looks like this:

shufflevector <2 x i32> %v1, <2 x i32> %v2, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

(Note that despite the syntax, the mask is really just a list of constant integers, not a general constant expression.)

There isn't any obvious way to extend this to scalable vectors. The mask would need to have "vscale * N" elements, and it's impossible to write out all the elements of a list of unknown size; it would need to be generated with some sort of formula.

The motivation here is to express various common patterns that are useful for vectorization and lowering SVE intrinsics, in a target-independent manner:

1. Unzipping/unpacking interleaved data
2. Zipping/packing interleaved data
3. Reversing a vector
4. Concatenating two vectors, or splitting a vector into halves

This isn't a comprehensive list of all shuffles we might want, but it's a reasonable starting point.  The way the proposal is written, it should be easy to extend if we add more shuffles.

Proposed IR syntax:

%result = shufflevector <vscale x 4 x i32> %v1, <vscale x 4 x i32> %v2, SHUFFLE_NAME

SHUFFLE_NAME can be one of the following (with the equivalent fixed-length mask for <4 x i32> operands):
splat - Splat element 0 of the first operand (<0, 0, 0, 0>)
reverse - Reverse the elements of the first operand (<3, 2, 1, 0>)
concat - Concatenate the two operands (<0, 1, 2, 3, 4, 5, 6, 7>)
split_low - Return the low half of the first operand (<0, 1>)
split_high - Return the high half of the first operand (<2, 3>)
zip_low - Zip together the low halves of the two operands (<0, 4, 1, 5>)
zip_high - Zip together the high halves of the two operands (<2, 6, 3, 7>)
unzip_even - Unzip the even elements of the two operands (<0, 2, 4, 6>)
unzip_odd - Unzip the odd elements of the two operands (<1, 3, 5, 7>)
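
To make the list above concrete, here is a minimal C++ sketch that expands each named shuffle into an explicit mask for a given operand element count N. The names ShuffleKind and expandMask are hypothetical and not part of the proposal; the point is only that each name stands for a formula in N, which is what lets the same name describe a shuffle of "vscale * N" elements.

```cpp
#include <cassert>
#include <vector>

// Hypothetical names: the RFC only says the list would become an enum.
enum class ShuffleKind {
  Splat, Reverse, Concat, SplitLow, SplitHigh,
  ZipLow, ZipHigh, UnzipEven, UnzipOdd
};

// Expand a named shuffle into an explicit mask, given the number of
// elements N in each operand. Indices 0..N-1 select from the first
// operand and N..2N-1 from the second, as in a fixed-length mask.
std::vector<int> expandMask(ShuffleKind Kind, int N) {
  assert(N > 0 && N % 2 == 0 && "sketch assumes an even element count");
  std::vector<int> Mask;
  switch (Kind) {
  case ShuffleKind::Splat:
    for (int I = 0; I < N; ++I) Mask.push_back(0);
    break;
  case ShuffleKind::Reverse:
    for (int I = 0; I < N; ++I) Mask.push_back(N - 1 - I);
    break;
  case ShuffleKind::Concat:
    for (int I = 0; I < 2 * N; ++I) Mask.push_back(I);
    break;
  case ShuffleKind::SplitLow:
    for (int I = 0; I < N / 2; ++I) Mask.push_back(I);
    break;
  case ShuffleKind::SplitHigh:
    for (int I = N / 2; I < N; ++I) Mask.push_back(I);
    break;
  case ShuffleKind::ZipLow:
    for (int I = 0; I < N / 2; ++I) {
      Mask.push_back(I);
      Mask.push_back(N + I);
    }
    break;
  case ShuffleKind::ZipHigh:
    for (int I = N / 2; I < N; ++I) {
      Mask.push_back(I);
      Mask.push_back(N + I);
    }
    break;
  case ShuffleKind::UnzipEven:
    for (int I = 0; I < N; ++I) Mask.push_back(2 * I);
    break;
  case ShuffleKind::UnzipOdd:
    for (int I = 0; I < N; ++I) Mask.push_back(2 * I + 1);
    break;
  }
  return Mask;
}
```

Calling expandMask with N = 4 reproduces the masks in the list; a larger N models the same shuffle at a larger vscale.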

On SVE targets, all of these shuffles can be lowered to a single instruction for vector types which fit in a single register.  I expect that other scalable vector instruction sets will also support these operations, since they're important for many use cases.  (If we end up in some weird situation where it's necessary, we could expand a shuffle into an extractelement/insertelement loop, similar to what we do for unsupported fixed-length shuffles. But the vectorizer would probably avoid generating unsupported shuffles, so it's unlikely to come up in practice.)
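
As a rough illustration of the fallback mentioned in the parenthetical, the following C++ sketch models a shuffle as one element read and one element write per mask entry. It is a scalar reference model rather than actual lowering code; the function name shuffleByElements is hypothetical, and for a scalable vector the trip count would be a runtime multiple of vscale instead of a container size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for a shuffle: each iteration corresponds to one
// "extractelement" from the selected operand followed by one
// "insertelement" into the result. Mask indices 0..N-1 read V1 and
// N..2N-1 read V2, where N is the operand length.
std::vector<int> shuffleByElements(const std::vector<int> &V1,
                                   const std::vector<int> &V2,
                                   const std::vector<int> &Mask) {
  assert(V1.size() == V2.size() && "operands must have the same length");
  const std::size_t N = V1.size();
  std::vector<int> Result(Mask.size());
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    assert(Mask[I] >= 0 && static_cast<std::size_t>(Mask[I]) < 2 * N &&
           "mask index out of range");
    const std::size_t Src = static_cast<std::size_t>(Mask[I]);
    Result[I] = Src < N ? V1[Src] : V2[Src - N];
  }
  return Result;
}
```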

In C++, I expect to represent this list as an enum, and then wrap up the entire description of a fixed or scalable shuffle into a class "ShuffleMask".  This class would allow checking whether a shuffle matches one of the above patterns, and, for fixed shuffles, could be converted to the existing ArrayRef<int>.  ShuffleVectorInst::getShuffleMask would then return a ShuffleMask, I think. Then we could add an alternate API getFixedShuffleMask() that only works for fixed shuffles, and just returns the fixed mask as an ArrayRef<int>.
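
One possible shape for such a class, as a standalone sketch only: nothing here is existing LLVM API. ShuffleMask and getFixedShuffleMask are the names used above; NamedShuffle, isFixed and isReverse are invented for illustration, and std::vector<int> stands in for ArrayRef<int> to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical enum; only a few of the proposed kinds are listed.
enum class NamedShuffle { Splat, Reverse, Concat };

class ShuffleMask {
  bool IsNamed;
  NamedShuffle Kind;       // meaningful only when IsNamed
  std::vector<int> Fixed;  // meaningful only when !IsNamed

public:
  // Scalable shuffle, described by name.
  explicit ShuffleMask(NamedShuffle K) : IsNamed(true), Kind(K) {}
  // Fixed shuffle, described by an explicit list of indices.
  explicit ShuffleMask(std::vector<int> M)
      : IsNamed(false), Kind(NamedShuffle::Splat), Fixed(std::move(M)) {}

  bool isFixed() const { return !IsNamed; }

  // Only valid for fixed shuffles, mirroring the alternate API above.
  const std::vector<int> &getFixedShuffleMask() const {
    assert(isFixed() && "named shuffles have no explicit mask");
    return Fixed;
  }

  // A pattern query that works for both representations. NumElts is the
  // operand element count, needed only to check an explicit mask.
  bool isReverse(int NumElts) const {
    if (IsNamed)
      return Kind == NamedShuffle::Reverse;
    if (Fixed.size() != static_cast<std::size_t>(NumElts))
      return false;
    for (int I = 0; I < NumElts; ++I)
      if (Fixed[I] != NumElts - 1 - I)
        return false;
    return true;
  }
};
```

With this shape, code that only asks pattern questions such as isReverse would not need to care whether the shuffle is fixed or scalable, while code that needs the explicit index list has to go through getFixedShuffleMask.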

I'm working on refactoring the existing shufflevector handling in LLVM to allow making these changes.  See https://reviews.llvm.org/D72467 .  I haven't tried implementing this proposal yet, though.

Alternatives:

Instead of extending shufflevector, we could introduce a dedicated intrinsic for each common shuffle.  This is less readable, and makes it harder to leverage existing code that reasons about shuffles.  But it would mean fewer changes to existing code.

We could make shufflevector take a variable shuffle mask operand (any Value*).  This has been proposed before.  This introduces a lot of complexity: the general case is hard to lower on most targets, and it becomes harder to pattern-match the special cases we actually care about.  (It's possible we want to expose this as a target-independent intrinsic at some point, but that wouldn't really serve as a substitute for what's described here.)

We could come up with some way to represent a formula for generating a shuffle mask, instead of only allowing specific, known shuffles.  I'm not sure how to do that in a way that's both straightforward to reason about and covers all the cases we care about.  Along the same lines, we could specify a shuffle mask as a tree of operations: for example, "zip(first_vector, reverse(second_vector))".
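
To give a feel for the tree-of-operations idea, here is a small C++ sketch in which each node is a function from the operand element count N to a list of source indices. All names (MaskExpr, firstVector, secondVector, reverseOf, zipOf) are hypothetical, and zipOf is assumed to interleave its two inputs in full, since the example expression does not say whether a low or high zip is meant.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Indices = std::vector<int>;
// A node in the tree: given the operand element count N, produce the
// source indices it selects (0..N-1 = first operand, N..2N-1 = second).
using MaskExpr = std::function<Indices(int)>;

MaskExpr firstVector() {
  return [](int N) -> Indices {
    Indices R;
    for (int I = 0; I < N; ++I) R.push_back(I);
    return R;
  };
}

MaskExpr secondVector() {
  return [](int N) -> Indices {
    Indices R;
    for (int I = 0; I < N; ++I) R.push_back(N + I);
    return R;
  };
}

MaskExpr reverseOf(MaskExpr E) {
  return [E](int N) -> Indices {
    Indices V = E(N);
    return Indices(V.rbegin(), V.rend());
  };
}

// Assumption: a full interleave of both inputs (result is twice as long).
MaskExpr zipOf(MaskExpr A, MaskExpr B) {
  return [A, B](int N) -> Indices {
    Indices L = A(N), R = B(N), Out;
    assert(L.size() == R.size() && "zip inputs must have equal length");
    for (std::size_t I = 0; I < L.size(); ++I) {
      Out.push_back(L[I]);
      Out.push_back(R[I]);
    }
    return Out;
  };
}
```

The expression zipOf(firstVector(), reverseOf(secondVector())) is independent of N; evaluating it at N = 4 gives an explicit 8-element mask. The difficulty noted above remains: matching such a tree against what a target can lower in a single instruction is not obviously simpler than matching a fixed list of names.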

We can add more shuffles to the list.  There are a few SVE shuffle instructions which are not equivalent to any of the basic operations I've listed: ext and trn. And it's possible to construct other shuffles out of sequences of multiple instructions.  (Some of the additional shuffles we could specify would require an integer parameter, or an integer parameter list; it should be possible to support that without any major changes to the proposal.)
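
For reference, the fixed-length masks corresponding to ext and trn can be sketched in C++ as below. This is an element-granular model based on my understanding of the instructions, not a statement of how they would be specified in IR: the function names are hypothetical, and the real SVE EXT immediate is, as far as I know, expressed in bytes rather than elements. The Start argument of extMask is an example of the integer parameter mentioned above.

```cpp
#include <cassert>
#include <vector>

// EXT-like mask: N consecutive elements of the concatenation of the two
// operands, starting at element Start (the integer parameter).
std::vector<int> extMask(int N, int Start) {
  assert(N > 0 && Start >= 0 && Start <= N && "start must lie in [0, N]");
  std::vector<int> Mask;
  for (int I = 0; I < N; ++I)
    Mask.push_back(Start + I);
  return Mask;
}

// TRN-like mask: pair up the even (Odd == false) or odd (Odd == true)
// elements of the two operands, as in trn1 and trn2 respectively.
std::vector<int> trnMask(int N, bool Odd) {
  assert(N > 0 && N % 2 == 0 && "sketch assumes an even element count");
  const int Offset = Odd ? 1 : 0;
  std::vector<int> Mask;
  for (int I = 0; I < N; I += 2) {
    Mask.push_back(I + Offset);      // element from the first operand
    Mask.push_back(N + I + Offset);  // same position in the second operand
  }
  return Mask;
}
```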

-Eli