[llvm-dev] [RFC] Extending shufflevector for vscale vectors (SVE etc.)

Sun Feb 2 11:39:42 PST 2020

On Sun, 2 Feb 2020 at 08:57, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
> > For example, I wouldn't want to let the vectoriser generate any random
> > pattern in the shuffle if I know that there is no valid instruction in
> > the back-end that can cope with that, and I'll end up with
> > under-performing code.
>
> How is any of this different from non-vscale shufflevector?

This specific point is not. It's a consequence of getting both
scalable and fixed shuffles wrong.

My argument is that getting scalable shuffles wrong is harder to
recover from than getting fixed-size ones wrong.

Fixed vectors will have a number of insert/extract element operations
that is known at compile time, while scalable vectors will have to add a
runtime stub or equivalent.
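
To make the difference concrete, here is a rough model in Python. It is
not LLVM code and the function names are made up for illustration. It
assumes the fallback for an unsupported shuffle is one extract/insert
pair per lane, which can be fully unrolled when the lane count is a
constant but needs a loop when it is only known at run time.

```python
# Illustrative model only, not LLVM code; all names here are hypothetical.

def expand_fixed_shuffle(mask):
    """Fixed-width case: len(mask) is a compile-time constant, so the
    fallback is a straight-line list of extract/insert pairs."""
    ops = []
    for dst, src in enumerate(mask):
        ops.append(("extractelement", src))
        ops.append(("insertelement", dst))
    return ops


def run_scalable_shuffle(vec, index_fn):
    """Scalable case: the lane count is only known at run time, so the
    fallback has to loop over the lanes instead of being unrolled."""
    n = len(vec)  # stands in for vscale * minimum element count
    return [vec[index_fn(i, n)] for i in range(n)]


# A 4-lane reverse, first as an unrolled op list, then as a runtime loop.
fixed_ops = expand_fixed_shuffle([3, 2, 1, 0])
reversed_lanes = run_scalable_shuffle([10, 20, 30, 40], lambda i, n: n - 1 - i)
```

The fixed version produces a list whose length is known up front, while
the scalable version works for any lane count but only by iterating.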

> It feels to me that if you're not willing to do a natural extension of
> what's in shufflevector already, then going with an intrinsic for the
> time being is the wiser choice.

Using a simple mask is not trivial in scalable vectors because you
don't know the number of elements. What do we do if the mask is smaller
or larger than the actual register, and not a multiple of it, etc.?
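
A small Python sketch of those cases, assuming a hypothetical literal
mask of 8 entries applied to something like <vscale x 4 x i32>. The
helper name and the category strings are mine, purely to show that the
same mask lands in a different situation for each runtime vscale.

```python
# Illustrative only; classify_mask and its labels are hypothetical.

def classify_mask(mask_len, runtime_lanes):
    """Describe how a fixed-length mask relates to a runtime lane count."""
    if mask_len == runtime_lanes:
        return "exact fit"
    if mask_len > runtime_lanes:
        return "mask larger than register"
    if runtime_lanes % mask_len == 0:
        return "register is a whole multiple of the mask"
    return "register is not a multiple of the mask"


MIN_LANES = 4  # minimum element count, as in <vscale x 4 x i32>
MASK_LEN = 8   # assumed length of a literal mask

# Map each runtime vscale to the situation the mask ends up in.
cases = {vscale: classify_mask(MASK_LEN, vscale * MIN_LANES)
         for vscale in (1, 2, 3, 4)}
```

Each of those outcomes would need defined semantics before a plain
mask could be accepted on a scalable type.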

Expressions are easier to check at compile time, because if they are
valid for all n in (0..N), then they are valid for a subset, whatever
the chunk size, as long as it is a multiple. But what is a valid
expression?
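
As a sketch of what "checking an expression" could look like, here is a
brute-force version in Python. It assumes an index expression is a
function of the lane number i and the lane count n, and that, as with
today's shufflevector, a valid index selects from the concatenation of
the two inputs, i.e. lies in [0, 2n). Testing a finite set of lane
counts is not a proof; a real check would have to reason symbolically.
All names are hypothetical.

```python
# Illustrative only; not an LLVM API.

def indices_valid(index_fn, lane_counts):
    """Return True if index_fn(i, n) is in [0, 2n) for every lane i of
    every tested lane count n."""
    for n in lane_counts:
        for i in range(n):
            if not 0 <= index_fn(i, n) < 2 * n:
                return False
    return True


MIN_LANES = 4
lane_counts = [MIN_LANES * vscale for vscale in range(1, 17)]

def reverse(i, n):
    return n - 1 - i      # last lane first

def even_elements(i, n):
    return 2 * i          # every other element of the concatenation

def off_by_one(i, n):
    return n + 1 + i      # reaches index 2n at the last lane

results = {f.__name__: indices_valid(f, lane_counts)
           for f in (reverse, even_elements, off_by_one)}
```

The first two expressions stay in range for every tested lane count,
while the third one steps out of range at the last lane.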

For example, can we add calls to that expression if the function can
be known at compile time? Do we really need to? If not *any*
expression, how do we restrict the set of valid operations, and where
does that code go?

None of those questions are too hard to answer, but I fear we can
spend more time discussing the semantics of the expression and what's
allowed in there than the actual implementation.

If we really *have* to, then we have to. But if the set of shuffles
proposed is sufficient for all scalable extensions in existence for
the foreseeable future, then it should be fine like that.

Having said that, I don't see anything wrong with implementing this
with intrinsics for now, if people feel there are some cases that we
cannot cover using a small list of shuffles.