[PATCH] D94444: [RFC][Scalable] Add scalable shuffle intrinsic to extract evens from a pair of vectors

Thu Jan 14 08:10:44 PST 2021

cameron.mcinally added a comment.

In D94444#2497697 <https://reviews.llvm.org/D94444#2497697>, @paulwalker-arm wrote:

> A bit of a flyby review as I'm still on holidays but to my mind many of the restrictions being proposed for the new intrinsic seem purely down to the design decision of splitting the input vector across two operands.  I understand this is how the underlying instructions work for SVE but that does not seem like a good enough reason to compromise the IR.
>
> So my first questions are whether the IR and ISD interfaces need to match and from an IR point of view what is the expected usage?

My main IR use case is Complex vectorization. The vector Complex lowerings require vectors of just the reals and/or imags for the intermediate steps.

And also the trivial case of a stride 2 loop.

> Is having two input operands going to result in the common case of having to "split" the result of a large load?  I ask because I recall this being how InterleavedAccess worked with LLVM (i.e. one big load with a set of shuffles to extract the lanes).

Yeah, I could see where a large load would need to be split. That doesn't seem like too much of a headache though. We're going to do the two loads either way.

The two-operand intrinsics are my preferred choice because when we vectorize loops, we want to keep full vectors. We don't want to run the loop twice as many times on half-full vectors, or pay the vector concatenation cost in the loop. This also maps pretty well to SVE: we either do an LD2 if the operands come from memory and throw away one result, or a UZP if they're in registers. I'm not sure how this would map to RISCV.
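
To make the two-operand semantics concrete, here is a rough Python model (purely illustrative; the function names are hypothetical and not part of the patch). It assumes the intrinsic behaves like UZP1/UZP2 on the concatenation of its two equal-length operands, so the result is a full vector:

```python
def extract_evens(a, b):
    """Model of a two-operand 'extract evens' shuffle (UZP1-like).

    Hypothetical helper: concatenates a and b, keeps the even-indexed lanes.
    The result has as many lanes as each input, i.e. a full vector.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return (a + b)[0::2]


def extract_odds(a, b):
    """Companion model (UZP2-like): keeps the odd-indexed lanes."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return (a + b)[1::2]


# Interleaved complex data: re0, im0, re1, im1, ... spread over two vectors.
lo = [1.0, 10.0, 2.0, 20.0]
hi = [3.0, 30.0, 4.0, 40.0]
reals = extract_evens(lo, hi)  # full vector of real parts
imags = extract_odds(lo, hi)   # full vector of imaginary parts
```

This is the Complex use case from above: two interleaved input vectors give one full vector of reals and one full vector of imags.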

If we have one-operand intrinsics, we'd need two UZPs for the lo and hi halves, and then a splice. I suppose ISel could combine those two patterns into a two-operand UZP though. Unless someone has a better lowering?

The two-operand intrinsic could also be extended to accept one undef operand. So there is some flexibility there to get the same result as a one-operand intrinsic.
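
One reading of that lowering, sketched in Python (illustrative only; the helper names are made up, and this assumes the single operand is twice the legal width and gets split into lo and hi halves during legalization):

```python
def uzp1(a, b):
    """UZP1-like model: even-indexed lanes of the concatenation a ++ b."""
    return (a + b)[0::2]


def splice_low_halves(x, y):
    """Hypothetical splice model: low half of x followed by low half of y."""
    half = len(x) // 2
    return x[:half] + y[:half]


def one_operand_evens(wide):
    """Evens of a double-width vector via two UZPs plus a splice."""
    n = len(wide) // 2
    lo, hi = wide[:n], wide[n:]
    # The second UZP operand is a don't-care here; only the low half
    # of each intermediate result is used by the splice.
    lo_evens = uzp1(lo, lo)
    hi_evens = uzp1(hi, hi)
    return splice_low_halves(lo_evens, hi_evens)


wide = list(range(8))
via_one_operand = one_operand_evens(wide)
via_two_operand = uzp1(wide[:4], wide[4:])  # the combined form ISel could pick
```

Under these assumptions both routes produce the same lanes, which is why folding the pattern into a single two-operand UZP looks plausible.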

> My second question is what are the code generation advantages of having multiple operands against the negatives.  We know type legalisation is a negative but I'm guessing the advantage is it allows a simpler mapping to the underlying SVE instructions.  The question is whether this is worth the cost.
>
> By only having a single input vector I believe the current proposed type restrictions disappear as widening becomes quite easy.  The downside is that some of this type legalisation becomes more complex but this feels worth it if that means fewer compromises.  From an SVE point of view it seems pretty easy to rely on common type legalisation until you get to the point where the input vector is twice the size of the legal type at which point we custom lower to the relevant AArch64 specific node, which mirrors how we handle things like ZERO_EXTEND today.

I don't have a strong sense for what the trade-offs are. Maybe you can elaborate once you're back from vacation.

> My final question relates to future usages and how the intrinsic's idiom scales.  Taking the above InterleavedAccess example, there is a requirement to have a stride other than two, for example pixel data will want three or four.  One route is to add an intrinsic for each option but I'm wondering if there's any appetite for a single generic intrinsic of the form:
>
> <A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)
>
> Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".
>
> If it helps we could also initially restrict the range of stride as this is something that can be easily changed with improved code generation abilities.  By this I mean with your current patch we can restrict it to being <=2 and still have distinct ISD nodes for these supported variants if that results in the better implementation.

I like this idea a lot. Essentially a step vector shuffle. You could even roll splats into it with a 0 stride. Implementing it sounds pretty challenging though. Especially for an index >=2. Maybe I'm missing an easy solution, but that sounds like a lot of work to generalize.
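
For reference, a small Python model of how I read the proposed semantics (a sketch only; `extract_elements` and its `result_len` parameter are hypothetical, and the stride 0 splat case is my extension, not part of the quoted proposal):

```python
def extract_elements(invec, index, stride, result_len=None):
    """Model of the proposed generic extract: lane i is invec[index + i*stride].

    For stride > 0 this follows the quoted constraint 0 <= index < stride.
    For stride == 0 (a splat, an assumed extension) the caller must give
    result_len, since it can no longer be derived from the input length.
    """
    if stride < 0:
        raise ValueError("stride must be non-negative")
    if stride == 0:
        if result_len is None:
            raise ValueError("a splat needs an explicit result length")
        if not 0 <= index < len(invec):
            raise ValueError("index out of range")
        return [invec[index]] * result_len
    if not 0 <= index < stride:
        raise ValueError("index must satisfy 0 <= index < stride")
    return invec[index::stride]


pixels = list(range(12))                      # e.g. r0 g0 b0 r1 g1 b1 ...
green = extract_elements(pixels, 1, 3)        # every third lane, starting at 1
evens = extract_elements(pixels, 0, 2)        # the stride 2 case in this patch
splat = extract_elements(pixels, 5, 0, 4)     # assumed stride 0 splat
```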

Having said that, I wonder if we should revisit the idea of allowing shuffle vectors to accept step vector masks?
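
Roughly what I have in mind, modeled in Python (illustrative only; `step_vector` and `shuffle` are hypothetical helpers, not existing LLVM APIs): the mask is a start value plus a step times the lane number, and the shuffle indexes into the concatenation of its two inputs.

```python
def step_vector(n, start=0, step=1):
    """Mask of n lanes: start, start + step, start + 2*step, ..."""
    return [start + step * i for i in range(n)]


def shuffle(a, b, mask):
    """shufflevector-like model: each mask entry selects a lane of a ++ b."""
    src = a + b
    return [src[m] for m in mask]


a = [0, 1, 2, 3]
b = [4, 5, 6, 7]
evens = shuffle(a, b, step_vector(4, start=0, step=2))
odds = shuffle(a, b, step_vector(4, start=1, step=2))
splat = shuffle(a, b, step_vector(4, start=2, step=0))
```

With a step of 2 this gives the even and odd extracts, and a step of 0 degenerates to a splat of the starting lane.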

https://reviews.llvm.org/D94444