[PATCH] D94444: [RFC][Scalable] Add scalable shuffle intrinsic to extract evens from a pair of vectors

Thu Jan 14 03:36:10 PST 2021

paulwalker-arm added a comment.

A bit of a flyby review as I'm still on holidays but to my mind many of the restrictions being proposed for the new intrinsic seem purely down to the design decision of splitting the input vector across two operands.  I understand this is how the underlying instructions work for SVE but that does not seem like a good enough reason to compromise the IR.

So my first questions are whether the IR and ISD interfaces need to match and, from an IR point of view, what is the expected usage? Is having two input operands going to result in the common case of having to "split" the result of a large load?  I ask because I recall this being how InterleavedAccess worked with LLVM (i.e. one big load with a set of shuffles to extract the lanes).

My second question is what are the code generation advantages of having multiple operands, weighed against the negatives?  We know type legalisation is a negative but I'm guessing the advantage is it allows a simpler mapping to the underlying SVE instructions.  The question is whether this is worth the cost.

By only having a single input vector I believe the current proposed type restrictions disappear as widening becomes quite easy.  The downside is that some of this type legalisation becomes more complex but this feels worth it if that means fewer compromises.  From an SVE point of view it seems pretty easy to rely on common type legalisation until you get to the point where the input vector is twice the size of the legal type, at which point we custom lower to the relevant AArch64 specific node, which mirrors how we handle things like ZERO_EXTEND today.
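
To make the lowering argument concrete, here is a rough Python model (not LLVM code) of why the single-input form should lose nothing at the point where the input is twice the legal width.  All helper names are hypothetical, vectors are modelled as plain lists, and it assumes the two-operand operation takes the even lanes of the first operand followed by the even lanes of the second, with each half holding an even number of lanes.

```python
def extract_evens_pair(lo, hi):
    """Two-operand model: even lanes of lo, then even lanes of hi."""
    return lo[0::2] + hi[0::2]


def extract_evens_wide(vec):
    """Single-operand model: even lanes of one wide vector."""
    return vec[0::2]


def split_halves(vec):
    """Split a wide vector into two halves, as legalisation might."""
    half = len(vec) // 2
    return vec[:half], vec[half:]


# A wide vector twice the (assumed) legal width of 8 lanes.
wide = list(range(16))
lo, hi = split_halves(wide)

# Under the stated assumptions both forms select the same lanes.
assert extract_evens_pair(lo, hi) == extract_evens_wide(wide)
```

If each half had an odd number of lanes the two forms would diverge, which is not a concern for the power-of-two lane counts this model assumes.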

My final question relates to future usages and how the intrinsic's idiom scales.  Taking the above InterleavedAccess example, there is a requirement to have a stride other than two, for example pixel data will want three or four.  One route is to add an intrinsic for each option but I'm wondering if there's any appetite for a single generic intrinsic of the form:

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".
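
For illustration only, the semantics I have in mind can be sketched with a small Python model.  This is not an implementation proposal: the function name simply mirrors the intrinsic above, vectors are modelled as lists, and the requirement that the input length be a multiple of the stride (so that A == B / stride) is my assumption rather than something the signature states.

```python
def extract_elements(invec, index, stride):
    """Model of the proposed intrinsic: result[i] = invec[index + i * stride]."""
    if stride <= 0:
        raise ValueError("stride must be > 0")
    if not 0 <= index < stride:
        raise ValueError("index must satisfy 0 <= index < stride")
    # Assumption: B is a multiple of stride, so the result has B / stride lanes.
    if len(invec) % stride != 0:
        raise ValueError("input length must be a multiple of stride")
    return invec[index::stride]


# Stride 2 covers the evens/odds case from the current patch.
lanes = list(range(8))
evens = extract_elements(lanes, 0, 2)
odds = extract_elements(lanes, 1, 2)

# Stride 3 covers interleaved pixel data such as RGB.
pixels = ["r0", "g0", "b0", "r1", "g1", "b1", "r2", "g2", "b2", "r3", "g3", "b3"]
reds = extract_elements(pixels, 0, 3)
```

Here evens and odds together cover every input lane exactly once, and the same holds for the three stride-3 extractions of the pixel data.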

If it helps we could also initially restrict the range of stride, as this is something that can easily be relaxed as code generation improves.  By this I mean with your current patch we can restrict it to being <=2 and still have distinct ISD nodes for these supported variants if that results in a better implementation.

https://reviews.llvm.org/D94444