[PATCH] D94444: [RFC][Scalable] Add scalable shuffle intrinsic to extract evens from a pair of vectors

Sun Jan 24 04:08:12 PST 2021

paulwalker-arm added a comment.

For now I'll just cover the IR side of things as the ISD node discussion raises different points and there's nothing to say they need to match.

If you take your code snippet (although I changed the loop trip count to 1024 to allow vectorisation) and look at the IR emitted by LoopVectorize, you'll see what I was referring to in my previous comment.  You end up with the following snippet within vector.body:

  %wide.vec = load <4 x double>, <4 x double>* %11, align 8, !tbaa !6
  %wide.vec23 = load <4 x double>, <4 x double>* %13, align 8, !tbaa !6
  %strided.vec = shufflevector <4 x double> %wide.vec, <4 x double> poison, <2 x i32> <i32 0, i32 2>
  %strided.vec24 = shufflevector <4 x double> %wide.vec23, <4 x double> poison, <2 x i32> <i32 0, i32 2>

So today the loop is vectorised using vectors that are as full as possible; in this case the loop was also unrolled, hence the pair of loads and shuffles.  Here LoopVectorize simply creates a double-length load and a matching shuffle to extract the even lanes. If the loop operated on the imaginary parts then there would also be a shuffle to extract the odd lanes. There's no concatenation or splicing involved and the "large" load is trivial to code generate.  For AArch64 there is also the InterleavedAccess pass that knows how to convert this logic to an `aarch64.ld2` intrinsic call.  This is something we'll want for SVE as well, although with the shufflevector replaced by an intrinsic it'll be simpler for SVE to detect, as InterleavedAccess is a tad complicated.
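
As a rough illustration of what those shuffles compute, here is a small C++ model of a single-input shufflevector applied to one wide load. The helper name `shuffleLanes` and the sample values are assumptions made for this sketch; it is not LLVM code and only mirrors the lane selection, not the code generation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an LLVM API): models a single-input shufflevector.
// Result lane i is taken from wide[mask[i]].
std::vector<double> shuffleLanes(const std::vector<double> &wide,
                                 const std::vector<int> &mask) {
  std::vector<double> result;
  result.reserve(mask.size());
  for (int idx : mask) {
    // Every mask index must select a lane of the single wide input.
    assert(idx >= 0 && static_cast<std::size_t>(idx) < wide.size());
    result.push_back(wide[static_cast<std::size_t>(idx)]);
  }
  return result;
}
```

With a wide vector laid out as {re0, im0, re1, im1}, the mask {0, 2} yields the real parts, matching the `<i32 0, i32 2>` mask above, and the mask {1, 3} would yield the imaginary parts.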

This is why I believe at the IR level we should have an intrinsic that mirrors this type of shuffle, and thus one that takes a single vector and extracts elements based on a simple pattern (e.g. odd or even).  Doing so means it'll be a drop-in replacement for the existing shufflevector usage, which is the goal. Note that if complex were changed to a three-element structure, LoopVectorize would do the expected thing: create a triple-wide load plus shuffles that extract every third element starting at index 0, 1, or 2, depending on the field in question.
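
To make the "simple pattern" concrete, the sketch below generates the lane indices for a strided extract. The function `strideMask` and its parameters (`start`, `stride`, `vf`) are hypothetical names chosen for this illustration, not a proposed intrinsic signature.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: builds the lane indices for a strided extract.
// Lane i of the result selects element start + i * stride of the wide vector.
std::vector<int> strideMask(unsigned start, unsigned stride, unsigned vf) {
  // start picks the field within each group, so it must be below the stride.
  assert(stride > 0 && start < stride);
  std::vector<int> mask;
  mask.reserve(vf);
  for (unsigned i = 0; i < vf; ++i)
    mask.push_back(static_cast<int>(start + i * stride));
  return mask;
}
```

Under these assumptions, `strideMask(0, 2, 2)` gives {0, 2}, the even-lane mask from the IR above, while a three-field structure with two result lanes would use {0, 3}, {1, 4} and {2, 5} for its three fields.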

https://reviews.llvm.org/D94444