[PATCH] D141924: [IR] Add new intrinsics interleave and deinterleave vectors

Mon Jan 30 09:09:52 PST 2023

sdesmalen added a comment.

In D141924#4075634 <https://reviews.llvm.org/D141924#4075634>, @paulwalker-arm wrote:

> What about implementing total shuffles and breaking the dependence on vector types having any meaning, with this encoded as a discrete immediate operand instead. For example:
>
>   Ty A = @llvm.experimental.vector.interleave.Ty(Vec, shape)
>   Ty B = @llvm.experimental.vector.deinterleave.Ty(Vec, shape)
>
> Here the intrinsics are simply vector in vector out with all input lanes existing in the output just at a different location (this is what I mean by a total shuffle).  If only part of the result is important to the caller then they'll just extract the part they need. Here `shape` essentially refers to the number of subvectors that are logically contained within Vec(interleave) or B(deinterleave) and for this initial implementation we'd restrict support to just the value `2`.  The main usage rule is `Ty.getKnownMinElementCount()` must be devisable by `shape`.
>
> Do you see any issues here? My thinking is that it becomes trivial to see how we'd support other strides (i.e. we'd just extend the verifier to allow shape=new-stride).

Interleaving requires two input vectors, so in order to do an interleave operation with the single-operand/result intrinsic, two input vectors will need to be concatenated (in IR) using llvm.vector.insert operations. There may be issues when these operations are hoisted to a different block, in the sense that the EXTRACT_SUBVECTOR nodes inserted as part of lowering to the ISD nodes are not folded away and result in actual code.

If we implement a 'total shuffle' with multiple operands/results (resulting in both lo/hi for interleave and even/odd for deinterleave), then we avoid having to insert artificial vector.insert/extract operations and it also becomes more clear which part of the calculation is used. Individual results are stored in separate virtual registers which I suspect may make it easier for the compiler to know what parts to deadcode, especially when the 'concat' of the results would otherwise lead to actual instructions (e.g. for SVE concat(<vscale x 2 x i32>,<vscale x 2 x i32>) -> <vscale x 4 x i32>).

Do you see any specific advantages to having a single vector operand/result?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141924/new/

https://reviews.llvm.org/D141924