[PATCH] D113376: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2

Thu Dec 9 08:09:09 PST 2021

paulwalker-arm added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:19267-19268
+  // like zip/uzp/trn/rev from having some wrong or undefined behavior.
+  if (Subtarget->getMinSVEVectorSizeInBits() ==
+      Subtarget->getMaxSVEVectorSizeInBits()) {
+    if (SDValue PermOp =
----------------
wwei wrote:
> paulwalker-arm wrote:
> > I still don't think this is enough as you've missed my previous `I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register` comment.  Further discussion is attached to the test zip_v32i8 below...
> > 
> > I'm starting to wonder if it's worth breaking out lowerShuffleToZIP_UZP_TRN as I feel like once support is added for more of the combinations there's not likely to be much reuse.  For example taking the zip1/zip2 case I think supporting zip1 is likely simple and just works for all legal VTs, but zip2 requires the VT to be exactly the size of the target vector (i.e. `VT.getSizeInBits() == Subtarget->getMinSVEVectorSizeInBits() == Subtarget->getMaxSVEVectorSizeInBits()`) or perhaps a different instruction sequence.
> Thanks, I agree with you. Perhaps the current implementation of `lowerShuffleToZIP_UZP_TRN` reusing neon code is not so good, since SVE may have variable length.
> Maybe we should identify and classify different shuffle patterns. For zip/uzp/trn/rev these four shuffle types, we can roughly divide them into two categories:
> **zip1/uzp1/uzp2/trn1/trn2/revb/revh/revw** （it' simple，and works for all legal VTs）
> **zip2/rev**  （complex，and need consider high half or whole register）
> I think we can implement these two scenarios in steps.
Looking at the six shuffles trn1, trn2, uzp1, uzp2, zip1, zip2 I believe that when min_length != max_length != VT.getSizeInBits() only trn1, trn2 & zip1 are valid.  We agree that zip2 is not valid because the top half of the input fixed length vector does not necessarily map to the top half of a scalable vector.  The reason I believe both uzp variants are also invalid is because their underlying operation is to concat both input vectors and then extract every second element.  However, the vectors we're concatenating may contain junk so you end up with:

```
expected uzp1 [A | B | C | D], [E | F | G | H] => [A | C | E | G]

SVE_VLS uzp [A | B | C | D | ? | ? | ? | ?], [E | F | G | H | ? | ? | ?] => [A | C | ? | ? | E | G | ? | ?]
```

And thus when you extract the fixed length result you'll end up with `[A | C | ? | ?]`

This I think further reduces the value of having `lowerShuffleToZIP_UZP_TRN` as it would be clearer to just handle each of the scenarios separately.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D113376/new/

https://reviews.llvm.org/D113376