[PATCH] D152205: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.

Wed Aug 23 03:15:01 PDT 2023

sdesmalen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25671-25672
+static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,
+                                         unsigned MinSVESize,
+                                         unsigned MaxSVESize,
+                                         ArrayRef<int> ShuffleMask,
----------------
nit: You don't need to pass MinSVESize and MaxSVESize, you can easily get this again from the Subtarget.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25722
+      TBLMask.push_back(DAG.getConstant(Val - NumElts, DL, MVT::i64));
+    std::swap(Op1, Op2);
+  } else {
----------------
This swap code seems to be unnecessary, because SelectionDAGBuilder already does this for you.

You can see that if you compile:
```
define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
  %op0 = load <16 x i8>, ptr %ptr0
  %op1 = load <16 x i8>, ptr %ptr1
  ; Takes only elements from second vector
  %res = shufflevector <16 x i8> %op0, <16 x i8> %op1, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 31, i32 31>
  ret <16 x i8> %res
}
```
Which results in:
```
  t0: ch,glue = EntryToken
        t4: i64,ch = CopyFromReg t0, Register:i64 %1
      t8: v16i8,ch = load<(load (s128) from %ir.ptr1)> t0, t4, undef:i64
    t10: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,15> t8, undef:v16i8
  t12: ch,glue = CopyToReg t0, Register:v16i8 $q0, t10
  t13: ch = AArch64ISD::RET_GLUE t12, Register:v16i8 $q0, t12:1
```

That means you don't need to modify the Shufflemask at all, and you can call `ShuffleVectorInst:: isSingleSourceMask(ShuffleMask)` to calculate `IsSingleOp`. If `isSingleSourceMask` returns `true`, then you can assume it will select elements from the first source operand. That will simplify this patch quite a bit!

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25868
+      // of the shufflevector, thus we are rejecting this transform.
+      (Subtarget->hasSVE2() && IndexLen < (MaxOffset / 2) &&
+       MinSVESize == MaxSVESize)) {
----------------
dtemirbulatov wrote:
> sdesmalen wrote:
> > This condition and the comment above make little sense to me. The issue to solve is instead about the maximum *index* value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector.
> > 
> > For example:
> > 
> > ```   
> > define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) {
> >   %op0 = load <256 x i8>, ptr %ptr0
> >   %op1 = load <256 x i8>, ptr %ptr1
> >   ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
> >   %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0,i32 1,i32 2,i32 3,i32 4, i32 5, i32 6, i32 7, i32 256,i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263>
> >   ret <16 x i8> %res
> > }
> > ```
> > 
> > However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices.
> > ```
> > .LCPI0_0:
> >         .byte   0                               // 0x0
> >         .byte   1                               // 0x1
> >         .byte   2                               // 0x2
> >         .byte   3                               // 0x3
> >         .byte   4                               // 0x4
> >         .byte   5                               // 0x5
> >         .byte   6                               // 0x6
> >         .byte   7                               // 0x7
> >         .byte   16                              // 0x10
> >         .byte   33                              // 0x21
> >         .byte   34                              // 0x22
> >         .byte   35                              // 0x23
> >         .byte   36                              // 0x24
> >         .byte   37                              // 0x25
> >         .byte   38                              // 0x26
> >         .byte   39                              // 0x27
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .byte   255                             // 0xff
> >         .text
> >         .globl  foo
> >         .p2align        2
> >         .type   foo, at function
> > 
> > foo:                                    // @foo
> >         .cfi_startproc
> > // %bb.0:
> >         adrp    x8, .LCPI0_0
> >         add     x8, x8, :lo12:.LCPI0_0
> >         ptrue   p0.b
> >         ldr     q1, [x1]
> >         ldr     q0, [x0]
> >         ld1b    { z2.b }, p0/z, [x8]
> >         tbl     z0.b, { z0.b, z1.b }, z2.b
> >                                         // kill: def $q0 killed $q0 killed $z0
> >         ret
> > ```
> I think this is correct, here is full DAG dump:
> SelectionDAG has 16 nodes:
>   t0: ch,glue = EntryToken
>     t2: i64,ch = CopyFromReg t0, Register:i64 %0
>   t18: v16i8,ch = load<(load (s128) from %ir.ptr0, align 256)> t0, t2, undef:i64
>     t4: i64,ch = CopyFromReg t0, Register:i64 %1
>   t16: v16i8,ch = load<(load (s128) from %ir.ptr1, align 256)> t0, t4, undef:i64
>     t11: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23> t18, t16
>   t13: ch,glue = CopyToReg t0, Register:v16i8 $q0, t11
>   t22: nxv16i8 = insert_subvector undef:nxv16i8, t18, Constant:i64<0>
>   t23: nxv16i8 = insert_subvector undef:nxv16i8, t16, Constant:i64<0>
>   t14: ch = AArch64ISD::RET_GLUE t13, Register:v16i8 $q0, t13:1
> 
It seems SelectionDAGBuilder has extracted a subregister from the two <256 x i8> vectors, hence why the offsets are valid. This clearly wasn't the right example. The following case can't be transformed like that:

```define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
  %op0 = load <256 x i8>, ptr %ptr0
  %op1 = load <256 x i8>, ptr %ptr1
  ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
  %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 252, i32 253, i32 254, i32 255, i32 256, i32 257, i32 258, i32 259, i32 508, i32 509, i32 510, i32 511>
  ret <16 x i8> %res
}
```

It's only because SelectionDAG inserts the mask into a `<256 x i8>` vector that your code happens to discard this case (because that will fail the check `MaxOffset <= NumElts * 2`). I think I'd still prefer an explicit check that the maximum index value in the ShuffleMask doesn't exceed the maximum value of the element type.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152205/new/

https://reviews.llvm.org/D152205