[PATCH] D152205: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.

Tue Aug 22 15:35:13 PDT 2023

dtemirbulatov marked an inline comment as done.
dtemirbulatov added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25868
+      // of the shufflevector, thus we are rejecting this transform.
+      (Subtarget->hasSVE2() && IndexLen < (MaxOffset / 2) &&
+       MinSVESize == MaxSVESize)) {
----------------
sdesmalen wrote:
> This condition and the comment above make little sense to me. The issue to solve is instead about the maximum *index* value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector.
> 
> For example:
> 
> ```   
> define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) {
>   %op0 = load <256 x i8>, ptr %ptr0
>   %op1 = load <256 x i8>, ptr %ptr1
>   ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
>   %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0,i32 1,i32 2,i32 3,i32 4, i32 5, i32 6, i32 7, i32 256,i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263>
>   ret <16 x i8> %res
> }
> ```
> 
> However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices.
> ```
> .LCPI0_0:
>         .byte   0                               // 0x0
>         .byte   1                               // 0x1
>         .byte   2                               // 0x2
>         .byte   3                               // 0x3
>         .byte   4                               // 0x4
>         .byte   5                               // 0x5
>         .byte   6                               // 0x6
>         .byte   7                               // 0x7
>         .byte   16                              // 0x10
>         .byte   33                              // 0x21
>         .byte   34                              // 0x22
>         .byte   35                              // 0x23
>         .byte   36                              // 0x24
>         .byte   37                              // 0x25
>         .byte   38                              // 0x26
>         .byte   39                              // 0x27
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .byte   255                             // 0xff
>         .text
>         .globl  foo
>         .p2align        2
>         .type   foo, at function
> 
> foo:                                    // @foo
>         .cfi_startproc
> // %bb.0:
>         adrp    x8, .LCPI0_0
>         add     x8, x8, :lo12:.LCPI0_0
>         ptrue   p0.b
>         ldr     q1, [x1]
>         ldr     q0, [x0]
>         ld1b    { z2.b }, p0/z, [x8]
>         tbl     z0.b, { z0.b, z1.b }, z2.b
>                                         // kill: def $q0 killed $q0 killed $z0
>         ret
> ```
I think this is correct, here is full DAG dump:
SelectionDAG has 16 nodes:
  t0: ch,glue = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t18: v16i8,ch = load<(load (s128) from %ir.ptr0, align 256)> t0, t2, undef:i64
    t4: i64,ch = CopyFromReg t0, Register:i64 %1
  t16: v16i8,ch = load<(load (s128) from %ir.ptr1, align 256)> t0, t4, undef:i64
    t11: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23> t18, t16
  t13: ch,glue = CopyToReg t0, Register:v16i8 $q0, t11
  t22: nxv16i8 = insert_subvector undef:nxv16i8, t18, Constant:i64<0>
  t23: nxv16i8 = insert_subvector undef:nxv16i8, t16, Constant:i64<0>
  t14: ch = AArch64ISD::RET_GLUE t13, Register:v16i8 $q0, t13:1

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25869
+      (Subtarget->hasSVE2() && IndexLen < (MaxOffset / 2) &&
+       MinSVESize == MaxSVESize)) {
+
----------------
sdesmalen wrote:
> Please check this condition earlier up for an early bail out! (at same place where it checks MinSVESize)
No, Most of those checks related to support TBL2 both opperand used and at the place where MinSVESize check we don't know content of the mask, thus could not tell that both operand would be used or not.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152205/new/

https://reviews.llvm.org/D152205