[PATCH] D152205: [AArch64][SVE][SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.

Sander de Smalen via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 21 08:27:19 PDT 2023


sdesmalen added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25834
 
-  return SDValue();
+  // Avoid producing TBL and TBL2 instructions if we don't know the
+  // SVE register size or the minimum size differs from the maximum size.
----------------
sdesmalen wrote:
> It probably makes sense to move this functionality into its own function.
Comment seems unaddressed.
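
A minimal sketch of the suggested extraction (the helper name, parameters, and placement are illustrative assumptions, not taken from the patch):

```
// Hypothetical helper; the TBL/TBL2 fixed-length lowering logic would
// move out of the shuffle-lowering code into a function like this.
static SDValue GenerateFixedLengthSVETBL(SDValue Op, ArrayRef<int> ShuffleMask,
                                         SelectionDAG &DAG,
                                         const AArch64Subtarget *Subtarget) {
  // ... TBL/TBL2 lowering moved here; return SDValue() when the
  // transform does not apply so the caller can fall through ...
  return SDValue();
}
```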


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25857
+    Op1IsUndefOrZero = true;
+    std::swap(Op1, Op2);
+  }
----------------
dtemirbulatov wrote:
> sdesmalen wrote:
> > It's probably easier to update the ShuffleMask here. That simplifies the logic below as it avoids having to constantly distinguish between the `Op*IsUndefOrZero` cases.
> hmm, ShuffleMask is readonly variable.
You can create a new ShuffleMask variable, which is writable.
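
A minimal sketch, assuming `ShuffleMask` is the `ArrayRef<int>` taken from the shuffle node:

```
// Copy the read-only mask into a writable buffer so it can be normalised
// in place instead of threading Op*IsUndefOrZero flags through the code.
SmallVector<int, 16> UpdatedMask(ShuffleMask.begin(), ShuffleMask.end());
```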


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25846
+  // a scalable register.
+  if (NumElts > IndexLen || (MaskSize != NumElts && (2 * NumElts > IndexLen)) ||
+      isZerosVector(Op1.getNode()) || isZerosVector(Op2.getNode()))
----------------
I think you can make this into an assert that NumElts <= IndexLen, because at the point of lowering the operation, the types will have been legalised in such a way that the number of lanes of the mask >= number of lanes of the source operand.
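
A sketch of the suggested assertion (the message text is illustrative):

```
assert(NumElts <= IndexLen &&
       "Expected the mask to have at least as many lanes as the operands");
```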


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25847
+  if (NumElts > IndexLen || (MaskSize != NumElts && (2 * NumElts > IndexLen)) ||
+      isZerosVector(Op1.getNode()) || isZerosVector(Op2.getNode()))
+    return SDValue();
----------------
There are two things not right about this:
1. The **contents** of `Op1` and `Op2` shouldn't matter for the lowering of the shuffle, so there is no reason to bail out here. (If you remove this, does any test fail?)
2. `isZerosVector` will never return true here: `Op1` and `Op2` are `INSERT_SUBVECTOR` nodes (they are produced by `convertToScalableVector()`), and `INSERT_SUBVECTOR` is not considered in `isZerosVector`.
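
To illustrate point 2 (a sketch; the variable names are assumptions): `convertToScalableVector()` wraps the fixed-length value in an `INSERT_SUBVECTOR`, which `isZerosVector` does not look through:

```
// Roughly what convertToScalableVector() produces for a zero operand:
//   (INSERT_SUBVECTOR (undef:ContainerVT), (fixed-length zeros), 0)
SDValue Wrapped = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, ContainerVT,
                              DAG.getUNDEF(ContainerVT), FixedZeros,
                              DAG.getVectorIdxConstant(0, DL));
// isZerosVector(Wrapped.getNode()) returns false: the zeros are hidden
// behind the INSERT_SUBVECTOR node.
```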


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25859
+    Op1IsUnused = true;
+    std::swap(Op1, Op2);
+  }
----------------
What is the meaning of 'Op1IsUnused' after swapping Op1 and Op2? Should that be Op2IsUnused now?

To me it seems better to update the ShuffleMask directly and remove the need for `Op*IsUnused` altogether, as sketched below.
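
A minimal sketch of that normalisation for the undef case, assuming the writable `UpdatedMask` copy suggested above (the zero case would instead need an index that TBL zeroes, so this is only illustrative):

```
// Fold the operand swap into the mask itself so no flag is needed later.
if (Op1IsUndef) {
  std::swap(Op1, Op2);
  for (int &Idx : UpdatedMask)
    Idx = Idx >= (int)NumElts ? Idx - (int)NumElts // old Op2 lane
                              : -1;                // old undef lane
}
```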


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25864
+  // could be represented.
+  if (Op2IsUnused || Op1IsUnused ||
+      // For 8-bit elements and 1024-bit SVE registers and MaxOffset equals
----------------
It's normally better to bail out early, e.g.

  if (condition) {
    // lots of code
  }

  return SDValue();

->

  if (!condition)
    return SDValue();
  // lots of code


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25868
+      // of the shufflevector, thus we are rejecting this transform.
+      (Subtarget->hasSVE2() && IndexLen < (MaxOffset / 2) &&
+       MinSVESize == MaxSVESize)) {
----------------
This condition and the comment above make little sense to me. The issue to solve is instead about the maximum *index* value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector.

For example:

```   
define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) {
  %op0 = load <256 x i8>, ptr %ptr0
  %op1 = load <256 x i8>, ptr %ptr1
  ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
  %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 256, i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263>
  ret <16 x i8> %res
}
```

However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices.
```
.LCPI0_0:
        .byte   0                               // 0x0
        .byte   1                               // 0x1
        .byte   2                               // 0x2
        .byte   3                               // 0x3
        .byte   4                               // 0x4
        .byte   5                               // 0x5
        .byte   6                               // 0x6
        .byte   7                               // 0x7
        .byte   16                              // 0x10
        .byte   33                              // 0x21
        .byte   34                              // 0x22
        .byte   35                              // 0x23
        .byte   36                              // 0x24
        .byte   37                              // 0x25
        .byte   38                              // 0x26
        .byte   39                              // 0x27
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .text
        .globl  foo
        .p2align        2
        .type   foo,@function

foo:                                    // @foo
        .cfi_startproc
// %bb.0:
        adrp    x8, .LCPI0_0
        add     x8, x8, :lo12:.LCPI0_0
        ptrue   p0.b
        ldr     q1, [x1]
        ldr     q0, [x0]
        ld1b    { z2.b }, p0/z, [x8]
        tbl     z0.b, { z0.b, z1.b }, z2.b
                                        // kill: def $q0 killed $q0 killed $z0
        ret
```
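
A hedged sketch of the missing guard (placement and names are assumptions): compute the largest index used by the mask and reject the transform when it cannot be encoded in a single element of the TBL index vector.

```
// For i8 elements the TBL indices are bytes, so e.g. index 256 above
// cannot be encoded and the shuffle must not be lowered this way.
int MaxIdx = *std::max_element(ShuffleMask.begin(), ShuffleMask.end());
if (!isUIntN(VT.getScalarSizeInBits(), (uint64_t)MaxIdx))
  return SDValue();
```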


================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25869
+      (Subtarget->hasSVE2() && IndexLen < (MaxOffset / 2) &&
+       MinSVESize == MaxSVESize)) {
+
----------------
Please check this condition earlier for an early bail-out (at the same place where it checks MinSVESize)!
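
A minimal sketch of hoisting it (the condition shape is an assumption about the surrounding code, since without SVE2 a single-operand TBL may still apply):

```
// Hedged sketch: once only the TBL2 path remains, bail out up front
// rather than after the lowering work has been done.
if (!Subtarget->hasSVE2() || MinSVESize != MaxSVESize ||
    IndexLen >= (MaxOffset / 2))
  return SDValue();
```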


================
Comment at: llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll:605
+
+define <8 x i8> @shuffle_index_size_unacceptable_op_both(ptr %a, ptr %b) #3 {
+; CHECK-LABEL: shuffle_index_size_unacceptable_op_both:
----------------
Why doesn't this function use the TBL instruction? I don't see anything that stops it from using TBL.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152205/new/

https://reviews.llvm.org/D152205


