[PATCH] D152205: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.

Mon Aug 14 07:02:18 PDT 2023

sdesmalen added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25834

-  return SDValue();
+  // Avoid producing TBL and TBL2 instructions if we don't know
+  // SVE register size or minimal is unequal to maximum size.
----------------
It probably makes sense to move this functionality into its own function.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25847
+
+  bool Op1IsUndefOrZero = false;
+  bool Op2IsUndefOrZero = false;
----------------
You seem to have removed checks to ask if Op1 is `poison`/`undef` or `zeroinitializer`.

The way you've written the code now, `Op1IsUndefOrZero` should be renamed instead to `Op1IsUnused`. (same for Op2IsUnused)

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25854
+  if (llvm::all_of(ShuffleMask, [&NumElts](unsigned Val) {
+        return (Val >= NumElts && (Val < (NumElts * 2)));
+      })) {
----------------
Is the condition `&& (Val < NumElts * 2)` needed? This seems already covered by the `if (NumElts > IndexLen)` check above on line 25844.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25857
+    Op1IsUndefOrZero = true;
+    std::swap(Op1, Op2);
+  }
----------------
It's probably easier to update the ShuffleMask here. That simplifies the logic below as it avoids having to constantly distinguish between the `Op*IsUndefOrZero` cases.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25861
+  // Currently maximum offset number that could be lowered is 255.
+  if ((!Op2IsUndefOrZero && !Op1IsUndefOrZero && IndexLen > 128) ||
+      IndexLen > 256)
----------------
Hardcoding a check for '128' doesn't seem right to me. It all depends on what can be represented in the element type of the instruction, i.e. for i8's the max value is 255. That means when targeting `vscale_range(16, 16)`, then 255 is a valid index into the first source vector.

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25862
+  if ((!Op2IsUndefOrZero && !Op1IsUndefOrZero && IndexLen > 128) ||
+      IndexLen > 256)
+    return SDValue();
----------------
It doesn't seem right to look at the //number of indices// in the mask. I think rather it should be about the maximum index //value// in the mask?

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25879
+    else if (Op2IsUndefOrZero && Offset >= NumElts)
+      Offset = 255;
+    else
----------------
While not wrong, I don't really like '255' because it is a hard-coded value. For i8 elements, it may well be a valid index. For i16 or larger elements, there's no need to hard-code 255. Maybe just use '(1 << elementSize) - 1'?

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25889
+  // size with out of index elements.
+  for (unsigned i = 0; i < FillElements; ++i)
+    TBLMask.push_back(DAG.getConstant(255, DL, MVT::i64));
----------------
For a shufflevector that selects `IndexLen` elements from Op1/Op2, this code seems to be filling the //first// `VL - IndexLen` elements with some initializer (255 in this case). It seems odd to me that this loop starts at 0 and I doubt this has the desired effect.    

I'm not sure to what extend it matters what value we use for 'undef' lanes.  Can we just initialize the TBLMask with all numeric_limits<type>::max index value?

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25904-25906
+  } else {
+    if (!Subtarget->hasSVE2())
+      return SDValue();
----------------
Please write this as:

  } else if (Subtarget->hasSVE2()) {
    ...
  } else
    return SDValue();

================
Comment at: llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:25752-25753
+  }
+  // It is still better to fill TBL mask to the actual hardware supported
+  // size with out of index elements.
+  for (unsigned i = 0; i < FillElements; ++i)
----------------
dtemirbulatov wrote:
> sdesmalen wrote:
> > I'm curious, why is that better?
> > 
> > Also, as my comment above suggests, for `<256 x i8>` vectors, `255` is just the last element of the first data-vector. There is also some test-coverage missing here.
> I think that if we leave uninitialized remaining out of scope indexes then they will end up with 0 value and means extra work since 0 element almost always means 0 lane and with 255 value probably hardware could understand this is out of scope value, except <256 x i8> case which a it rarely than valid 0 index. The last available value for index here is 255 then it flips to 0.
While it may be rare, it must be correct for all cases. If the LLVM IR instruction requests an element from a `zeroinitializer` vector is selected, then it must make sure to do so. That means that 255 may not always be the correct value. Maybe it should just return `SDValue()` for that case, or explicitly materialize a `zeroinitializer` vector to select the value from.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D152205/new/

https://reviews.llvm.org/D152205