[PATCH] D138203: [AArch64][InstCombine] Simplify repeated complex patterns in dupqlane

Sander de Smalen via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jan 10 07:31:19 PST 2023


sdesmalen added a comment.

I've left a bunch more comments, but overall this patch is moving in the right direction @MattDevereau!



================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1483
+    auto Idx = cast<ConstantInt>(InsertElt->getOperand(2));
+    InsertEltVec[Idx->getValue().getZExtValue()] = InsertElt->getOperand(1);
+    CurrentInsertElt = InsertElt->getOperand(0);
----------------
MattDevereau wrote:
> MattDevereau wrote:
> > sdesmalen wrote:
> > > What happens if there is an insert into the same lane?
> > > 
> > > e.g. 
> > > 
> > >   %1 = insertelement <8 x half> %0, half %x, i64 1
> > >   %2 = insertelement <8 x half> %1, half %y, i64 1
> > The latest insert is used (%y in this case) and the transform works OK; I'll add a test for this.
> I've added the test `dupq_f16_abcd_pattern_double_insert` to `sve-intrinsic-dupqlane.ll` which asserts this behaviour.
I still think that the code is not correct yet, because for e.g.

  ...
  %7 = insertelement <8 x half> %6, half %c, i64 6
  %8 = insertelement <8 x half> %7, half %c, i64 7
  %9 = insertelement <8 x half> %8, half %d, i64 7
  %10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)

It starts at %9, so it will first store %d to InsertEltVec[7].
Then it looks at %8 and overwrites InsertEltVec[7] with %c.

Which means the order is reversed.

The reason that your test still passes is that InstCombine has already removed the extra `insertelement` when it comes to this function.

Perhaps you can just return `false` if the element already exists in InsertEltVec?
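For illustration, here is a minimal standalone sketch of that bail-out (plain STL stand-ins, not the actual LLVM code: `fillLanes` is a hypothetical name, lanes hold ints with 0 modelling an empty slot, and the chain is fed in as (lane, value) pairs in the reverse program order the backwards walk produces):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal stand-in for the lane-filling loop: the insertelement chain is
// walked from the last insert backwards, so Inserts arrives in reverse
// program order as (lane, value id) pairs; 0 models an empty lane.
// Bailing out when a lane is already populated rejects chains that insert
// into the same lane twice, which the backwards walk would otherwise
// record in the wrong (reversed) order.
static bool fillLanes(const std::vector<std::pair<std::size_t, int>> &Inserts,
                      std::vector<int> &Lanes) {
  for (const auto &Insert : Inserts) {
    std::size_t Idx = Insert.first;
    if (Idx >= Lanes.size() || Lanes[Idx] != 0)
      return false; // lane already defined: refuse to overwrite
    Lanes[Idx] = Insert.second;
  }
  return true;
}
```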


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1443-1448
+  for (auto Lhs = Vec.begin(), Rhs = Vec.begin() + HalfVecSize;
+       Rhs != Vec.end(); Lhs++, Rhs++) {
+    if (*Lhs != nullptr && *Rhs != nullptr && *Lhs == *Rhs)
+      continue;
+    return false;
+  }
----------------
nit: please capitalise Lhs -> LHS and Rhs -> RHS (that's more common in LLVM for these variable names).

And feel free to ignore this suggestion, but I personally find this a little easier to read:

  ArrayRef<Value *> Ref(Vec);
  auto LHS = Ref.take_front(HalfVecSize);
  auto RHS = Ref.drop_front(HalfVecSize);
  for (unsigned I = 0; I < HalfVecSize; ++I) {
    if (LHS[I] != nullptr && RHS[I] != nullptr && LHS[I] == RHS[I])
      continue;
    return false;
  }

(i.e. using ArrayRef and indexing, rather than two changing pointers)
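A hypothetical standalone sketch of the same indexed comparison (raw pointers into a `std::vector<int>` standing in for `ArrayRef<Value *>`, with 0 modelling an undefined/nullptr lane; `halvesMatch` is an invented name):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The vector is a repetition iff its two halves are element-wise equal
// and every compared lane is defined (non-zero here, non-null in the IR).
static bool halvesMatch(const std::vector<int> &Vec) {
  std::size_t HalfVecSize = Vec.size() / 2;
  const int *LHS = Vec.data();               // like Ref.take_front(HalfVecSize)
  const int *RHS = Vec.data() + HalfVecSize; // like Ref.drop_front(HalfVecSize)
  for (std::size_t I = 0; I < HalfVecSize; ++I) {
    if (LHS[I] != 0 && RHS[I] != 0 && LHS[I] == RHS[I])
      continue;
    return false;
  }
  return true;
}
```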


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1451-1452
+  Vec.resize(HalfVecSize);
+  if (Vec.size() == 1)
+    return true;
+  SimplifyValuePattern(Vec);
----------------
Can this condition be moved to the start of this function?


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1470
+  // Insert the scalars into a container ordered by InsertElement index
+  SmallVector<Value *> InsertEltVec(IIScalableTy->getMinNumElements(), nullptr);
+  while (InsertElementInst *InsertElt =
----------------
nit: is it worth renaming this to `Elts`? That's a bit simpler.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1471
+  SmallVector<Value *> InsertEltVec(IIScalableTy->getMinNumElements(), nullptr);
+  while (InsertElementInst *InsertElt =
+             dyn_cast<InsertElementInst>(CurrentInsertElt)) {
----------------
nit: this can be `auto`, because the type is clear from the `dyn_cast<InsertElementInst>`.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1478
+
+  if (!isa<PoisonValue>(CurrentInsertElt) || !isa<PoisonValue>(Default) ||
+      !SimplifyValuePattern(InsertEltVec))
----------------
I don't think you'll need to check this, since you've already got the check for `*Lhs != nullptr && *Rhs != nullptr` in SimplifyValuePattern, which should always return `false` if any of the lanes is not explicitly defined. And because the size of `InsertEltVec` is based on the minimum number of elements of the scalable vector, it will also return false for the following case:

  %0 = insertelement <2 x half> poison, half %a, i64 0
  %1 = insertelement <2 x half> %0, half %b, i64 1
  %2 = call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v2f16(<vscale x 8 x half> poison, <2 x half> %1, i64 0)

Because InsertEltVec will be defined as:

  [%a, %b, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr]


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1486
+  Constant *Zero = ConstantInt::get(Builder.getInt64Ty(), APInt(64, 0));
+  Value *InsertEltChain = Builder.CreateInsertElement(
+      PoisonValue::get(CurrentInsertElt->getType()), InsertEltVec[0], Zero);
----------------
If you define `InsertEltChain` as `PoisonValue::get(CurrentInsertElt->getType())`, then you can start the loop at I = 0.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1489
+  for (size_t I = 1; I < InsertEltVec.size(); I++) {
+    Constant *Idx = ConstantInt::get(Builder.getInt64Ty(), APInt(64, I));
+    InsertEltChain =
----------------
I think you can do just `Builder.getInt64(I)` (and then you don't need the extra variable for it, and can propagate it directly into the expression below, and also drop the curly braces :)


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1498
+  // be narrowed back to the original type.
+  int PatternWidth = IIScalableTy->getScalarSizeInBits() * InsertEltVec.size();
+  int PatternElementCount = IIScalableTy->getScalarSizeInBits() *
----------------
unsigned?


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1499
+  int PatternWidth = IIScalableTy->getScalarSizeInBits() * InsertEltVec.size();
+  int PatternElementCount = IIScalableTy->getScalarSizeInBits() *
+                            IIScalableTy->getMinNumElements() / PatternWidth;
----------------
unsigned?
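For reference, a tiny standalone sketch of this width arithmetic with `unsigned` as suggested (`patternWidth`/`patternElementCount` are hypothetical helper names; the numbers model `<vscale x 8 x half>` with a 4-element repeating pattern):

```cpp
#include <cassert>

// For <vscale x 8 x half> with a 4-element repeating pattern:
// PatternWidth = 16 * 4 = 64 bits, and the 128-bit minimum register
// holds 16 * 8 / 64 = 2 repetitions of that pattern.
static unsigned patternWidth(unsigned ScalarSizeInBits, unsigned PatternLen) {
  return ScalarSizeInBits * PatternLen;
}

static unsigned patternElementCount(unsigned ScalarSizeInBits,
                                    unsigned MinNumElements,
                                    unsigned PatternWidth) {
  return ScalarSizeInBits * MinNumElements / PatternWidth;
}
```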


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:1508
+  auto ZeroIdx = ConstantInt::get(Builder.getInt64Ty(), APInt(64, 0));
+  CallInst *InsertSubvector = Builder.CreateInsertVector(
+      II.getType(), PoisonValue::get(II.getType()), InsertEltChain, ZeroIdx);
----------------
can just as well use `auto` or `Value *` for all these variables here, since their contents don't really matter too much.


================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:30
+; CHECK-NEXT:    [[TMP3:%.*]] = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> [[TMP2]], i64 0)
+; CHECK-NEXT:    [[TMP4:%.*]] = tail call <vscale x 8 x half> @llvm.aarch64.sve.dupq.lane.nxv8f16(<vscale x 8 x half> [[TMP3]], i64 0)
+; CHECK-NEXT:    ret <vscale x 8 x half> [[TMP4]]
----------------
It's a pity this case isn't optimised. This is probably because InstCombine has already transformed it into a 'shufflevector' splat. You could do something like this:

  if (Value *V = getSplatValue(CurrentInsertElt)) {              
    InsertEltVec[0] = V;                                         
    InsertEltVec.resize(1);                                      
  } else {                                                       
    while (InsertElementInst *InsertElt =                        
               dyn_cast<InsertElementInst>(CurrentInsertElt)) {  
    ...
  }

to handle it.


================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:94
+
+define dso_local <vscale x 8 x half> @dupq_f16_abcnull_pattern(half %a, half %b, half %c, half %d) {
+; CHECK-LABEL: @dupq_f16_abcnull_pattern(
----------------
Is this test much different from `@dupq_f16_ab_pattern_no_end_indices`?


================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:136-137
+  %7 = insertelement <8 x half> %6, half %c, i64 6
+  %8 = insertelement <8 x half> %7, half %c, i64 7
+  %9 = insertelement <8 x half> %8, half %d, i64 7
+  %10 = tail call <vscale x 8 x half> @llvm.vector.insert.nxv8f16.v8f16(<vscale x 8 x half> poison, <8 x half> %9, i64 0)
----------------
If you swap `%c` and `%d` this at least becomes a negative test to ensure LLVM doesn't perform the wrong optimisation. That's better than it being a positive test where it just happens to do the right thing because another InstCombine rule has fired first, even though your code would have optimised this incorrectly.


================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:237
+
+define dso_local <vscale x 8 x half> @dupq_f16_ab_pattern_no_end_indices_not_poison(half %a, half %b, <vscale x 8 x half> %c) {
+; CHECK-LABEL: @dupq_f16_ab_pattern_no_end_indices_not_poison(
----------------
no end indices not poison? (perhaps there is a double negative here that confuses me?)

In any case, the end indices are poison, because the test is inserting into `poison`


================
Comment at: llvm/test/Transforms/InstCombine/AArch64/sve-intrinsic-dupqlane.ll:260
+
+define dso_local <vscale x 8 x half> @dupq_f16_ab_no_front_pattern(half %a, half %b) {
+; CHECK-LABEL: @dupq_f16_ab_no_front_pattern(
----------------
Given the algorithm just compares the two halves of the vector, I'm not sure there's much value in having all three negative tests "no_front_pattern", "no_middle_pattern" and "no_end_pattern"


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138203/new/

https://reviews.llvm.org/D138203
