[PATCH] D101167: [AArch64][SVE] Convert svdup(vec, SV_VL1, elm) to insertelement(vec, elm, 0)
Joe Ellis via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 26 02:05:24 PDT 2021
joechrisellis added a comment.
Hello! This looks good to me modulo a few nits.
Not necessarily for this commit, but I ended up implementing a similar optimisation myself as part of some other work: beyond rewriting a single SV_VL1 DUP, we could also recognise when a chain of DUP calls is equivalent to a chain of insertelement calls. For example:
define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %pg1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 3)
  %insert1 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg1, i8 %s)
  %pg2 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
  %insert2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert1, <vscale x 16 x i1> %pg2, i8 %s)
  %pg3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
  %insert3 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert2, <vscale x 16 x i1> %pg3, i8 %s)
  ret <vscale x 16 x i8> %insert3
}
is the same as:
define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %1 = insertelement <vscale x 16 x i8> %v, i8 %s, i64 2
  %2 = insertelement <vscale x 16 x i8> %1, i8 %s, i64 1
  %3 = insertelement <vscale x 16 x i8> %2, i8 %s, i64 0
  ret <vscale x 16 x i8> %3
}
This works because a DUP with a vl<n> ptrue sets lanes 0 through n-1, but every lane below n-1 is subsequently overwritten by the next DUP in the chain, so each DUP ends up contributing exactly lane n-1 of the final value. Doing this might look like:
// NOTE: not tested very extensively at all. Assumes the surrounding
// function's `I` (the candidate DUP) and `Changed` flag.
// Walk the chain of DUPs bottom-up, expecting the ptrue patterns
// vl1, vl2, ... as we move towards the start of the chain.
auto *Cursor = I;
unsigned ExpectedPTruePattern = AArch64SVEPredPattern::vl1;
while (Cursor && Cursor->getIntrinsicID() == Intrinsic::aarch64_sve_dup &&
       ExpectedPTruePattern <= AArch64SVEPredPattern::vl8) {
  Value *Dst = Cursor->getArgOperand(0);
  Value *Pg = Cursor->getArgOperand(1);
  Value *Splat = Cursor->getArgOperand(2);

  // The governing predicate must be a ptrue with the expected vl<n> pattern.
  auto *PTrue = dyn_cast<IntrinsicInst>(Pg);
  if (!PTrue || PTrue->getIntrinsicID() != Intrinsic::aarch64_sve_ptrue)
    break;
  const auto PTruePattern =
      cast<ConstantInt>(PTrue->getOperand(0))->getZExtValue();
  if (PTruePattern != ExpectedPTruePattern)
    break;

  // Rewrite the DUP as an insertion into lane n-1.
  LLVMContext &Ctx = Cursor->getContext();
  IRBuilder<> Builder(Ctx);
  Builder.SetInsertPoint(Cursor);
  auto *Insert =
      Builder.CreateInsertElement(Dst, Splat, ExpectedPTruePattern - 1);
  Cursor->replaceAllUsesWith(Insert);
  Cursor->eraseFromParent();
  if (PTrue->use_empty())
    PTrue->eraseFromParent();

  // Step to the previous DUP in the chain.
  Cursor = dyn_cast<IntrinsicInst>(Dst);
  ExpectedPTruePattern++;
  Changed = true;
}
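One caveat with the above (hence the NOTE): rewriting the vl1 DUP as an insert into lane 0 is exact, but for vl2 and upwards the insertelement only writes lane n-1, whereas the DUP writes lanes 0 through n-1. The rewrite of an intermediate DUP is therefore only sound when its sole use is the next DUP down the chain, which overwrites the lower lanes anyway; a real implementation would probably also want a `hasOneUse()` check on each intermediate `Cursor`.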
(alternatively, we could keep the existing optimisation as-is, but add another function that looks for insertions into DUPs)
We see these chained DUPs when moving data between NEON and SVE ACLE types[0]; there's a sketch of the kind of source that produces them below. I think doing this further optimisation might make sense in the context of this patch, but it's not a blocker from me.
[0]: https://developer.arm.com/documentation/ka004612/latest
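For illustration, the sort of ACLE source that lowers to one of these DUP chains looks something like the following (a minimal sketch; the helper name is made up, but `svdup_n_s8_m` and `svptrue_pat_b8` are real ACLE intrinsics):

#include <arm_sve.h>

// Hypothetical helper: set the first three lanes of an SVE vector,
// keeping the remaining lanes of v.
svint8_t set_first_three_lanes(svint8_t v, int8_t a, int8_t b, int8_t c) {
  v = svdup_n_s8_m(v, svptrue_pat_b8(SV_VL3), c); // lanes 0..2 = c
  v = svdup_n_s8_m(v, svptrue_pat_b8(SV_VL2), b); // lanes 0..1 = b
  v = svdup_n_s8_m(v, svptrue_pat_b8(SV_VL1), a); // lane 0 = a
  return v;
}

Each merging `svdup_n_s8_m` becomes an `@llvm.aarch64.sve.dup` call in IR, and the descending SV_VL3/SV_VL2/SV_VL1 predicates are exactly what makes the chain equivalent to three single-lane insertions.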
================
Comment at: llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp:547
+
+ // The intrinsic is inserting into lane zero so use an extract instead.
+ Type *IdxTy = Type::getInt64Ty(I->getContext());
----------------
nit: I think this should say `insert` instead of `extract`.
================
Comment at: llvm/test/CodeGen/AArch64/sve-intrinsic-opts-dup.ll:1
+; RUN: opt -S -aarch64-sve-intrinsic-opts -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
----------------
nit: can we declare the triple in the IR instead of on the RUN line?
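i.e. something like this, dropping `-mtriple` from the RUN line and adding a `target triple` directive to the test:

; RUN: opt -S -aarch64-sve-intrinsic-opts < %s | FileCheck %s

target triple = "aarch64-linux-gnu"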
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D101167/new/
https://reviews.llvm.org/D101167