[PATCH] D101167: [AArch64][SVE] Convert svdup(vec, SV_VL1, elm) to insertelement(vec, elm, 0)
Joe Ellis via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 26 02:05:24 PDT 2021
joechrisellis added a comment.
Hello! This looks good to me modulo a few nits.
Not necessarily for this commit, but I ended up implementing a similar optimisation myself as part of some other work: beyond rewriting a single SV_VL1 DUP, we could also recognise when a chain of DUP calls is equivalent to a chain of insertelement calls. For example:
define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %pg1 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 3)
  %insert1 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %v, <vscale x 16 x i1> %pg1, i8 %s)
  %pg2 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 2)
  %insert2 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert1, <vscale x 16 x i1> %pg2, i8 %s)
  %pg3 = tail call <vscale x 16 x i1> @llvm.aarch64.sve.ptrue.nxv16i1(i32 1)
  %insert3 = tail call <vscale x 16 x i8> @llvm.aarch64.sve.dup.nxv16i8(<vscale x 16 x i8> %insert2, <vscale x 16 x i1> %pg3, i8 %s)
  ret <vscale x 16 x i8> %insert3
}
is the same as:
define <vscale x 16 x i8> @dup_insertelement_multi(<vscale x 16 x i8> %v, i8 %s) #0 {
  %1 = insertelement <vscale x 16 x i8> %v, i8 %s, i64 2
  %2 = insertelement <vscale x 16 x i8> %1, i8 %s, i64 1
  %3 = insertelement <vscale x 16 x i8> %2, i8 %s, i64 0
  ret <vscale x 16 x i8> %3
}
This works because a DUP with a vl<n> ptrue sets lanes 0 through n-1, but every lane below n-1 is subsequently overwritten by the next DUP in the chain, so each DUP ends up contributing exactly lane n-1 of the final value. Doing this might look like:
// NOTE: not tested very extensively at all. Assumes the surrounding
// function's `I` (the candidate DUP) and `Changed` flag.
// Walk the chain of DUPs bottom-up, expecting the ptrue patterns
// vl1, vl2, ... as we move towards the start of the chain.
auto *Cursor = I;
unsigned ExpectedPTruePattern = AArch64SVEPredPattern::vl1;
while (Cursor && Cursor->getIntrinsicID() == Intrinsic::aarch64_sve_dup &&
       ExpectedPTruePattern <= AArch64SVEPredPattern::vl8) {
  Value *Dst = Cursor->getArgOperand(0);
  Value *Pg = Cursor->getArgOperand(1);
  Value *Splat = Cursor->getArgOperand(2);

  // The governing predicate must be a ptrue with the expected vl<n> pattern.
  auto *PTrue = dyn_cast<IntrinsicInst>(Pg);
  if (!PTrue || PTrue->getIntrinsicID() != Intrinsic::aarch64_sve_ptrue)
    break;
  const auto PTruePattern =
      cast<ConstantInt>(PTrue->getOperand(0))->getZExtValue();
  if (PTruePattern != ExpectedPTruePattern)
    break;

  // Rewrite the DUP as an insertion into lane n-1.
  LLVMContext &Ctx = Cursor->getContext();
  IRBuilder<> Builder(Ctx);
  Builder.SetInsertPoint(Cursor);
  auto *Insert =
      Builder.CreateInsertElement(Dst, Splat, ExpectedPTruePattern - 1);
  Cursor->replaceAllUsesWith(Insert);
  Cursor->eraseFromParent();
  if (PTrue->use_empty())
    PTrue->eraseFromParent();

  // Step to the previous DUP in the chain.
  Cursor = dyn_cast<IntrinsicInst>(Dst);
  ExpectedPTruePattern++;
  Changed = true;
}
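One caveat with the above (hence the NOTE): rewriting the vl1 DUP as an insert into lane 0 is exact, but for vl2 and upwards the insertelement only writes lane n-1, whereas the DUP writes lanes 0 through n-1. The rewrite of an intermediate DUP is therefore only sound when its sole use is the next DUP down the chain, which overwrites the lower lanes anyway; a real implementation would probably also want a `hasOneUse()` check on each intermediate `Cursor`.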
(alternatively, we could keep the existing optimisation as-is, but add another function that looks for insertions into DUPs)
We see these chained DUPs when moving data between NEON and SVE ACLE types[0]; there's a sketch of the kind of source that produces them below. I think doing this further optimisation might make sense in the context of this patch, but it's not a blocker from me.
[0]: https://developer.arm.com/documentation/ka004612/latest
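For illustration, the sort of ACLE source that lowers to one of these DUP chains looks something like the following (a minimal sketch; the helper name is made up, but `svdup_n_s8_m` and `svptrue_pat_b8` are real ACLE intrinsics):

#include <arm_sve.h>

// Hypothetical helper: set the first three lanes of an SVE vector,
// keeping the remaining lanes of v.
svint8_t set_first_three_lanes(svint8_t v, int8_t a, int8_t b, int8_t c) {
  v = svdup_n_s8_m(v, svptrue_pat_b8(SV_VL3), c); // lanes 0..2 = c
  v = svdup_n_s8_m(v, svptrue_pat_b8(SV_VL2), b); // lanes 0..1 = b
  v = svdup_n_s8_m(v, svptrue_pat_b8(SV_VL1), a); // lane 0 = a
  return v;
}

Each merging `svdup_n_s8_m` becomes an `@llvm.aarch64.sve.dup` call in IR, and the descending SV_VL3/SV_VL2/SV_VL1 predicates are exactly what makes the chain equivalent to three single-lane insertions.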
================
Comment at: llvm/lib/Target/AArch64/SVEIntrinsicOpts.cpp:547
+
+ // The intrinsic is inserting into lane zero so use an extract instead.
+ Type *IdxTy = Type::getInt64Ty(I->getContext());
----------------
nit: I think this should say `insert` instead of `extract`.
================
Comment at: llvm/test/CodeGen/AArch64/sve-intrinsic-opts-dup.ll:1
+; RUN: opt -S -aarch64-sve-intrinsic-opts -mtriple=aarch64-linux-gnu < %s | FileCheck %s
+
----------------
nit: can we declare the triple in the IR instead of on the RUN line?
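i.e. something like this, dropping `-mtriple` from the RUN line and adding a `target triple` directive to the test:

; RUN: opt -S -aarch64-sve-intrinsic-opts < %s | FileCheck %s

target triple = "aarch64-linux-gnu"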
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D101167/new/
https://reviews.llvm.org/D101167