[PATCH] D93594: [X86] Pass to transform amx intrinsics to scalar operation.
Pengfei Wang via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Wed Feb 24 05:03:57 PST 2021
pengfei added inline comments.
================
Comment at: llvm/include/llvm/CodeGen/Passes.h:496
+
+ /// The pass transform amx intrinsics to scalar operation if the function has
+ /// optnone attribute or it is O0.
----------------
transforms
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:1
+//===- llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp ----------------------===//
+//
----------------
We usually write the file header as `//===--- filename - description ---===//`.
See `head -n1 llvm/lib/Target/X86/*.cpp`
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:51
+ BasicBlock *Header = BasicBlock::Create(
+ Preheader->getContext(), Name + ".header", Preheader->getParent(), Exit);
+ BasicBlock *Body = BasicBlock::Create(Header->getContext(), Name + ".body",
----------------
Use a single `Ctx` variable instead of calling `getContext()` on different blocks.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:88
+
+template <Intrinsic::ID IntrID,
+ typename = typename std::enable_if<
----------------
Can we just use `template <bool IsLoad>`? I think it can also reduce the branching.
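Something like this, perhaps (just a sketch to show the idea; `emitTileLoadStoreElt` and the operand names are made up for illustration, not from the patch):

  // IsLoad is a compile-time constant, so each instantiation keeps only one
  // side of the branch; the other side folds away.
  template <bool IsLoad>
  Value *emitTileLoadStoreElt(IRBuilder<> &B, Value *EltPtr, Value *Vec,
                              Value *Idx) {
    if (IsLoad) {
      // tileloadd path: load the element and insert it into the vector.
      Value *Elt = B.CreateLoad(B.getInt32Ty(), EltPtr);
      return B.CreateInsertElement(Vec, Elt, Idx);
    }
    // tilestored path: extract the element and store it back to memory.
    B.CreateStore(B.CreateExtractElement(Vec, Idx), EltPtr);
    return Vec;
  }

The callers would then instantiate `emitTileLoadStoreElt<true>` for tileloadd and `emitTileLoadStoreElt<false>` for tilestored.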
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:99
+ Loop *RowLoop = LI.AllocateLoop();
+ Loop *ColLoop = LI.AllocateLoop();
+ RowLoop->addChildLoop(ColLoop);
----------------
I'm not sure about the arithmetic intrinsics, but at least for the load and store intrinsics we can use the LLVM `llvm.masked.load/store` intrinsics to get rid of the inner loop.
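Roughly like this for the load side (only a sketch; `B`, `RowPtr` and `Col` are illustrative names, the row is assumed to be 16 x i32, and the exact `CreateMaskedLoad` overload depends on the LLVM version):

  // Compare a lane-index constant <0, 1, ..., 15> against a splat of the
  // dynamic column count to build the per-row mask, then load the whole row
  // with one llvm.masked.load instead of an element-by-element inner loop.
  auto *V16I32 = FixedVectorType::get(B.getInt32Ty(), 16);
  SmallVector<Constant *, 16> Lanes;
  for (unsigned I = 0; I != 16; ++I)
    Lanes.push_back(B.getInt32(I));
  Value *LaneVec = ConstantVector::get(Lanes);
  Value *ColSplat = B.CreateVectorSplat(16, Col); // Col assumed to be i32 here
  Value *Mask = B.CreateICmpULT(LaneVec, ColSplat);
  Value *RowVec = B.CreateMaskedLoad(V16I32, RowPtr, Align(4), Mask,
                                     Constant::getNullValue(V16I32));

The store side would be symmetric with `CreateMaskedStore`.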
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:166
+ Value *Vec = nullptr;
+ if (auto *BitCast = dyn_cast<BitCastInst>(Tile))
+ Vec = BitCast->getOperand(0);
----------------
Maybe we can just use `cast<>` here, which raises the assertion for us.
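i.e. something like (sketch):

  // cast<> asserts when Tile is not actually a BitCastInst, so the invariant
  // is checked and Vec is always initialized.
  Value *Vec = cast<BitCastInst>(Tile)->getOperand(0);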
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:223
+ Value *VecC, *VecA, *VecB;
+ if (auto *BitCast = dyn_cast<BitCastInst>(Acc))
+ VecC = BitCast->getOperand(0);
----------------
You can use `cast<>` here as well, so the failure is caught by the assertion and VecA/B/C are never left uninitialized.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:229
+ // to vector. However with -O0, it doesn't happen.
+ if (auto *BitCast = dyn_cast<BitCastInst>(LHS))
+ VecA = BitCast->getOperand(0);
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:231
+ VecA = BitCast->getOperand(0);
+ assert(VecA->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
+ if (auto *BitCast = dyn_cast<BitCastInst>(RHS))
----------------
Shouldn't this check that the type is exactly `<256 x i32>` rather than just any vector type?
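For example (sketch, assuming an LLVMContext `Ctx` is reachable here):

  // Check the exact element type and count, not just "is a vector".
  assert(VecA->getType() ==
             FixedVectorType::get(Type::getInt32Ty(Ctx), 256) &&
         "bitcast source is not <256 x i32>");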
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:232
+ assert(VecA->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
+ if (auto *BitCast = dyn_cast<BitCastInst>(RHS))
+ VecB = BitCast->getOperand(0);
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:288
+ // %acc = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %131)
+ // %neweltc = add i32 %elt, %acc
+ // %NewVecC = insertelement <256 x i32> %vec.c.inner.phi, i32 %neweltc,
----------------
Should `%elt` be `%eltc` here?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:311
+ Value *ResElt = B.CreateAdd(EltC, SubVecR);
+ Value *NewVecC = B.CreateInsertElement(VecCPhi, ResElt, IdxC);
+ Value *NewVecD = B.CreateInsertElement(VecDPhi, ResElt, IdxC);
----------------
Is it necessary to insert ResElt into VecC?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:340
+ IntrID == Intrinsic::x86_tilestored64_internal>::type>
+ bool lowerTileLoadStore(Instruction *TileLoad);
+ bool lowerTileLoad(Instruction *TileLoad);
----------------
The parameter name should be `TileLoadStore`.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:341
+ bool lowerTileLoadStore(Instruction *TileLoad);
+ bool lowerTileLoad(Instruction *TileLoad);
+ bool lowerTileDPBSSD(Instruction *TileDPBSSD);
----------------
Forgot to remove?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:343
+ bool lowerTileDPBSSD(Instruction *TileDPBSSD);
+ bool lowerTileStore(Instruction *TileStore);
+ bool lowerTileZero(Instruction *TileZero);
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:387
+
+template <Intrinsic::ID IntrID,
+ typename = typename std::enable_if<
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:391
+ IntrID == Intrinsic::x86_tilestored64_internal>::type>
+bool X86LowerAMXIntrinsics::lowerTileLoadStore(Instruction *TileLoad) {
+ Value *M, *N, *Ptr, *Stride, *Tile;
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:333
+ TargetMachine *TM = &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
+ if (F.hasFnAttribute(Attribute::OptimizeNone) ||
+ TM->getOptLevel() == CodeGenOpt::None)
----------------
ditto
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:1
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -mtriple=x86_64 -lower-amx-intrinsics %s -S | FileCheck %s
----------------
Better to name it amx-low-intrinsics-no-amx-bitcast.ll
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:13
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_ROWS_BODY:%.*]]
+; CHECK: tileload.scalarize.rows.body:
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_COLS_HEADER:%.*]]
----------------
It seems the body block is not necessary
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:19
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_COLS_BODY:%.*]]
+; CHECK: tileload.scalarize.cols.body:
+; CHECK-NEXT: [[TMP1:%.*]] = zext i16 [[TILELOAD_SCALARIZE_ROWS_IV]] to i64
----------------
Ditto. The label `TILELOAD_SCALARIZE_COLS_BODY` is not even used.
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:31
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_COLS_LATCH]]
+; CHECK: tileload.scalarize.cols.latch:
+; CHECK-NEXT: [[TILELOAD_SCALARIZE_COLS_STEP]] = add i16 [[TILELOAD_SCALARIZE_COLS_IV]], 1
----------------
I think cols.latch is not necessary either.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D93594/new/
https://reviews.llvm.org/D93594