[PATCH] D93594: [X86] Pass to transform amx intrinsics to scalar operation.
Pengfei Wang via Phabricator via cfe-commits
cfe-commits at lists.llvm.org
Wed Feb 24 05:03:57 PST 2021
pengfei added inline comments.
================
Comment at: llvm/include/llvm/CodeGen/Passes.h:496
+
+ /// The pass transform amx intrinsics to scalar operation if the function has
+ /// optnone attribute or it is O0.
----------------
transforms
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:1
+//===- llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp ----------------------===//
+//
----------------
We usually write the file header as `//===--- filename - description ---===//`.
See `head -n1 llvm/lib/Target/X86/*.cpp`
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:51
+ BasicBlock *Header = BasicBlock::Create(
+ Preheader->getContext(), Name + ".header", Preheader->getParent(), Exit);
+ BasicBlock *Body = BasicBlock::Create(Header->getContext(), Name + ".body",
----------------
Use a single `Ctx` variable instead of calling `getContext()` on different blocks.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:88
+
+template <Intrinsic::ID IntrID,
+ typename = typename std::enable_if<
----------------
Can we just use `template <bool IsLoad>`? I think it can also reduce the branching.
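Something like this, perhaps (just a sketch to show the idea; `emitTileLoadStoreElt` and the operand names are made up for illustration, not from the patch):

  // IsLoad is a compile-time constant, so each instantiation keeps only one
  // side of the branch; the other side folds away.
  template <bool IsLoad>
  Value *emitTileLoadStoreElt(IRBuilder<> &B, Value *EltPtr, Value *Vec,
                              Value *Idx) {
    if (IsLoad) {
      // tileloadd path: load the element and insert it into the vector.
      Value *Elt = B.CreateLoad(B.getInt32Ty(), EltPtr);
      return B.CreateInsertElement(Vec, Elt, Idx);
    }
    // tilestored path: extract the element and store it back to memory.
    B.CreateStore(B.CreateExtractElement(Vec, Idx), EltPtr);
    return Vec;
  }

The callers would then instantiate `emitTileLoadStoreElt<true>` for tileloadd and `emitTileLoadStoreElt<false>` for tilestored.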
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:99
+ Loop *RowLoop = LI.AllocateLoop();
+ Loop *ColLoop = LI.AllocateLoop();
+ RowLoop->addChildLoop(ColLoop);
----------------
I'm not sure about the arithmetic intrinsics, but at least for the load and store intrinsics we can use the LLVM `llvm.masked.load/store` intrinsics to get rid of the inner loop.
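Roughly like this for the load side (only a sketch; `B`, `RowPtr` and `Col` are illustrative names, the row is assumed to be 16 x i32, and the exact `CreateMaskedLoad` overload depends on the LLVM version):

  // Compare a lane-index constant <0, 1, ..., 15> against a splat of the
  // dynamic column count to build the per-row mask, then load the whole row
  // with one llvm.masked.load instead of an element-by-element inner loop.
  auto *V16I32 = FixedVectorType::get(B.getInt32Ty(), 16);
  SmallVector<Constant *, 16> Lanes;
  for (unsigned I = 0; I != 16; ++I)
    Lanes.push_back(B.getInt32(I));
  Value *LaneVec = ConstantVector::get(Lanes);
  Value *ColSplat = B.CreateVectorSplat(16, Col); // Col assumed to be i32 here
  Value *Mask = B.CreateICmpULT(LaneVec, ColSplat);
  Value *RowVec = B.CreateMaskedLoad(V16I32, RowPtr, Align(4), Mask,
                                     Constant::getNullValue(V16I32));

The store side would be symmetric with `CreateMaskedStore`.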
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:166
+ Value *Vec = nullptr;
+ if (auto *BitCast = dyn_cast<BitCastInst>(Tile))
+ Vec = BitCast->getOperand(0);
----------------
Maybe we can just use `cast<>` here, which raises the assertion for us.
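i.e. something like (sketch):

  // cast<> asserts when Tile is not actually a BitCastInst, so the invariant
  // is checked and Vec is always initialized.
  Value *Vec = cast<BitCastInst>(Tile)->getOperand(0);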
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:223
+ Value *VecC, *VecA, *VecB;
+ if (auto *BitCast = dyn_cast<BitCastInst>(Acc))
+ VecC = BitCast->getOperand(0);
----------------
You can use `cast<>` here as well, so the failure is caught by the assertion and VecA/B/C are never left uninitialized.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:229
+ // to vector. However with -O0, it doesn't happen.
+ if (auto *BitCast = dyn_cast<BitCastInst>(LHS))
+ VecA = BitCast->getOperand(0);
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:231
+ VecA = BitCast->getOperand(0);
+ assert(VecA->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
+ if (auto *BitCast = dyn_cast<BitCastInst>(RHS))
----------------
Shouldn't this check that the type is exactly `<256 x i32>` rather than just any vector type?
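For example (sketch, assuming an LLVMContext `Ctx` is reachable here):

  // Check the exact element type and count, not just "is a vector".
  assert(VecA->getType() ==
             FixedVectorType::get(Type::getInt32Ty(Ctx), 256) &&
         "bitcast source is not <256 x i32>");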
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:232
+ assert(VecA->getType()->isVectorTy() && "bitcast from non-v256i32 to x86amx");
+ if (auto *BitCast = dyn_cast<BitCastInst>(RHS))
+ VecB = BitCast->getOperand(0);
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:288
+ // %acc = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> %131)
+ // %neweltc = add i32 %elt, %acc
+ // %NewVecC = insertelement <256 x i32> %vec.c.inner.phi, i32 %neweltc,
----------------
Should `%elt` be `%eltc` here?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:311
+ Value *ResElt = B.CreateAdd(EltC, SubVecR);
+ Value *NewVecC = B.CreateInsertElement(VecCPhi, ResElt, IdxC);
+ Value *NewVecD = B.CreateInsertElement(VecDPhi, ResElt, IdxC);
----------------
Is it necessary to insert ResElt into VecC?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:340
+ IntrID == Intrinsic::x86_tilestored64_internal>::type>
+ bool lowerTileLoadStore(Instruction *TileLoad);
+ bool lowerTileLoad(Instruction *TileLoad);
----------------
The parameter name should be `TileLoadStore`.
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:341
+ bool lowerTileLoadStore(Instruction *TileLoad);
+ bool lowerTileLoad(Instruction *TileLoad);
+ bool lowerTileDPBSSD(Instruction *TileDPBSSD);
----------------
Forgot to remove?
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:343
+ bool lowerTileDPBSSD(Instruction *TileDPBSSD);
+ bool lowerTileStore(Instruction *TileStore);
+ bool lowerTileZero(Instruction *TileZero);
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:387
+
+template <Intrinsic::ID IntrID,
+ typename = typename std::enable_if<
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXIntrinsics.cpp:391
+ IntrID == Intrinsic::x86_tilestored64_internal>::type>
+bool X86LowerAMXIntrinsics::lowerTileLoadStore(Instruction *TileLoad) {
+ Value *M, *N, *Ptr, *Stride, *Tile;
----------------
ditto
================
Comment at: llvm/lib/Target/X86/X86LowerAMXType.cpp:333
+ TargetMachine *TM = &getAnalysis<TargetPassConfig>().getTM<TargetMachine>();
+ if (F.hasFnAttribute(Attribute::OptimizeNone) ||
+ TM->getOptLevel() == CodeGenOpt::None)
----------------
ditto
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:1
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
+; RUN: opt -mtriple=x86_64 -lower-amx-intrinsics %s -S | FileCheck %s
----------------
Better to name it amx-low-intrinsics-no-amx-bitcast.ll
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:13
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_ROWS_BODY:%.*]]
+; CHECK: tileload.scalarize.rows.body:
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_COLS_HEADER:%.*]]
----------------
It seems the body block is not necessary
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:19
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_COLS_BODY:%.*]]
+; CHECK: tileload.scalarize.cols.body:
+; CHECK-NEXT: [[TMP1:%.*]] = zext i16 [[TILELOAD_SCALARIZE_ROWS_IV]] to i64
----------------
Ditto. The label `TILELOAD_SCALARIZE_COLS_BODY` is not even used.
================
Comment at: llvm/test/CodeGen/X86/AMX/amx-low-intrinsics-no-bitcast.ll:31
+; CHECK-NEXT: br label [[TILELOAD_SCALARIZE_COLS_LATCH]]
+; CHECK: tileload.scalarize.cols.latch:
+; CHECK-NEXT: [[TILELOAD_SCALARIZE_COLS_STEP]] = add i16 [[TILELOAD_SCALARIZE_COLS_IV]], 1
----------------
I think cols.latch is not necessary either.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D93594/new/
https://reviews.llvm.org/D93594