[PATCH] D147408: [AMDGPU] Iterative scan implementation for atomic optimizer.

Fri Apr 28 05:35:07 PDT 2023

foad added inline comments.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPU.h:89
+enum class ScanOptions : bool { DPP, Iterative };
+FunctionPass *createAMDGPUAtomicOptimizerPass(bool IsUseDpp);
 void initializeAMDGPUAtomicOptimizerPass(PassRegistry &);
----------------
Just "UseDpp"?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPU.h:245
+
+public:
+  ScanOptions ScanImpl;
----------------
No reason for this to be public.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:501-502
+      B.CreateIntrinsic(Intrinsic::amdgcn_ballot, WaveTy, B.getTrue());
+  Type *const BallotTy = Ballot->getType();
+  const unsigned TyBitWidth = DL->getTypeSizeInBits(BallotTy);
+
----------------
Don't need these. They will always be the same as WaveTy and WaveFrontSize.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:515
+    OldValuePhi = B.CreatePHI(Ty, 2, "old_value_phi");
+    OldValuePhi->addIncoming(V, EntryBB);
+  }
----------------
I think you can use an undef value here instead of V.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:521-522
+  // Call llvm.cttz instrinsic to find the lowest remaining active lane.
+  Function *CttzDecl = Intrinsic::getDeclaration(M, Intrinsic::cttz, WaveTy);
+  Value *FF1 = B.CreateCall(CttzDecl, {ActiveBits, B.getTrue()});
+  Value *LaneIdxInt = B.CreateTrunc(FF1, Ty);
----------------
Use B.CreateIntrinsic, then you don't need M.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:529
+
+  // Perfrom writelane if intermidiate scan results are required later in the
+  // kernel computations
----------------
Typo "perform", "intermediate"

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:533-536
+    Function *WriteLaneDecl =
+        Intrinsic::getDeclaration(M, Intrinsic::amdgcn_writelane, {});
+    OldValue =
+        B.CreateCall(WriteLaneDecl, {Accumulator, LaneIdxInt, OldValuePhi});
----------------
Use B.CreateIntrinsic

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:553
+  Value *IsEnd = B.CreateICmpEQ(NewActiveBits, B.getIntN(TyBitWidth, 0));
+  B.CreateCondBr(IsEnd, ComputeEnd, Compute);
+
----------------
Maybe change "Compute" to "ComputeLoop" everywhere, since it is the body of the loop?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:658

+  llvm::Function *F = I.getParent()->getParent();
+  LLVMContext &C = I.getParent()->getContext();
----------------
Remove `llvm::`, use `getFunction`

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:664-665
+  if (ValDivergent && ScanImpl == ScanOptions::Iterative) {
+    Compute = BasicBlock::Create(C, "Compute", F);
+    ComputeEnd = BasicBlock::Create(C, "ComputeEnd", F);
+  }
----------------
Sink this down to line 700, where you use them?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:740-746
+  if (ValDivergent && ScanImpl == ScanOptions::Iterative) {
+    // Only the first active lane will enter the new control flow to update the
+    // value.
+    CallInst *const FirstActiveLane =
+        B.CreateIntrinsic(Intrinsic::amdgcn_readfirstlane, {}, Mbcnt);
+    Cond = B.CreateICmpEQ(Mbcnt, FirstActiveLane);
+  } else {
----------------
I don't think you need to change any of this. The original way of doing the icmp should work in all cases.

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:767
+    B.SetInsertPoint(ComputeEnd);
+    Instruction *NewTerminator = Terminator->clone();
+    Terminator->eraseFromParent();
----------------
Why do you need to clone? Could you just do Terminator->removeFromParent then B.Insert(Terminator)?

================
Comment at: llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp:272-275
+// Enable atomic optimization with DPP
+static cl::opt<bool> EnableAtomicOptimizationsUsingDPP(
+    "amdgpu-atomic-optimizer-use-dpp",
+    cl::desc("Enable atomic optimizations using DPP"), cl::init(true),
----------------
Description seems a bit misleading to me since this option doesn't enable the whole atomic optimizer pass. I would suggest changing it to "Enable use of DPP in the atomic optimizer" or just "Use DPP in the atomic optimizer"

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D147408/new/

https://reviews.llvm.org/D147408