[PATCH] D51969: [AMDGPU] Add an AMDGPU specific atomic optimizer.

Thu Sep 13 05:10:48 PDT 2018

arsenm added a comment.

Should include tests running only the IR pass

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:57
+
+  StringRef getPassName() const override { return "AMDGPU Atomic Optimizer"; }
+
----------------
This can be dropped

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:60-61
+  void getAnalysisUsage(AnalysisUsage &AU) const override {
+    AU.addRequired<LegacyDivergenceAnalysis>();
+    AU.addRequired<TargetPassConfig>();
+  }
----------------
This needs work, but should preserve the dominator tree

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:176
+  if (ValDivergent &&
+      (!HasDPP || (I.getType()->getPrimitiveSizeInBits() != 32))) {
+    return;
----------------
No getPrimitiveSizeInBits

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:207
+  //       \------------------> exit
+  BasicBlock *const EntryBB = I.getParent();
+  BasicBlock *const ExitBB = EntryBB->splitBasicBlock(&I);
----------------
All the consts are in the wrong place

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:212-213
+
+  // Delete the terminator of EntryBB as we are going to replace it.
+  EntryBB->getTerminator()->eraseFromParent();
+
----------------
I'm pretty sure there's a utility function that deals with what you're trying to do, that will also handle the DominatorTree updates for this

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:219
+  Type *const Ty = I.getType();
+  const unsigned TyBitWidth = Ty->getPrimitiveSizeInBits();
+  Type *const VecTy = VectorType::get(B.getInt32Ty(), 2);
----------------
You pretty much should never use getPrimitiveSizeInBits directly. Use the DataLayout

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:229-234
+  Function *const ReadRegisterIntrinsic = Intrinsic::getDeclaration(
+      I.getModule(), Intrinsic::read_register, B.getInt64Ty());
+  MDNode *const RegName =
+      llvm::MDNode::get(Context, llvm::MDString::get(Context, "exec"));
+  Value *const Metadata = llvm::MetadataAsValue::get(Context, RegName);
+  Value *const Exec = B.CreateCall(ReadRegisterIntrinsic, Metadata);
----------------
You need to mark this call site as convergent

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:243-244
+  Value *const ExtractHi = B.CreateExtractElement(BitCast, B.getInt32(1));
+  Function *const MbcntLoIntrinsic =
+      Intrinsic::getDeclaration(I.getModule(), Intrinsic::amdgcn_mbcnt_lo);
+  Function *const MbcntHiIntrinsic =
----------------
CreateIntrinsic

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:294-295
+    LaneOffset =
+        B.CreateCall(MovDPPIntrinsic, {NewV, B.getInt32(0x138), B.getInt32(0xf),
+                                       B.getInt32(0xf), B.getTrue()});
+    LaneOffset = B.CreateCall(WWMIntrinsic, LaneOffset);
----------------
Can you use enums to derive these values

================
Comment at: lib/Target/AMDGPU/AMDGPUAtomicOptimizer.cpp:367-378
+    Value *const BitCast = B.CreateBitCast(PHI, VecTy);
+    Value *const ExtractLo = B.CreateExtractElement(BitCast, B.getInt32(0));
+    Value *const ExtractHi = B.CreateExtractElement(BitCast, B.getInt32(1));
+    Value *const ReadFirstLaneLo =
+        B.CreateCall(ReadFirstLaneIntrinsic, ExtractLo);
+    Value *const ReadFirstLaneHi =
+        B.CreateCall(ReadFirstLaneIntrinsic, ExtractHi);
----------------
The canonical way to do this in IR is lshr x, 32 and truncate

https://reviews.llvm.org/D51969