[Mlir-commits] [mlir] [MLIR-NVVMDialect]: Add cp.async.mbarrier.arrive Op (PR #74241)
llvmlistbot at llvm.org
Mon Dec 4 06:00:15 PST 2023
================
@@ -592,6 +592,42 @@ def NVVM_CpAsyncWaitGroupOp : NVVM_Op<"cp.async.wait.group">,
let assemblyFormat = "$n attr-dict";
}
+def NVVM_CpAsyncMBarrierArriveOp : NVVM_Op<"cp.async.mbarrier.arrive"> {
+ let summary = "NVVM Dialect Op for cp.async.mbarrier.arrive";
+ let description = [{
+ The `cp.async.mbarrier.arrive` Op makes the mbarrier object track
+ all prior cp.async operations initiated by the executing thread.
+ The `addr` operand specifies the address of the mbarrier object
+ in generic or shared memory space. The `noinc` attr controls
+ how the mbarrier's state is updated.
+ [For more information, refer to the PTX ISA]
+ (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-cp-async-mbarrier-arrive)
+ }];
+ let assemblyFormat = "$addr attr-dict `:` type(operands)";
+
+ let arguments = (ins
+ AnyTypeOf<[LLVM_AnyPointer, LLVM_PointerShared]>:$addr,
+ DefaultValuedAttr<I1Attr, "0">:$noinc);
+
+ string llvmBuilder = [{
+ unsigned addressSpace =
+ llvm::cast<LLVM::LLVMPointerType>(op.getAddr().getType()).getAddressSpace();
+ bool isShared = addressSpace == NVVM::kSharedMemorySpace;
+
+ llvm::Intrinsic::ID intId;
+ if ($noinc) {
+ intId = isShared ?
+ llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive_noinc_shared :
+ llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive_noinc;
+ } else {
+ intId = isShared ?
+ llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive_shared :
+ llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive;
+ }
+ createIntrinsicCall(builder, intId, {$addr});
+ }];
+}
+
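Based on the assembly format declared above, the op would be written roughly as follows (a sketch only; the `%bar`/`%gbar` SSA names are hypothetical, and the exact printed form of the `noinc` attribute may differ):

```mlir
// mbarrier in shared memory (address space 3): shared-variant intrinsic.
nvvm.cp.async.mbarrier.arrive %bar : !llvm.ptr<3>
// mbarrier via a generic pointer, with noinc set: generic noinc intrinsic.
nvvm.cp.async.mbarrier.arrive %gbar {noinc = true} : !llvm.ptr
```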
----------------
durga4github wrote:
This was exactly the item I wanted to get our preference on!
I did notice that the MBarrier Ops in NVVM have two implementations depending on shared/generic.
However, I admit I did not look into the NVGPU side of things. Looking at the lowering code you pointed to, it makes sense to abstract (or split) in only one place/Dialect. I am also divided on where we _should_ do it and why one would be better than the other.
In any case, I don't want to create a diverged implementation within the same (MBarrier) family of Ops. So, I will refresh the patch with a split implementation.
(Another thought: at the intrinsics level, there are already separate implementations for generic and shared. This way, everything below NVGPU has separate Ops, with NVGPU being the layer of abstraction.)
https://github.com/llvm/llvm-project/pull/74241