[Mlir-commits] [mlir] [MLIR-NVVMDialect]: Add cp.async.mbarrier.arrive Op (PR #74241)

Guray Ozen llvmlistbot at llvm.org
Mon Dec 4 05:11:27 PST 2023


================
@@ -592,6 +592,42 @@ def NVVM_CpAsyncWaitGroupOp : NVVM_Op<"cp.async.wait.group">,
   let assemblyFormat = "$n attr-dict";
 }
 
+def NVVM_CpAsyncMBarrierArriveOp : NVVM_Op<"cp.async.mbarrier.arrive"> {
+  let summary = "NVVM Dialect Op for cp.async.mbarrier.arrive";
+  let description = [{
+    The `cp.async.mbarrier.arrive` Op makes the mbarrier object track
+    all prior cp.async operations initiated by the executing thread.
+    The `addr` operand specifies the address of the mbarrier object
+    in generic or shared memory space. The `noinc` attr impacts
+    how the mbarrier's state is updated.
+    [For more information, refer to the PTX ISA]
+    (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-cp-async-mbarrier-arrive)
+  }];
+  let assemblyFormat = "$addr attr-dict `:` type(operands)";
+
+  let arguments = (ins
+    AnyTypeOf<[LLVM_AnyPointer, LLVM_PointerShared]>:$addr,
+    DefaultValuedAttr<I1Attr, "0">:$noinc);
+
+  string llvmBuilder = [{
+    unsigned addressSpace =
+      llvm::cast<LLVM::LLVMPointerType>(op.getAddr().getType()).getAddressSpace();
+    bool isShared = addressSpace == NVVM::kSharedMemorySpace;
+
+    llvm::Intrinsic::ID intId;
+    if ($noinc) {
+      intId = isShared ?
+        llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive_noinc_shared :
+        llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive_noinc;
+    } else {
+      intId = isShared ?
+        llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive_shared :
+        llvm::Intrinsic::nvvm_cp_async_mbarrier_arrive;
+    }
+    createIntrinsicCall(builder, intId, {$addr});
+  }];
+}
+
----------------
grypp wrote:

This approach in the NVVM dialect differs slightly from the other mbarrier OPs. This OP has a single version covering both the `shared` and `generic` address spaces, whereas for the others I created two OPs in NVVM (shared and generic) plus one OP in the NVGPU dialect that selects the appropriate NVVM OP and handles the rest of NVVM's expectations.
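
For context, the new OP keeps a single spelling for both address spaces; only the pointer type (and the optional `noinc` attribute) changes. A hypothetical usage sketch based on the assembly format above (SSA name illustrative):

```
nvvm.cp.async.mbarrier.arrive %bar : !llvm.ptr                    // generic addrspace
nvvm.cp.async.mbarrier.arrive %bar {noinc = true} : !llvm.ptr<3>  // shared addrspace
```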

For instance, take `mbarrier.arrive.expect_tx`:

NVGPU has a single OP used with the `nvgpu.mbarrier.group` type, allowing access to multiple mbarriers through an SSA value index (`%mbarrier[%c0]`):

```
nvgpu.mbarrier.arrive.expect_tx %mbarrier[%c0], %txcount
    : !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>,
                            num_barriers = 3>
```

[The nvgpu-nvvm lowering code](https://github.com/llvm/llvm-project/blob/main/mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp#L907C8-L930) selects the appropriate NVVM Op, performs pointer arithmetic for the mbarrier, and truncates the txcount.
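
Schematically, for the shared-memory case, that rewrite produces something like the following (a rough sketch of the shape of the output, not the literal upstream code; SSA names are illustrative):

```
// Pointer arithmetic to reach barrier %c0 inside the mbarrier group.
%c0  = llvm.mlir.constant(0 : i64) : i64
%ptr = llvm.getelementptr %base[%c0] : (!llvm.ptr<3>, i64) -> !llvm.ptr<3>, i64
// Truncate the txcount to the i32 the NVVM OP expects.
%tx  = llvm.trunc %txcount : i64 to i32
// Select the shared-memory variant of the NVVM OP.
nvvm.mbarrier.arrive.expect_tx.shared %ptr, %tx : !llvm.ptr<3>, i32
```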

In the NVVM dialect, there are two OPs:

```
nvvm.mbarrier.arrive.expect_tx %0, %1 : !llvm.ptr, i32
nvvm.mbarrier.arrive.expect_tx.shared %0, %1 : !llvm.ptr<3>, i32
```

Let's decide where the choice between the generic and shared address spaces should be made. We could follow the approach I used in NVGPU, or shift the choice into NVVM and make the OPs consistent. Either way works, but let's aim for unification.
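
Concretely, shifting the choice into NVVM would mean collapsing the pair above into a single OP whose pointer type drives the intrinsic selection, mirroring the `cp.async.mbarrier.arrive` OP in this patch (hypothetical syntax):

```
nvvm.mbarrier.arrive.expect_tx %0, %1 : !llvm.ptr, i32     // lowers to the generic intrinsic
nvvm.mbarrier.arrive.expect_tx %0, %1 : !llvm.ptr<3>, i32  // lowers to the .shared intrinsic
```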



https://github.com/llvm/llvm-project/pull/74241

