[Mlir-commits] [mlir] [mlir][nvgpu] update commit group and wait async ops (PR #130482)

Sun Mar 30 04:26:08 PDT 2025

linuxlonelyeagle wrote:

> > As n gets larger, the number of values carried through the loop grows linearly. But there is no real need to carry the result of commit_group through the loop. This makes it harder for me to generate GPU code.
> 
> So, you're saying that generating iter_args for the token is difficult. However, I don't find it challenging to generate iter_args. When it's lowered, there's no downside because it results in the same PTX, whether using the current approach or the proposed one.
> 
> I think it’s better to focus on whether the current approach actually blocks you.

This approach wouldn't actually block me, but it certainly adds to the complexity of the problem. Keep in mind that the loop already has other values to carry.
Two values are always required in the loop: `smem_read`, which indicates where the current iteration prefetches data into shared memory, and `smem_write`, which indicates the shared-memory location to be read into registers.

Three more values are optional: the A, B, and C register fragments of the matrix. If the k size of the warp tile being computed equals the k size of the tensor core, no register prefetching is needed and these three values do not exist.

On top of that, we are now discussing the wait-group values. The problem is honestly complicated enough already.
This algorithm is similar to https://github.com/NVIDIA/cutlass/blob/06e560d98a5fe8acb975db2c4c26817b6c90acb1/examples/cute/tutorial/sgemm_sm80.cu
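
To illustrate the argument, here is a minimal Python sketch of a count-based software pipeline in the spirit of the linked example. It is a toy model and not the PR's or CUTLASS's implementation: `run_pipeline`, `consume_stage`, `fill_stage`, and the helper names are all hypothetical. The point it tries to show is that the loop carries only two ring-buffer indices however many stages are in flight, because the wait is expressed as a count rather than as a token per outstanding group.

```python
from collections import deque


def run_pipeline(num_tiles, stages):
    """Toy count-based async-copy pipeline; returns tiles in consumption order."""
    assert stages >= 2
    smem = [None] * stages  # ring buffer standing in for shared-memory stages
    pending = deque()       # committed groups, oldest first
    next_tile = 0

    def issue_and_commit(stage):
        # Issue one copy (if any tile is left) and commit it as its own group.
        nonlocal next_tile
        group = []
        if next_tile < num_tiles:
            group.append((stage, next_tile))
            next_tile += 1
        pending.append(group)  # an empty group keeps the count uniform in the tail

    def wait_group(n):
        # Complete the oldest groups until at most n groups remain pending.
        while len(pending) > n:
            for stage, tile in pending.popleft():
                smem[stage] = tile

    # Prologue: start stages - 1 copies before the main loop.
    for s in range(stages - 1):
        issue_and_commit(s)

    # The only loop-carried state: two indices, independent of `stages`.
    consume_stage, fill_stage = 0, stages - 1
    consumed = []
    for _ in range(num_tiles):
        wait_group(stages - 2)                # the oldest in-flight tile has landed
        consumed.append(smem[consume_stage])  # "compute" on that tile
        issue_and_commit(fill_stage)          # prefetch into the free stage
        fill_stage = consume_stage
        consume_stage = (consume_stage + 1) % stages
    return consumed


print(run_pipeline(5, 3))
```

In a token-based formulation of the same loop, each group still in flight would also have to be carried as a loop argument, which is where the linear growth in n comes from.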

**A key point is that I'm not the only one who benefits from this PR**, though I have been waiting for it to merge. You also didn't answer my second question, which concerns the following example from https://mlir.llvm.org/docs/Dialects/NVGPU/#nvgpudevice_async_create_group-nvgpudeviceasynccreategroupop. It's wrong, isn't it?
```
// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2
```
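
One possible reading of the objection can be sketched in Python. The sketch assumes that `nvgpu.device_async_wait` lowers to PTX `cp.async.wait_group N` with N defaulting to 0 when no group count is given. That instruction waits until at most N of the most recently committed groups are still pending, whatever token is passed. The class and function names below are hypothetical and model a single thread only.

```python
from collections import deque


class AsyncCopyModel:
    """Toy model of cp.async group bookkeeping for one thread."""

    def __init__(self):
        self.uncommitted = []   # copies issued since the last commit
        self.pending = deque()  # committed groups, oldest first
        self.completed = set()

    def copy(self, name):
        self.uncommitted.append(name)

    def commit_group(self):
        # Bundle every copy issued since the previous commit into one group.
        self.pending.append(tuple(self.uncommitted))
        self.uncommitted = []

    def wait_group(self, n):
        # Block until at most n of the most recent groups are still pending.
        while len(self.pending) > n:
            self.completed.update(self.pending.popleft())


def completed_after_wait(n):
    """Replay the documentation example and return what one wait(n) completes."""
    m = AsyncCopyModel()
    m.copy("cp1")
    m.copy("cp2")
    m.commit_group()  # group 1: copy 1 and copy 2
    m.copy("cp3")
    m.commit_group()  # group 2: copy 3
    m.wait_group(n)
    return set(m.completed)


print(sorted(completed_after_wait(0)))
print(sorted(completed_after_wait(1)))
```

Under that assumption, the first wait in the documentation example already completes copy 3 as well, so the token does not select which group is waited on. Waiting for group 1 alone would need a group count of 1.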

https://github.com/llvm/llvm-project/pull/130482