[Mlir-commits] [mlir] [mlir][nvgpu] update commit group and wait async ops (PR #130482)

Tue Mar 11 19:03:20 PDT 2025

linuxlonelyeagle wrote:

> I think this PR is breaking SSA semantic, this can cause problems right? Currently we have this ops:
> 
> ```
> %1 = nvgpu.device_async_create_group %0
> nvgpu.device_async_wait %1 { numGroups = 1 : i32 }
> ```
> 
> PR makes the ops like the following. There is no SSA anymore.
> 
> ```
> nvgpu.device_async_create_group %0
> nvgpu.device_async_wait { numGroups = 1 : i32 }
> ```

Yes.But I don't think it will cause any problem.You just need to think of it as writing PTX inline assembly.
Here are my reasons：
1.https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-async-wait-group
You see here, it actually has no operands.To be honest with you, I think there seems to have misunderstood `cp.async.wait_group`.
The following code comes from https://mlir.llvm.org/docs/Dialects/NVGPU/#nvgpudevice_async_copy-nvgpudeviceasynccopyop
```
// copy 1.
%cp1 = nvgpu.device_async_copy %A[%c0], %B[%c0], 4 :memref<16xf32> to memref<16xf32, 3>
// copy 2.
%cp2 = nvgpu.device_async_copy %C[%c0], %D[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 1 contains copy 1 and copy 2.
%token1 = nvgpu.device_async_create_group %cp1, %cp2
// copy 3.
%cp3 = nvgpu.device_async_copy %E[%c0], %F[%c0], 4 : memref<16xf32> to memref<16xf32, 3>
// group 2 contains copy 3.
%token2 = nvgpu.device_async_create_group %cp3
// after the wait copy 1 and copy 2 are complete.
nvgpu.device_async_wait %token1
// after the wait copy 3 is complete.
nvgpu.device_async_wait %token2
```
I think the use of the two wait is wrong.The first wait should wait for all async ops.The token here has no effect.
Here two groups are created. You need to wait for cp1 and cp2 to complete. That is, you need to wait for group1 to complete and ignore group2. You should say:
```
// num_group indicates the maximum number of unfinished groups.
nvgpu.device_async_wait {num_groups = 1 } 
```
One thing that needs to be made clear is that wait is waiting for groups, not the cp Ops in a certain group.
2. Consider the following structure.Based on the above concept.Is this structure a bit strange?
The fundamental problem is that the current implementation assumes that the wait is for the cp in the group, but it is not.

```
 // some async copy op...
%group = nvgpu.create_group ....
 nvgpu.async_wait_group %group { num_groups = nstage - 2 }
for xxx {
  for xxx {
    // some async copy op...
    %group = nvgpu.create_group ....
    nvgpu.async_wait_group %group { num_groups = nstage - 2 }
  } 
}
```
If I want to describe the semantics I mentioned above.
```
 // some async copy op...
%group = nvgpu.create_group ....
 nvgpu.async_wait_group %group { num_groups = nstage - 2 }
for xxx {
  for xxx {
    // some async copy op...
    %group1 = nvgpu.create_group ....
     %group2 = nvgpu.push_group %group, %group1
    nvgpu.async_wait_group %group2 { num_groups = nstage - 2 }
  } 
}
```
Doing this implies that we wait on the entire group.So the simplest approach is to remove the operand of the wait op, which actually only needs the attribute.

https://github.com/llvm/llvm-project/pull/130482