[Mlir-commits] [mlir] [mlir][nvgpu] Mark TMA descriptor as MemWriteAt in `tma.async.store` (PR #79427)

Guray Ozen llvmlistbot at llvm.org
Fri Jan 26 07:45:08 PST 2024


grypp wrote:

> I have some questions about the NVGPU_TensorMapDescriptor type in MLIR.
> Does it map to a CUtensorMap object? Or to a pointer to a CUtensorMap object?

This is a great question. I've run experiments to find the fastest way to pass a `CUtensorMap` to the device. I found two approaches:
1. Pass the `CUtensorMap` object by value as a kernel parameter (not as a pointer). This requires marking the kernel parameter as `__grid_constant__`, which currently lacks support in LLVM.
2. Copy the `CUtensorMap` object to device memory, pass its device pointer (`CUtensorMap *`) to the kernel, and use the `prefetch.tensormap` PTX instruction to ensure the descriptor is in the cache.
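For the second approach, a minimal sketch of the kernel-side prefetch in nvgpu IR (this assumes the `nvgpu.tma.prefetch.descriptor` op from the current nvgpu dialect; the SSA names are illustrative):

```mlir
// Hypothetical names for illustration: %desc is the device pointer produced
// by nvgpu.tma.create.descriptor on the host side.
!tmaDesc = !nvgpu.tensormap.descriptor<tensor = memref<64x64xf16, 3>,
    swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>

// Inside the kernel: prefetch the device-resident descriptor into cache
// before the first tma.async.load that consumes it.
nvgpu.tma.prefetch.descriptor %desc : !tmaDesc
```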

> In CUDA, the CUtensorMap object is set up on the host, and then copied to GPU memory. Inside the kernel, the PTX instructions use a CUtensorMap *.

This is the second approach; I believe I have implemented the functionality as described. Let me walk through the `CUtensorMap` generation and copy steps. Take the IR below and run the `convert-nvgpu-to-nvvm` pass:

```mlir
!mbarDesc = !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 1>
!tmaDesc = !nvgpu.tensormap.descriptor<tensor = memref<64x64xf16, 3>, swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>

func.func @main() {
  // ... (omitting some lines for brevity)

  // Create the TMA descriptor on the host and memcpy it to the device
  %2 = nvgpu.tma.create.descriptor %cast box[%c64, %c64] : memref<*xf16> -> !tmaDesc

  gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c64, %arg10 = %c1, %arg11 = %c1) dynamic_shared_memory_size %c8192 {
    // ... (omitting some lines for brevity)

    // Call to initialize mbarrier
    %5 = nvgpu.mbarrier.create -> !mbarDesc
    nvgpu.mbarrier.init %5[%c0], %c1 : !mbarDesc

    // Asynchronously load TMA descriptor to shared memory
    nvgpu.tma.async.load %2[%c0, %c0], %5[%c0] to %shmem, predicate = %tidx0 : !tmaDesc, !mbarDesc -> memref<64x64xf16, #gpu.address_space<workgroup>>

    gpu.terminator
  }
  return
}
```

After lowering, `%tensorDesc` is a device pointer, because the [mgpuTensorMapEncodeTiledMemref](https://github.com/llvm/llvm-project/blob/main/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp#L498-L503) runtime function creates a `CUtensorMap` object on the host and memcpys it to the device:
```mlir
%tensorDesc = llvm.call @mgpuTensorMapEncodeTiledMemref(%9, %10, %3, %1, %0, %1, %1, !llvm.ptr) : (i64, !llvm.ptr, i64, i64, i64, i64, i64, !llvm.ptr) -> !llvm.ptr
gpu.launch () ...
{
  // ... (omitting some lines for brevity)

  // Asynchronously perform a bulk tensor operation with shared memory
  nvvm.cp.async.bulk.tensor.shared.cluster.global %23, %tensorDesc, %21, box[%24, %24] predicate = %15 : <3>, !llvm.ptr 
}

```




https://github.com/llvm/llvm-project/pull/79427

