[Mlir-commits] [mlir] [mlir][nvgpu] Mark TMA descriptor as MemWriteAt in `tma.async.store` (PR #79427)
Guray Ozen
llvmlistbot at llvm.org
Fri Jan 26 07:45:08 PST 2024
grypp wrote:
> I have some questions about the NVGPU_TensorMapDescriptor type in MLIR.
> Does it map to a CUtensorMap object? Or to a pointer to a CUtensorMap object?
This is a great question. I ran experiments to find the fastest way to pass a `CUtensorMap` to the device and found two approaches:
1. Pass the `CUtensorMap` object as a kernel parameter (not as a pointer). This requires marking the kernel parameter as `__grid_constant__`, which LLVM currently does not support.
2. Copy a `CUtensorMap *` to the device and use the `prefetch.tensormap` PTX instruction to ensure the descriptor is in the cache.
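Within the NVGPU dialect, the second approach corresponds to prefetching the device-resident descriptor before the first TMA operation. A minimal sketch (assuming the `nvgpu.tma.prefetch.descriptor` op, which lowers to the `prefetch.tensormap` PTX instruction; `!tmaDesc` abbreviates the descriptor type used later in this mail):

```mlir
// Bring the device-side CUtensorMap into cache before any TMA traffic uses it.
nvgpu.tma.prefetch.descriptor %tensorMap : !tmaDesc
```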
> In CUDA, the CUtensorMap object is set up on the host and then copied to GPU memory. Inside the kernel, the PTX instructions use a CUtensorMap *.
This is the second approach; I believe the functionality is implemented as you describe. Let me explain the `CUtensorMap` generation and copy steps. Take the IR below and run the `convert-nvgpu-to-nvvm` pass:
```mlir
!mbarDesc = !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>, num_barriers = 1>
!tmaDesc = !nvgpu.tensormap.descriptor<tensor = memref<64x64xf16, 3>, swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
func.func @main() {
// ... (omitting some lines for brevity)
// Create the TMA descriptor and memcpy it to the device
%2 = nvgpu.tma.create.descriptor %cast box[%c64, %c64] : memref<*xf16> -> !tmaDesc
gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c64, %arg10 = %c1, %arg11 = %c1) dynamic_shared_memory_size %c8192 {
// ... (omitting some lines for brevity)
// Call to initialize mbarrier
%5 = nvgpu.mbarrier.create -> !mbarDesc
nvgpu.mbarrier.init %5[%c0], %c1 : !mbarDesc
// Asynchronously load the tensor through the TMA descriptor into shared memory
nvgpu.tma.async.load %2[%c0, %c0], %5[%c0] to %shmem, predicate = %tidx0 : !tmaDesc, !mbarDesc -> memref<64x64xf16, #gpu.address_space<workgroup>>
gpu.terminator
}
return
}
```
The `%tensorDesc` below is a device pointer, because the [mgpuTensorMapEncodeTiledMemref](https://github.com/llvm/llvm-project/blob/main/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp#L498-L503) function creates a `CUtensorMap` object on the host and memcpys it to the device:
```mlir
%tensorDesc = llvm.call @mgpuTensorMapEncodeTiledMemref(%9, %10, %3, %1, %0, %1, %1, !llvm.ptr) : (i64, !llvm.ptr, i64, i64, i64, i64, i64, !llvm.ptr) -> !llvm.ptr
gpu.launch () ...
{
// ... (omitting some lines for brevity)
// Asynchronous bulk tensor copy from global memory to shared memory via the descriptor
nvvm.cp.async.bulk.tensor.shared.cluster.global %23, %tensorDesc, %21, box[%24, %24] predicate = %15 : <3>, !llvm.ptr
}
```
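For the store direction that this PR is about, the same device-side descriptor is passed to the op, and it is the descriptor operand that receives the `MemWriteAt` annotation. A rough sketch (op syntax approximated from the nvgpu dialect; `!tmaDesc` and the SSA names reuse the first example above):

```mlir
// Asynchronously store %shmem back to global memory through the TMA descriptor.
nvgpu.tma.async.store %shmem to %2[%c0, %c0], predicate = %tidx0 : memref<64x64xf16, #gpu.address_space<workgroup>> -> !tmaDesc
```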
https://github.com/llvm/llvm-project/pull/79427
More information about the Mlir-commits
mailing list