[Mlir-commits] [mlir] [mlir][nvgpu] NVGPU Tutorials (PR #87065)

Manish Gupta llvmlistbot at llvm.org
Sat Mar 30 09:48:12 PDT 2024


================
@@ -0,0 +1,92 @@
+# RUN: env SUPPORT_LIB=%mlir_cuda_runtime \
+# RUN:   %PYTHON %s | FileCheck %s
+
+# ===----------------------------------------------------------------------===//
+#  Chapter 3: GEMM 64x64x64 with Tensor Core
+# ===----------------------------------------------------------------------===//
+#
+# This program demonstrates a 64x64x64 GEMM using Tensor Cores.
+#
+# This chapter demonstrates how to:
+# 1. Execute a TMA load for the two input matrices
+# 2. Perform a Tensor Core GEMM 64x64x64 with a warpgroup
+# 3. Store the fragmented registers to global memory with a warpgroup
+#
+# ===----------------------------------------------------------------------===//
+
+
+from mlir import ir
+from mlir.dialects import nvgpu, scf, arith, memref, vector, gpu
+from tools.nvdsl import *
+from mlir.extras import types as T
+import numpy as np
+
+
+@NVDSL.mlir_func
+def gemm_64_64_64(x, y, z):
+    token_ty = ir.Type.parse("!gpu.async.token")
+    t1 = gpu.wait(token_ty, [])
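+    # Allocate device buffers for x, y, and z, then copy the two inputs to
+    # the device; each gpu op waits on the previous op's async token.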
+    x_dev, t2 = gpu.alloc(x.type, token_ty, [t1], [], [])
+    y_dev, t3 = gpu.alloc(y.type, token_ty, [t2], [], [])
+    z_dev, t4 = gpu.alloc(z.type, token_ty, [t3], [], [])
+    t5 = gpu.memcpy(token_ty, [t4], x_dev, x)
+    t6 = gpu.memcpy(token_ty, [t5], y_dev, y)
+    t7 = gpu.wait(token_ty, [t6])
+
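+    # Create TMA descriptors for the two input matrices, using 128B swizzling
+    # for the shared memory layout.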
+    sw = nvgpu.TensorMapSwizzleKind.SWIZZLE_128B
+    x_tma = TMA([N, N], x.type, swizzle=sw)
+    y_tma = TMA([N, N], y.type, swizzle=sw)
+    x_tma.create_descriptor(x_dev)
+    y_tma.create_descriptor(y_dev)
+
+    @NVDSL.mlir_gpu_launch(grid=(1, 1, 1), block=(128, 1, 1), smem=16384)
----------------
manishucsd wrote:

```
smem_size_in_bytes = N * N * get_type_size(x.type) + N * N * get_type_size(y.type)
@NVDSL.mlir_gpu_launch(grid=(1, 1, 1), block=(128, 1, 1), smem=smem_size_in_bytes)
```

Let us try to have as few magic numbers as possible in the tutorial. This also highlights the use of APIs that are already present in the tutorial.
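As a quick sanity check, assuming N = 64 and 2-byte f16 elements (consistent with the hard-coded `smem=16384` in the launch above), the suggested expression reproduces the magic number:

```
# Assumed values: N = 64, f16 inputs (2 bytes per element).
smem_size_in_bytes = 64 * 64 * 2 + 64 * 64 * 2  # 8192 + 8192 = 16384 bytes
```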

https://github.com/llvm/llvm-project/pull/87065


More information about the Mlir-commits mailing list