[Mlir-commits] [mlir] [mlir][nvgpu] NVGPU Tutorials (PR #87065)
Jacques Pienaar
llvmlistbot at llvm.org
Wed Apr 10 00:59:25 PDT 2024
================
@@ -0,0 +1,320 @@
+# RUN: env SUPPORT_LIB=%mlir_cuda_runtime \
+# RUN: %PYTHON %s | FileCheck %s
+
+# ===----------------------------------------------------------------------===//
+# Chapter 5 : Warp Specialized GEMM with Tensor Core
+# ===----------------------------------------------------------------------===//
+#
+# This program demonstrates a GEMM operation for `f32 += f16 * f16`, using the
+# warp-specialized method with a tile size of 128x128x64. The code fully
+# parallelizes the two outermost loops across thread blocks. It launches two
+# warp groups (256 threads in total): one producer and one consumer. Each
+# group follows a different control flow: the producer warp group loads data
+# into shared memory, while the consumer warp group executes the Tensor Core
+# GEMM operation and the epilogue.
+#
+# for ti in range(M//128): # -> blockIdx.x
+# for tj in range(N//128): # -> blockIdx.y
+# with wg_producer:
+# for tk in range(K//64):
+# TMA_128x64_64x128...
+# with wg_consumer:
+# for tk in range(K//64):
+# MMA_128x128x64...
+# Epilogue..
+#
+# This chapter demonstrates:
+#  Two warp groups (WGs): one producer, one consumer
----------------
jpienaar wrote:
I'd type this out a bit more.
https://github.com/llvm/llvm-project/pull/87065
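To spell out the two-warp-group structure a bit more, here is a hedged host-side analogy using Python threads: a bounded queue stands in for the shared-memory staging buffers guarded by mbarriers, the producer thread plays the TMA-issuing warp group, and the consumer thread plays the MMA warp group. The stage count and tile count are hypothetical; none of this is the actual NVGPU API.

```python
import queue
import threading

NUM_K_TILES = 4   # e.g. K // 64 for K = 256 (illustrative)
NUM_STAGES = 2    # hypothetical double-buffered shared-memory staging depth

# Bounded queue as the mbarrier-guarded stages: put() blocks when every
# stage is full, get() blocks until a stage has arrived.
stages = queue.Queue(maxsize=NUM_STAGES)
results = []

def producer():
    # Mirrors the producer warp group: issue one "TMA" load per K tile.
    for tk in range(NUM_K_TILES):
        stages.put(("tile", tk))      # tile lands in a shared-memory stage

def consumer():
    # Mirrors the consumer warp group: one "MMA" per K tile, then the epilogue.
    acc = 0
    for _ in range(NUM_K_TILES):
        _, tk = stages.get()          # wait on the stage's "mbarrier"
        acc += tk                     # stand-in for the 128x128x64 MMA
    results.append(acc)               # epilogue: write the accumulator out

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert results == [sum(range(NUM_K_TILES))]
```

The point of the analogy is the decoupling: because the loads and the multiplies run in different groups, the load for tile `tk+1` can be in flight while tile `tk` is being multiplied, bounded only by the number of stages.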