[Mlir-commits] [mlir] 13d6233 - [MLIR][NVGPU] Fix nvgpu_arrive syntax in matmulBuilder.py (#113713)

Fri Oct 25 22:45:54 PDT 2024

Author: Durgadoss R
Date: 2024-10-26T11:15:50+05:30
New Revision: 13d6233e77982f2a596922a79365373e1466a968

URL: https://github.com/llvm/llvm-project/commit/13d6233e77982f2a596922a79365373e1466a968
DIFF: https://github.com/llvm/llvm-project/commit/13d6233e77982f2a596922a79365373e1466a968.diff

LOG: [MLIR][NVGPU] Fix nvgpu_arrive syntax in matmulBuilder.py (#113713)

This patch updates the nvgpu.mbarrier_arrive calls in
matmulBuilder.py to the current syntax, which no longer
takes the token result type as an explicit argument.
This fixes the compilation error for this test.

For the warp-specialized matmul_kernel implementation
(generate_matmul_ws), removing the nvvm.WgmmaWaitGroupSyncOp(0)
call after the MMA main loop fixes the observed hang.

With these two fixes, the test compiles and
executes successfully on an sm90a machine.
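
The call-site change can be illustrated without the MLIR Python bindings.
The sketch below uses a hypothetical stand-in for the generated builder, not
the real nvgpu binding. It assumes the builder accepts only the two operands
positionally and infers the token type itself, in which case the old
three-argument call is rejected and the new two-argument call goes through.

```python
# Hypothetical stand-in for the generated nvgpu.mbarrier_arrive builder.
# Assumption: only the two operands are positional and the token result
# type is inferred, so callers no longer pass it.
def mbarrier_arrive(barriers, mbar_id, *, loc=None, ip=None):
    # Return a tuple standing in for the op result (illustrative only).
    return ("!nvgpu.mbarrier.token", barriers, mbar_id)


# New call form used by this patch: operands only.
token = mbarrier_arrive("mbarDONE", 0)

# Old call form: the result type came first. With the assumed signature,
# Python rejects the extra positional argument.
try:
    mbarrier_arrive("!nvgpu.mbarrier.token", "mbarDONE", 0)
    old_call_ok = True
except TypeError:
    old_call_ok = False
```

The strings "mbarDONE" and 0 are placeholders for the MLIR values that
matmulBuilder.py passes.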

Signed-off-by: Durgadoss R <durgadossr at nvidia.com>

Added: 
    

Modified: 
    mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py

Removed: 
    


################################################################################
diff --git a/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py b/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
index 75f0dc947e0681..5394d4a3272555 100644
--- a/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
+++ b/mlir/test/Integration/GPU/CUDA/sm90/python/tools/matmulBuilder.py
@@ -568,9 +568,7 @@ def generate_matmul_ws(
                                 barId,
                                 predicate=consumerPrimaryThread,
                             )
-                            nvgpu.mbarrier_arrive(
-                                ir.Type.parse("!nvgpu.mbarrier.token"), mbarDONE, barId
-                            )
+                            nvgpu.mbarrier_arrive(mbarDONE, barId)
                             debug_print(
                                 "[cons] iv={}  | mbarDONE[{}] arrive [done]",
                                 iv,
@@ -589,14 +587,9 @@ def generate_matmul_ws(
                         # Step 6.3.5. Yield
                         scf.yield_([new_acc, phaseParity])
 
-                    # Step 6.3. Wait All WGMMA
-                    nvvm.WgmmaWaitGroupSyncOp(0)
-
                     with ir.InsertionPoint(scf.IfOp(consumerPrimaryThread).then_block):
                         barId = c((K // BLOCK_K) % num_stages)
-                        nvgpu.mbarrier_arrive(
-                            ir.Type.parse("!nvgpu.mbarrier.token"), mbarDONE, barId
-                        )
+                        nvgpu.mbarrier_arrive(mbarDONE, barId)
                         scf.yield_([])
 
                     # Step 6.4. Epilogue (registers --> shared memory)
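
The arrive that follows the main loop in the second hunk signals slot
(K // BLOCK_K) % num_stages of mbarDONE. A minimal sketch of that index
arithmetic in plain Python follows. The helper name and the example sizes
are assumptions for illustration and do not come from the patch.

```python
def final_barrier_slot(K, BLOCK_K, num_stages):
    """Slot index used by the arrive after the main loop.

    Mirrors the expression in the diff: (K // BLOCK_K) % num_stages.
    """
    return (K // BLOCK_K) % num_stages


# Assumed example sizes (not taken from the patch).
K, BLOCK_K, num_stages = 1024, 64, 3

# 1024 // 64 = 16 tiles along K, and 16 % 3 = 1.
slot = final_barrier_slot(K, BLOCK_K, num_stages)
print(slot)  # prints 1
```

When K // BLOCK_K is an exact multiple of num_stages, the result wraps back
to slot 0.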