[llvm] [Doc][AMDGPU] Document the waitcnts required before SCOPE_SYS stores on GFX12 (PR #156424)

Wed Sep 3 00:28:12 PDT 2025

https://github.com/Pierre-vh updated https://github.com/llvm/llvm-project/pull/156424

From 25cfff549deef453e5d441d46042885d8e754e7c Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Tue, 2 Sep 2025 10:41:52 +0200
Subject: [PATCH 1/2] [Doc][AMDGPU] Document the waitcnts required before
 SCOPE_SYS stores on GFX12

This case was undocumented until now.
---
 llvm/docs/AMDGPUUsage.rst | 25 ++++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index afd0d9e7539ef..e5f9d5f021c04 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14510,6 +14510,14 @@ For GFX12:
 * A memory attached last level (MALL) cache exists for GPU memory.
   The MALL cache is fully coherent with GPU memory and has no impact on system
   coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
+* The wait instructions below must be issued before any ``SCOPE_SYS`` store so
+  that the store remains ordered after previous memory operations.
+
+  * ``s_wait_loadcnt 0x0``
+  * ``s_wait_storecnt 0x0``
+  * ``s_wait_kmcnt 0x0``
+  * ``s_wait_samplecnt 0x0``
+  * ``s_wait_bvhcnt 0x0``
 
 Scalar memory operations are only used to access memory that is proven to not
 change during the execution of the kernel dispatch. This includes constant
@@ -14669,7 +14677,20 @@ the instruction in the code sequence that references the table.
                                - wavefront    - generic
                                - workgroup                 - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
                                - agent
-                               - system
+     store atomic monotonic    - system       - global   1. | ``s_wait_loadcnt 0x0``
+                                              - generic     | ``s_wait_storecnt 0x0``
+                                                            | ``s_wait_kmcnt 0x0``
+                                                            | ``s_wait_samplecnt 0x0``
+                                                            | ``s_wait_bvhcnt 0x0``
+
+                                                           - The waits can be independently moved as long as the
+                                                             counter they wait on is known to be zero before issuing
+                                                             the following store instruction.
+
+                                                         2. buffer/global/flat_store
+
+                                                           - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
      store atomic monotonic    - singlethread - local    1. ds_store
                                - wavefront
                                - workgroup
@@ -15255,7 +15276,9 @@ the instruction in the code sequence that references the table.
                                                             | ``s_wait_storecnt 0x0``
                                                             | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_dscnt 0x0``
+                                                            | ``s_wait_kmcnt 0x0``
 
+                                                           - If agent scope, omit ``s_wait_kmcnt 0x0``.
                                                            - If OpenCL, omit ``s_wait_dscnt 0x0``.
                                                            - The waits can be
                                                              independently moved

From 612631be11a42c4d751f52e93b21adf486eef08c Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Wed, 3 Sep 2025 09:27:56 +0200
Subject: [PATCH 2/2] Reorder waits in docs

---
 llvm/docs/AMDGPUUsage.rst | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index e5f9d5f021c04..abd7b210eb0eb 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -15271,12 +15271,12 @@ the instruction in the code sequence that references the table.
                                - system       - generic
                                                             - If agent scope, omit.
 
-                                                         2. | ``s_wait_bvhcnt 0x0``
-                                                            | ``s_wait_samplecnt 0x0``
+                                                         2. | ``s_wait_loadcnt 0x0``
                                                             | ``s_wait_storecnt 0x0``
-                                                            | ``s_wait_loadcnt 0x0``
-                                                            | ``s_wait_dscnt 0x0``
                                                             | ``s_wait_kmcnt 0x0``
+                                                            | ``s_wait_samplecnt 0x0``
+                                                            | ``s_wait_bvhcnt 0x0``
+                                                            | ``s_wait_dscnt 0x0``
 
                                                            - If agent scope, omit ``s_wait_kmcnt 0x0``.
                                                            - If OpenCL, omit ``s_wait_dscnt 0x0``.