[llvm] [Doc][AMDGPU] Add barrier execution & memory model (PR #170447)

via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 3 01:46:40 PST 2025


llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

<details>
<summary>Changes</summary>

Add a formal execution model, and a memory model for the execution barrier
primitives available in GFX12.0 and below.
The model also works for GFX12.5 workgroup/workgroup trap barriers, but does
not include the new barrier types and instructions added in GFX12.5.
These will be added at a later date.

---
Full diff: https://github.com/llvm/llvm-project/pull/170447.diff


1 Files Affected:

- (modified) llvm/docs/AMDGPUUsage.rst (+317) 


``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index c28d6ce792184..ef44344b54e55 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1601,6 +1601,20 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                    * 1 - Data cache.
 
                                                    Instruction cache prefetches are unsafe on invalid address.
+
+
+  llvm.amdgcn.s.barrier                            Performs a barrier *signal* operation immediately followed
+                                                   by a barrier *wait* operation on the *workgroup barrier* object.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+
+  llvm.amdgcn.s.barrier.signal                     Performs a barrier *signal* operation on the barrier *object* determined by the ``i32`` immediate argument.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.
+
+  llvm.amdgcn.s.barrier.wait                       Performs a barrier *wait* operation on the barrier *object* determined by the ``i16`` immediate argument.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.
+
   ==============================================   ==========================================================
 
 .. TODO::
@@ -6553,6 +6567,281 @@ The Private Segment Buffer is always requested, but the Private Segment
 Wavefront Offset is only requested if it is used (see
 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 
+.. _amdgpu-amdhsa-execution-barriers:
+
+Execution Barriers
+~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+  This specification is a work-in-progress (see lines annotated with :sup:`WIP`), and is not complete for GFX12.5.
+
+Threads can synchronize execution by performing barrier operations on barrier *objects* as described below:
+
+* Barrier *objects* have the following state:
+
+  * A non-zero unsigned integer *expected count*: counts the number of *signal* operations
+    expected for this barrier *object*.
+  * An unsigned integer *signal count*: counts the number of *signal* operations
+    already performed on this barrier *object*.
+
+    * The initial value of *signal count* is zero.
+    * When an operation causes *signal count* to be equal to *expected count*, the barrier is completed,
+      and the *signal count* is reset to zero.
+
+* Barrier operations are performed on barrier *objects*. A barrier operation is a dynamic instance
+  of one of the following:
+
+  * Barrier *init*.
+  * Barrier *join*.
+  * Barrier *leave*: decrements *expected count* of the barrier *object* by one.
+  * Barrier *signal*: increments *signal count* of the barrier *object* by one.
+  * Barrier *wait*.
+
+* Barrier modification operations are barrier operations that modify the barrier *object* state:
+
+  * Barrier *init*.
+  * Barrier *leave*.
+  * Barrier *signal*.
+
+* For a given barrier *object* ``BO``:
+
+  * There is exactly one barrier *init* for ``BO``. :sup:`WIP`
+  * *Thread-barrier-order<BO>* is the subset of *program-order* that only
+    relates barrier operations performed on ``BO``.
+  * Let ``S`` be the set of barrier modification operations on ``BO``, then
+    *barrier-modification-order<BO>* is a strict total order over ``S``. It is the order
+    in which ``BO`` observes barrier operations that change its state.
+
+    * *Barrier-modification-order<BO>* is consistent with *happens-before*.
+    * The first element in *barrier-modification-order<BO>* is a barrier *init*.
+      There is only one barrier *init* in *barrier-modification-order<BO>*.
+
+  * *Barrier-joined-before<BO>* is a strict partial order over barrier operations on ``BO``.
+    A barrier *join* ``J`` is *barrier-joined-before<BO>* a barrier operation ``X`` if and only if all
+    of the following are true:
+
+    * ``J -> X`` in *thread-barrier-order<BO>*.
+    * There is no barrier *leave* ``L`` where ``J -> L -> X`` in *thread-barrier-order<BO>*.
+
+  * *Barrier-participates-in<BO>* is a partial order that relates barrier operations to barrier *waits*.
+    A barrier operation ``X`` may *barrier-participates-in<BO>* a barrier *wait* ``W`` if all of the following
+    are true:
+
+    * ``X`` and ``W`` are both performed on ``BO``.
+    * ``X`` is a barrier *signal* or *leave* operation.
+    * ``X`` does not *barrier-participates-in<BO>* another barrier *wait* ``W'`` in the same thread as ``W``.
+    * ``W -> X`` **not** in *thread-barrier-order<BO>*.
+
+* Let ``S`` be the set of barrier operations that *barrier-participates-in<BO>* a barrier *wait* ``W`` for some
+  barrier *object* ``BO``, then all of the following are true:
+
+  * ``S`` cannot be empty. :sup:`WIP`
+  * The elements of ``S`` form a contiguous interval of *barrier-modification-order<BO>*.
+  * Let ``A`` be the first operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* of ``BO``
+    is zero before ``A`` is performed.
+  * Let ``B`` be the last operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* and
+    *expected count* of ``BO`` are equal after ``B`` is performed. ``B`` is the only barrier operation in ``S``
+    that causes the *signal count* and *expected count* of ``BO`` to be equal.
+
+* For every barrier *signal* ``S`` performed on a barrier *object* ``BO``:
+
+  * The immediate successor of ``S`` in *thread-barrier-order<BO>* is a barrier *wait*. :sup:`WIP`
+
+* For every barrier *wait* ``W`` performed on a barrier *object* ``BO``:
+
+  * There is a barrier *join* ``J`` such that ``J -> W`` in *barrier-joined-before<BO>*. :sup:`WIP`
+
+* For every barrier *join* ``J`` performed on a barrier *object* ``BO``:
+
+  * There is no other barrier operation *thread-barrier-ordered<BO>* before ``J``. :sup:`WIP`
+  * ``J`` is not *barrier-joined-before<BO>* another barrier *join*.
+
+* For every barrier *leave* ``L`` performed on a barrier *object* ``BO``:
+
+  * There is no other barrier operation *thread-barrier-ordered<BO>* after ``L``. :sup:`WIP`
+
+* *Barrier-executes-before* is a strict partial order over all barrier operations. It is the union of the following
+  orders:
+
+  * *Thread-barrier-order<BO>* for every barrier object ``BO``.
+  * *Barrier-participates-in<BO>* for every barrier object ``BO``.
+
+* *Barrier-executes-before* is consistent with *program-order*.
+
+*Barrier-executes-before* represents the order in which barrier operations will complete by relating operations
+from different threads.
+For example, if ``A -> B`` in *barrier-executes-before*, then the execution of ``A`` must complete
+before the execution of ``B`` can complete.
+
+When a barrier *signal* ``S`` *barrier-executes-before* a barrier *wait* ``W``, ``S`` executes before ``W``
+**as-if** ``S`` is *program-ordered* before ``W``. Thus, all *dynamic instances* *program-ordered* before ``S``
+are known to have been executed before the *dynamic instances* *program-ordered* after ``W``.
+
+  .. note::
+
+    Barriers only synchronize execution, not memory: ``S -> W`` in *barrier-executes-before* does not imply
+    ``S`` *happens-before* ``W``. Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
+    to also synchronize memory.
+
+Target-Specific Properties
+++++++++++++++++++++++++++
+
+This section covers properties of barrier operations and *objects* that are specific to the implementation of
+barriers in AMDGPU hardware.
+
+Barrier operations have the following additional target-specific properties:
+
+* Barrier operations are convergent within a wavefront. All threads of a wavefront use the same barrier *object* when
+  performing any barrier operation.
+
+All barrier *objects* have the following additional target-specific properties:
+
+* Barrier *join* does not increment the *expected count* of a barrier *object*. The *expected count* is set
+  during initialization of the barrier by the hardware. :sup:`WIP`
+* Barrier *objects* are allocated and managed by the hardware.
+
+  * Barrier *objects* are stored in an inaccessible memory location.
+
+* Barrier *objects* exist within a *scope* (see :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`),
+  and can only be accessed by threads in that *scope*.
+
+See :ref:`amdgpu-llvm-ir-intrinsics-table` for more information on how to perform barrier operations using
+LLVM IR intrinsic calls, or see the sections below to perform barrier operations using machine code.
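+
+As an illustration, the following is a minimal LLVM IR sketch (the kernel name is
+hypothetical and not part of this specification) of performing a combined barrier
+*signal* and *wait* on the *workgroup barrier* object through the
+``llvm.amdgcn.s.barrier`` intrinsic:
+
+.. code-block:: llvm
+
+  ; Combined signal-then-wait on the workgroup barrier object.
+  define amdgpu_kernel void @combined_barrier() {
+    call void @llvm.amdgcn.s.barrier()
+    ret void
+  }
+
+  declare void @llvm.amdgcn.s.barrier()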
+
+Informational Notes
++++++++++++++++++++
+
+Informally, we can deduce from the above formal model that execution barriers behave as follows:
+
+* Synchronization of threads always happens at wavefront granularity.
+* When a barrier *signal* or *leave* causes the *signal count* of a barrier *object* to be identical to the
+  *expected count*, the *signal count* is reset to zero, and threads that have *joined* the barrier *object*
+  will:
+
+  * Wake up if they were sleeping because of a barrier *wait*, **or**
+  * Skip the next barrier *wait* operation if they have not previously *waited*.
+
+Execution Barrier GFX6-11
++++++++++++++++++++++++++
+
+Targets from GFX6 through GFX11 inclusive do not have the split-barrier feature:
+the barrier *signal* and barrier *wait* operations cannot be performed independently.
+
+There is only one *workgroup barrier* object of ``workgroup`` scope that is implicitly used
+by all barrier operations.
+
+  .. table:: AMDHSA Execution Barriers Code Sequences GFX6-GFX11
+     :name: amdgpu-amdhsa-execution-barriers-code-sequences-gfx6-gfx11-table
+
+     ===================== ====================== ===========================================================
+     Barrier Operation(s)  Barrier *Object*       AMDGPU Machine Code
+     ===================== ====================== ===========================================================
+     **Init, Join and Leave**
+     --------------------------------------------------------------------------------------------------------
+     *init*                - *Workgroup barrier*  The hardware automatically initializes this barrier *object*
+                                                  when a workgroup is launched: the *expected count* of the
+                                                  *workgroup barrier* object is set to the number of waves
+                                                  in the workgroup.
+
+                                                  The number of waves in the workgroup is not known until all
+                                                  waves of the workgroup have launched. Thus, the
+                                                  *expected count* of this barrier *object* can never be
+                                                  equal to its *signal count* until all wavefronts of the
+                                                  workgroup have launched.
+
+     *join*                - *Workgroup barrier*  Any thread launched within a workgroup automatically
+                                                  *joins* this barrier *object*.
+
+     *leave*               - *Workgroup barrier*  When a thread ends, it automatically *leaves* this
+                                                  barrier *object* if it had previously *joined* it.
+
+     **Signal and Wait**
+     --------------------------------------------------------------------------------------------------------
+     *signal* then *wait*  - *Workgroup barrier*  | **BackOffBarrier**
+                                                  | ``s_barrier``
+                                                  | **No BackOffBarrier**
+                                                  | ``s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)``
+                                                  | ``s_waitcnt_vscnt null, 0x0``
+                                                  | ``s_barrier``
+
+                                                  - If the target does not have the BackOffBarrier feature,
+                                                    then there cannot be any outstanding memory operations
+                                                    before issuing the ``s_barrier`` instruction.
+                                                  - The waitcnts can independently be moved earlier, or
+                                                    removed entirely as long as the associated
+                                                    counter remains at zero before issuing the
+                                                    ``s_barrier`` instruction.
+
+     *signal*              - *Workgroup barrier*  Not available separately; see *signal* then *wait*.
+
+     *wait*                - *Workgroup barrier*  Not available separately; see *signal* then *wait*.
+     ===================== ====================== ===========================================================
+
+Execution Barrier GFX12
++++++++++++++++++++++++
+
+.. note::
+
+  This is incomplete for GFX12.5.
+
+GFX12 targets have the split-barrier feature, and also offer multiple barrier *objects* per workgroup
+(see :ref:`amdgpu-amdhsa-execution-barriers-ids-gfx12-table`).
+
+  .. table:: AMDHSA Execution Barriers Code Sequences GFX12
+     :name: amdgpu-amdhsa-execution-barriers-code-sequences-gfx12-table
+
+     ===================== =========================== ===========================================================
+     Barrier Operation(s)  Barrier *Object*            AMDGPU Machine Code
+     ===================== =========================== ===========================================================
+     **Init, Join and Leave**
+     -------------------------------------------------------------------------------------------------------------
+     *init*                - *Workgroup barrier*       The hardware automatically initializes this barrier *object*
+                           - *Workgroup trap barrier*  when a workgroup is launched: the *expected count* of the
+                                                       barrier *object* is set to the number of waves in the
+                                                       workgroup.
+
+                                                       The number of waves in the workgroup is not known until all
+                                                       waves of the workgroup have launched. Thus, the
+                                                       *expected count* of this barrier *object* can never be
+                                                       equal to its *signal count* until all wavefronts of the
+                                                       workgroup have launched.
+
+     *join*                - *Workgroup barrier*       Any thread launched within a workgroup automatically
+                           - *Workgroup trap barrier*  *joins* this barrier *object*.
+
+     *leave*               - *Workgroup barrier*       When a thread ends, it automatically *leaves* this
+                           - *Workgroup trap barrier*  barrier *object* if it had previously *joined* it.
+
+     **Signal and Wait**
+     -------------------------------------------------------------------------------------------------------------
+
+     *signal*              - *Workgroup barrier*       | ``s_barrier_signal -1``
+                                                       | Or
+                                                       | ``s_barrier_signal_isfirst -1``
+
+
+     *wait*                - *Workgroup barrier*       ``s_barrier_wait -1``.
+
+     *signal*              - *Workgroup trap barrier*  Not available to the shader.
+
+     *wait*                - *Workgroup trap barrier*  Not available to the shader.
+     ===================== =========================== ===========================================================
+
+
+  .. table:: AMDHSA Execution Barriers IDs GFX12
+     :name: amdgpu-amdhsa-execution-barriers-ids-gfx12-table
+
+     =========== ============== ==============================================================
+     Barrier ID  Scope          Description
+     =========== ============== ==============================================================
+     ``-2``      ``workgroup``  *Workgroup trap barrier*, dedicated for the trap handler and
+                                only available in privileged execution mode
+                                (not accessible by the shader).
+
+     ``-1``      ``workgroup``  *Workgroup barrier*.
+     =========== ============== ==============================================================
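+
+As an illustration, the following is a minimal LLVM IR sketch (the kernel name and its
+arguments are hypothetical and not part of this specification) of using the split
+barrier on the *workgroup barrier* object (barrier ID ``-1``) through the GFX12
+intrinsics. Because the *signal* and *wait* operations are performed independently,
+work that does not depend on other waves can be placed between them:
+
+.. code-block:: llvm
+
+  define amdgpu_kernel void @split_barrier(ptr addrspace(1) %out, i32 %v) {
+  entry:
+    ; Signal the workgroup barrier (barrier ID -1).
+    call void @llvm.amdgcn.s.barrier.signal(i32 -1)
+    ; Independent work can overlap with other waves reaching the barrier.
+    %v2 = mul i32 %v, %v
+    store i32 %v2, ptr addrspace(1) %out
+    ; Wait until all waves of the workgroup have signalled.
+    call void @llvm.amdgcn.s.barrier.wait(i16 -1)
+    ret void
+  }
+
+  declare void @llvm.amdgcn.s.barrier.signal(i32 immarg)
+  declare void @llvm.amdgcn.s.barrier.wait(i16 immarg)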
+
 .. _amdgpu-amdhsa-memory-model:
 
 Memory Model
@@ -6678,6 +6967,30 @@ following sections:
 * :ref:`amdgpu-amdhsa-memory-model-gfx12`
 * :ref:`amdgpu-amdhsa-memory-model-gfx125x`
 
+.. _amdgpu-amdhsa-execution-barriers-memory-model:
+
+Execution Barriers
+++++++++++++++++++
+
+.. note::
+  See :ref:`amdgpu-amdhsa-execution-barriers` for definitions of the terminology used
+  in this section.
+
+* A barrier *signal* operation ``S`` can pair with a release fence program-ordered before it
+  to form a ``barrier-signal-release`` ``BR``. The synchronization scope and the set of address
+  spaces affected are determined by the release fence.
+* A barrier *wait* operation ``W`` can pair with an acquire fence program-ordered after it to
+  form a ``barrier-wait-acquire`` ``BA``. The synchronization scope and the set of address
+  spaces affected are determined by the acquire fence.
+
+A ``BR`` *synchronizes-with* ``BA`` in an address space *AS* if and only if:
+
+* ``S`` *barrier-executes-before* ``W``.
+* ``BA`` and ``BR``'s :ref:`synchronization scopes<amdgpu-memory-scopes>` overlap.
+* ``BA`` and ``BR``'s :ref:`synchronization scopes<amdgpu-memory-scopes>`
+  allow cross address space synchronization (they cannot have ``one-as``).
+* ``BA`` and ``BR``'s address spaces both include *AS*.
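+
+As an illustration, the following is a minimal LLVM IR sketch (the kernel, the LDS buffer,
+and the assumed workgroup size of 256 work-items are hypothetical) of such a pairing:
+the release fence program-ordered before the barrier *signal* forms a ``barrier-signal-release``,
+and the acquire fence program-ordered after the barrier *wait* forms a ``barrier-wait-acquire``,
+so LDS stores performed by other waves of the workgroup before the barrier are visible after it:
+
+.. code-block:: llvm
+
+  @lds.buf = internal addrspace(3) global [256 x i32] undef, align 4
+
+  define amdgpu_kernel void @exchange_through_lds(ptr addrspace(1) %out, i32 %v) {
+  entry:
+    %tid = call i32 @llvm.amdgcn.workitem.id.x()
+    %slot = getelementptr [256 x i32], ptr addrspace(3) @lds.buf, i32 0, i32 %tid
+    store i32 %v, ptr addrspace(3) %slot            ; write this lane's slot
+    fence syncscope("workgroup") release            ; barrier-signal-release
+    call void @llvm.amdgcn.s.barrier.signal(i32 -1) ; signal the workgroup barrier
+    call void @llvm.amdgcn.s.barrier.wait(i16 -1)   ; wait for all waves to signal
+    fence syncscope("workgroup") acquire            ; barrier-wait-acquire
+    %peer = xor i32 %tid, 128                       ; a slot written by another wave
+    %pslot = getelementptr [256 x i32], ptr addrspace(3) @lds.buf, i32 0, i32 %peer
+    %r = load i32, ptr addrspace(3) %pslot          ; observes the peer's store
+    store i32 %r, ptr addrspace(1) %out
+    ret void
+  }
+
+  declare i32 @llvm.amdgcn.workitem.id.x()
+  declare void @llvm.amdgcn.s.barrier.signal(i32 immarg)
+  declare void @llvm.amdgcn.s.barrier.wait(i16 immarg)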
+
 .. _amdgpu-fence-as:
 
 Fence and Address Spaces
@@ -6797,6 +7110,10 @@ only accessed by a single thread, and is always write-before-read, there is
 never a need to invalidate these entries from the L1 cache. Hence all cache
 invalidates are done as ``*_vol`` to only invalidate the volatile cache lines.
 
+A wave waiting on an ``s_barrier`` is unable to handle traps or exceptions;
+thus, an ``s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)`` is required before entering
+the barrier so that no memory exception can occur during the barrier.
+
 The code sequences used to implement the memory model for GFX6-GFX9 are defined
 in table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx6-gfx9-table`.
 

``````````

</details>


https://github.com/llvm/llvm-project/pull/170447


More information about the llvm-commits mailing list