[llvm] [AMDGPU][Doc] GFX12.5 Barrier Execution Model (PR #185632)

Pierre van Houtryve via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 10 05:18:14 PDT 2026


https://github.com/Pierre-vh created https://github.com/llvm/llvm-project/pull/185632

- Document GFX12.5-specific intrinsics.
- Rename signal -> arrive, leave -> drop to match C++ terminology.
- Update execution model to support GFX12.5 semantics (e.g., threads can arrive without waiting)
- Various clean-ups & wording updates on the model.
- Added "mutually exclusive" barrier objects.
- Added barrier-phase-with + related constraints.
- Document that barriers can exist at cluster scope too.
- Update GFX12 target semantics/code sequences to include GFX12.5.

The model is no longer marked as incomplete; it is now just experimental.

More updates are planned to support more features and to address some known
shortcomings of the model. For example, many relations currently encode too
much semantic information, which means the model cannot be built when barriers
aren't used correctly. I'd like the model to eventually represent broken
executions as well, just like a memory model can.
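
As a rough illustration of the renamed state (this is not part of the patch), the counter behavior of *init*, *arrive*, and *drop* described in the updated model can be sketched in Python. The class and method names are hypothetical, the sketch is single-threaded, and it does not model relations such as *barrier-participates-in<BO>*. It raises exceptions where the documentation specifies undefined behavior, and it assumes a phase completes only after at least one *arrive*.

```python
class BarrierObject:
    """Toy, single-threaded model of the barrier object counters (illustrative only)."""

    def __init__(self):
        self.expected_count = None  # None means "not initialized yet"
        self.arrive_count = 0
        self.completed_phases = 0   # hypothetical bookkeeping, not part of the model

    def _require_init(self):
        # The documentation makes this undefined behavior; the sketch raises instead.
        if self.expected_count is None:
            raise RuntimeError("operation on an uninitialized barrier object")

    def _maybe_complete(self):
        # Assumption: completion needs at least one arrive in the current phase.
        if self.arrive_count > 0 and self.arrive_count == self.expected_count:
            self.arrive_count = 0
            self.completed_phases += 1
            return True
        return False

    def init(self, k):
        # init sets the expected count to k and resets the arrive count to zero.
        if k <= 0:
            raise ValueError("expected count must be a positive integer")
        self.expected_count = k
        self.arrive_count = 0

    def arrive(self):
        # arrive increments the arrive count by one.
        self._require_init()
        self.arrive_count += 1
        return self._maybe_complete()

    def drop(self):
        # drop decrements the expected count by one.
        self._require_init()
        if self.expected_count == 0:
            raise RuntimeError("drop would make the expected count negative")
        self.expected_count -= 1
        return self._maybe_complete()


barrier = BarrierObject()
barrier.init(3)
first = barrier.arrive()   # 1 of 3 arrived
second = barrier.arrive()  # 2 of 3 arrived
third = barrier.drop()     # expected count drops to 2, matching the arrive count
```

In this sketch the *drop* is what completes the barrier: it lowers the *expected count* to the current *arrive count*, after which the *arrive count* is reset to zero.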

From 3dd285c07b5b0b92a2d1cd1f1b5ea6115882eeb3 Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Tue, 10 Mar 2026 10:37:43 +0100
Subject: [PATCH] [AMDGPU][Doc] GFX12.5 Barrier Execution Model

- Document GFX12.5-specific intrinsics.
- Rename signal -> arrive, leave -> drop to match C++ terminology.
- Update execution model to support GFX12.5 semantics (e.g., threads can arrive without waiting)
- Various clean-ups & wording updates on the model.
- Added "mutually exclusive" barrier objects.
- Added barrier-phase-with + related constraints.
- Document that barriers can exist at cluster scope too.
- Update GFX12 target semantics/code sequences to include GFX12.5.

The model is no longer marked as incomplete; it is now just experimental.

More updates are planned to support more features and to address some known
shortcomings of the model. For example, many relations currently encode too
much semantic information, which means the model cannot be built when barriers
aren't used correctly. I'd like the model to eventually represent broken
executions as well, just like a memory model can.
---
 llvm/docs/AMDGPUUsage.rst | 535 ++++++++++++++++++++++++--------------
 1 file changed, 342 insertions(+), 193 deletions(-)

diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 036b4461ec06d..9d9c1bc18492f 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1629,10 +1629,6 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
 
                                                    The iglp_opt strategy implementations are subject to change.
 
-  llvm.amdgcn.s.barrier.signal.isfirst             Provides access to the s_barrier_signal_first instruction;
-                                                   additionally ensures that the result value is valid even when the
-                                                   intrinsic is used from a wave that is not running in a workgroup.
-
   llvm.amdgcn.s.getpc                              Provides access to the s_getpc_b64 instruction, but with the return value
                                                    sign-extended from the width of the underlying PC hardware register even on
                                                    processors where the s_getpc_b64 instruction returns a zero-extended value.
@@ -1699,18 +1695,44 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
 
                                                    Instruction cache prefetches are unsafe on invalid address.
 
-  llvm.amdgcn.s.barrier                            Performs a barrier *signal* operation immediately followed
+  llvm.amdgcn.s.barrier                            Performs a barrier *arrive* operation immediately followed
                                                    by a barrier *wait* operation on the *workgroup barrier* object.
                                                    see :ref:`amdgpu-amdhsa-execution-barriers`.
 
-  llvm.amdgcn.s.barrier.signal                     Performs a barrier *signal* operation on the barrier *object* determined by the ``i32`` immediate argument.
+  llvm.amdgcn.s.barrier.init                       Performs a barrier *init* operation on the barrier *object* determined by the first operand.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.5.
+
+  llvm.amdgcn.s.barrier.signal                     Performs a barrier *arrive* operation on the barrier *object* determined by the ``i32`` immediate argument.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.
+
+  llvm.amdgcn.s.barrier.signal.var                 Performs a barrier *arrive* operation on the barrier *object* determined by the first argument.
+                                                   The second argument is an ``i32`` immediate *expected count*. The *expected count* of the barrier *object*
+                                                   is only set when the argument is not zero, and when the barrier *object* is a *named barrier object*.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.
+
+  llvm.amdgcn.s.barrier.signal.isfirst             Performs a barrier *arrive* operation on the barrier *object* determined by the ``i32`` immediate argument.
+                                                   Additionally ensures that the result value is valid even when the intrinsic is used from a wave
+                                                   that is not running in a workgroup.
                                                    See :ref:`amdgpu-amdhsa-execution-barriers`.
                                                    Available starting GFX12.
 
   llvm.amdgcn.s.barrier.wait                       Performs a barrier *wait* operation on the barrier *object* determined by the ``i16`` immediate argument.
+                                                   If waiting on a *named barrier object*, this instruction always waits on the last *named barrier object*
+                                                   that the thread has *joined*, even if it is different from the argument.
                                                    See :ref:`amdgpu-amdhsa-execution-barriers`.
                                                    Available starting GFX12.
 
+  llvm.amdgcn.s.barrier.join                       Performs a barrier *join* operation on the barrier *object* determined by the first operand.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.5.
+
+  llvm.amdgcn.s.barrier.leave                      Performs a barrier *drop* operation.
+                                                   See :ref:`amdgpu-amdhsa-execution-barriers`.
+                                                   Available starting GFX12.5.
+
   llvm.amdgcn.flat.load.monitor                    Available on GFX12.5 only.
                                                    Corresponds to ``flat_load_monitor_b32/64/128`` (``.b32/64/128`` suffixes)
                                                    instructions.
@@ -6767,117 +6789,155 @@ Execution Barriers
 
 .. note::
 
-  This specification is a work-in-progress (see lines annotated with :sup:`WIP`), and is not complete for GFX12.5.
+  The barrier execution model is experimental and subject to change.
 
 Threads can synchronize execution by performing barrier operations on barrier *objects* as described below:
 
 * Each barrier *object* has the following state:
 
-  * An unsigned positive integer *expected count*: counts the number of *signal* operations
+  * An unsigned positive integer *expected count*: counts the number of *arrive* operations
     expected for this barrier *object*.
-  * An unsigned non-negative integer *signal count*: counts the number of *signal* operations
+  * An unsigned non-negative integer *arrive count*: counts the number of *arrive* operations
     already performed on this barrier *object*.
 
-      * The initial value of *signal count* is zero.
-      * When an operation causes *signal count* to be equal to *expected count*, the barrier is completed,
-        and the *signal count* is reset to zero.
+      * The initial value of *arrive count* is zero.
+      * When an operation causes *arrive count* to be equal to *expected count*, the barrier is completed,
+        and the *arrive count* is reset to zero.
 
+* *Barrier-mutually-exclusive* is a symmetric relation between barrier *objects* that represents barrier
+  *objects* that share resources in a way that prevents a thread from using them at the same time.
 * Barrier operations are performed on barrier *objects*. A barrier operation is a dynamic instance
   of one of the following:
 
-  * Barrier *init*.
+  * Barrier *init*
+
+    * Barrier *init* takes an additional unsigned positive integer argument *k*.
+    * Sets the *expected count* of the *barrier object* to *k*.
+    * Resets the *arrive count* of the *barrier object* to zero.
+
   * Barrier *join*.
-  * Barrier *leave*.
+  * Barrier *drop*.
 
     * Decrements *expected count* of the barrier *object* by one.
 
-  * Barrier *signal*.
+  * Barrier *arrive*.
 
-    * Increments *signal count* of the barrier *object* by one.
+    * Increments the *arrive count* of the barrier *object* by one.
+    * Depending on the implementation, *arrive* can also update the *expected count* of the
+      barrier *object* before the *arrive count* is incremented;
+      the new *expected count* can never be less than or equal to the *arrive count*,
+      otherwise the behavior is undefined.
 
   * Barrier *wait*.
 
 * Barrier modification operations are barrier operations that modify the barrier *object* state:
 
   * Barrier *init*.
-  * Barrier *leave*.
-  * Barrier *signal*.
+  * Barrier *drop*.
+  * Barrier *arrive*.
 
-* For a given barrier *object* ``BO``:
+* For a given barrier *object* ``BO``, the following relations exist in any
+  valid program execution:
 
-  * There is exactly one barrier *init* for ``BO``. :sup:`WIP`
   * *Thread-barrier-order<BO>* is the subset of *program-order* that only
     relates barrier operations performed on ``BO``.
-  * Let ``S`` be the set of barrier modification operations on ``BO``, then
-    *barrier-modification-order<BO>* is a strict total order over ``S``. It is the order
-    in which ``BO`` observes barrier operations that change its state.
+  * All barrier modification operations on ``BO`` occur in a strict total order called
+    *barrier-modification-order<BO>*; it is the order in which ``BO`` observes barrier
+    operations that change its state. For any valid *barrier-modification-order<BO>*, the
+    following must be true:
 
-    * *Barrier-modification-order<BO>* is consistent with *happens-before*.
     * Let ``A`` and ``B`` be two barrier modification operations where ``A -> B`` in
-      *thread-barrier-order<BO>*, then ``A -> B`` in *barrier-modification-order<BO>*.
-    * The first element in *barrier-modification-order<BO>* is a barrier *init*.
-      There is only one barrier *init* in *barrier-modification-order<BO>*.
-
-  * *Barrier-joined-before<BO>* is a strict partial order over barrier operations on ``BO``.
-    A barrier *join* ``J`` is *barrier-joined-before<BO>* a barrier operation ``X`` if and only if all
-    of the following is true:
+      *thread-barrier-order<BO>*, then ``A -> B`` is also in *barrier-modification-order<BO>*.
+    * The first element in *barrier-modification-order<BO>* is always a barrier *init*, otherwise
+      the behavior is undefined.
 
-    * ``J -> X`` in *thread-barrier-order<BO>*.
-    * There is no barrier *leave* ``L`` where ``J -> L -> X`` in *thread-barrier-order<BO>*.
-
-  * *Barrier-participates-in<BO>* is a partial order that relates barrier operations to barrier *waits*.
-    A barrier operation ``X`` may *barrier-participates-in<BO>* a barrier *wait* ``W`` if all of the following
-    is true:
+  * *Barrier-participates-in<BO>* relates barrier operations to the barrier *waits* that depend on them
+    to complete. A barrier operation ``X`` may *barrier-participates-in<BO>* a barrier *wait* ``W``
+    if and only if all of the following is true:
 
     * ``X`` and ``W`` are both performed on ``BO``.
-    * ``X`` is a barrier *signal* or *leave* operation.
-    * ``X`` does not *barrier-participates-in<BO>* another barrier *wait* ``W'`` in the same thread as ``W``.
-    * ``W -> X`` **not** in *thread-barrier-order<BO>*.
-
-  * *Barrier-participates-in<BO>* is consistent with *happens-before*.
+    * ``X`` is a barrier *arrive* or *drop* operation.
+    * ``X`` does not already *barrier-participate-in<BO>* a distinct barrier *wait* ``W'`` in the same thread as ``W``.
+    * ``W -> X`` not in *thread-barrier-order<BO>*.
+    * All dependent constraints and relations are satisfied as well. [0]_
 
 * Let ``S`` be the set of barrier operations that *barrier-participate-in<BO>* a barrier *wait* ``W`` for some
-  barrier *object* ``BO``, then all of the following is true:
+  barrier *object* ``BO``, then all of the following is true.
 
-  * ``S`` cannot be empty.
   * The elements of ``S`` all exist in a continuous interval of *barrier-modification-order<BO>*.
-  * Let ``A`` be the first operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* of ``BO``
+  * Let ``A`` be the first operation of ``S`` in *barrier-modification-order<BO>*, then the *arrive count* of ``BO``
     is zero before ``A`` is performed.
-  * Let ``B`` be the last operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* and
+  * Let ``B`` be the last operation of ``S`` in *barrier-modification-order<BO>*, then the *arrive count* and
     *expected count* of ``BO`` are equal after ``B`` is performed. ``B`` is the only barrier operation in ``S``
-    that causes the *signal count* and *expected count* of ``BO`` to be equal.
+    that causes the *arrive count* and *expected count* of ``BO`` to be equal.
+
+* A barrier *join* ``J`` is *barrier-joined-before* a barrier operation ``X`` if and only if all
+  of the following is true:
+
+  * ``J -> X`` in *thread-barrier-order<BO>*.
+  * ``X`` is not a barrier *join*.
+  * There is no barrier *join* or *drop* ``JD`` where ``J -> JD -> X`` in *thread-barrier-order<BO>*.
+  * There is no barrier *join* ``J'`` on a distinct barrier *object* ``BO'`` such that ``J -> J' -> X`` in
+    *program-order*, and ``BO`` *barrier-mutually-exclusive* ``BO'``.
+
+* A barrier operation ``A`` *barrier-executes-before* another barrier operation ``B`` if any of the
+  following is true:
+
+  * ``A -> B`` in *program-order*.
+  * For some barrier *object* ``BO``, ``A -> B`` in *barrier-participates-in<BO>*.
+  * ``A`` *barrier-executes-before* some barrier operation ``X``, and ``X``
+    *barrier-executes-before* ``B``.
 
-* For every barrier *signal* ``S`` performed on a barrier *object* ``BO``:
+* *Barrier-executes-before* is consistent with *barrier-modification-order<BO>*
+  for every barrier object ``BO``.
+* For every barrier *drop* ``D`` performed on a barrier *object* ``BO``:
 
-  * The immediate successor of ``S`` in *thread-barrier-order<BO>* is a barrier *wait*. :sup:`WIP`
+  * There is a barrier *join* ``J`` such that ``J -> D`` in *barrier-joined-before*;
+    otherwise, the behavior is undefined.
+  * ``D`` cannot cause the *expected count* of ``BO`` to become negative; otherwise, the behavior is undefined.
+
+* For every pair of barrier *arrive* ``A`` and barrier *drop* ``D`` performed on a barrier *object*
+  ``BO``, such that ``A -> D`` in *thread-barrier-order<BO>*, one of the following must be true:
+
+  * ``A`` does not *barrier-participate-in<BO>* any barrier *wait*.
+  * ``A`` *barrier-participates-in<BO>* at least one barrier *wait* ``W``
+    such that ``W -> D`` in *barrier-executes-before*.
 
 * For every barrier *wait* ``W`` performed on a barrier *object* ``BO``:
 
-  * There is a barrier *join* ``J`` such that ``J -> W`` in *barrier-joined-before<BO>*. :sup:`WIP`
+  * There is at least one barrier operation that *barrier-participates-in<BO>* ``W``.
+  * There is a barrier *join* ``J`` such that ``J -> W`` in *barrier-joined-before*.
+  * ``J`` must *barrier-execute-before* at least one operation ``X`` that
+    *barrier-participates-in<BO>* ``W``; otherwise, the behavior is undefined.
 
 * For every barrier *join* ``J`` performed on a barrier *object* ``BO``:
 
-  * ``J`` is not *barrier-joined-before<BO>* another barrier *join*.
+  * ``J`` is not *barrier-joined-before* another barrier *join*.
 
-* *Barrier-executes-before* is a strict partial order defined over the union of all barrier operations
-  performed by all threads on all barriers. It is the transitive closure of all the following orders:
+* A barrier operation ``A`` is *barrier-phase-with* another barrier operation ``B`` if and only if:
 
-  * *Thread-barrier-order<BO>* for every barrier object ``BO``.
-  * *Barrier-participates-in<BO>* for every barrier object ``BO``.
+  * There exists some barrier *wait* ``W`` such that both ``A -> W`` and ``B -> W`` in
+    *barrier-participates-in<BO>* for some barrier *object* ``BO``.
 
-* *Barrier-executes-before* is consistent with *program-order*.
-* For every barrier *object* ``BO``:
+* For every barrier operation ``A`` that *barrier-executes-before* another barrier operation ``B``,
+  at least one of the following statements is true; otherwise, the behavior is undefined:
 
-  * *Barrier-modification-order<BO>* is consistent with *barrier-executes-before*.
+  * ``A`` and ``B`` are performed on different barrier *objects*, or
+  * ``A`` *barrier-phase-with* ``B``, or
+  * There is no barrier operation ``X`` such that ``B -> X`` in *barrier-executes-before* and
+    ``A -> X`` in *barrier-phase-with*.
 
-* For every barrier operation ``X`` on a barrier *object* ``BO``,
-  there is a barrier *init* ``I`` on ``BO`` such that ``I`` *barrier-executes-before* ``X``.
+.. note::
 
-  .. note::
+  Barriers only synchronize execution and do not affect the visibility of memory operations between threads.
+  Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
+  to determine how to synchronize memory operations through *barrier-executes-before*.
 
-    Barriers only synchronize execution and do not affect the visibility of memory operations between threads.
-    Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
-    to determine how to synchronize memory operations through *barrier-executes-before*.
+
+.. [0] The definition of *barrier-participates-in<BO>* (in its current state) is non-deterministic and
+       will be improved in the future: Within a valid execution, there may be multiple ways
+       to build *barrier-participates-in<BO>*; however, there is only one way to build it that also satisfies all
+       other relations and constraints that depend on *barrier-participates-in<BO>* and relations derived from it.
 
 Target-Specific Properties
 ++++++++++++++++++++++++++
@@ -6895,7 +6955,7 @@ Barrier operations have the following additional target-specific properties:
 All barrier *objects* have the following additional target-specific properties:
 
 * Barrier *join* does not increment the *expected count* of a barrier *object*. The *expected count* is set
-  during initialization of the barrier by the hardware. :sup:`WIP`
+  during initialization of the barrier by the hardware.
 * Barrier *objects* are allocated and managed by the hardware.
 
   * Barrier *objects* are stored in an unspecified memory region that does not alias with
@@ -6903,42 +6963,15 @@ All barrier *objects* have the following additional target-specific properties:
     memory operation in any other :ref:`address space<amdgpu-address-spaces>`.
 
 * Barrier *objects* exist within a *scope* (see :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`),
-  and can only be accessed by threads in that *scope*.
+  and each instance of a barrier *object* can only be accessed by threads in the *scope* where
+  the instance lives. The following scopes are supported:
+
+  * ``workgroup``.
+  * ``cluster``.
 
 See :ref:`amdgpu-llvm-ir-intrinsics-table` for more information on how to perform barrier operations using
 LLVM IR intrinsic calls, or see the sections below to perform barrier operations using machine code.
 
-.. _amdgpu-amdhsa-execution-barriers-workgroup-barriers:
-
-Workgroup Barrier Operations
-############################
-
-.. note::
-
-  This section only applies to entries of the target code sequence tables below that reference this section.
-
-This section covers properties of barrier operation on *workgroup barrier objects* implemented in AMDGPU
-hardware.
-
-The following barrier operations can never be performed by the shader on *workgroup barrier objects*.
-The hardware will instead perform them automatically under certain conditions.
-
-* Barrier *init*:
-
-  * The hardware automatically initializes *workgroup barrier objects* when a workgroup is launched:
-    The *expected count* of the barrier object is set to the number of waves in the workgroup.
-
-* Barrier *join*:
-
-  * Any thread launched within a workgroup automatically *joins* *workgroup barrier objects*.
-
-* Barrier *leave*
-
-  * When a thread ends, it automatically *leaves* any *workgroup barrier object* it had previously *joined*.
-
-Additionally, no barrier *wait* operation on a *workgroup barrier object* can complete before all waves of
-the workgroup have launched.
-
 Informational Notes
 +++++++++++++++++++
 
@@ -6948,121 +6981,233 @@ Informally, we can deduce from the above formal model that execution barriers be
 * *Barrier-executes-before* relates the dynamic instances of operations from different threads together.
   For example, if ``A -> B`` in *barrier-executes-before*, then the execution of ``A`` must complete
   before the execution of ``B`` can complete.
-* When a barrier *signal* or *leave* causes the *signal count* of a barrier *object* to be identical to the
-  *expected count*, the *signal count* is reset to zero, and threads that have *joined* the barrier *object*
-  will:
 
-  * Wake-up if they were sleeping because of a barrier *wait*, **or**
-  * Skip the next barrier *wait* operation if they have not previously *waited*.
+  * This property can also be combined with *program-order*. For example, let ``X`` and ``Y`` be two
+    (non-barrier) operations where ``X -> A`` and ``B -> Y`` in *program-order*; then the execution
+    of ``X`` completes before the execution of ``Y`` does.
 
 * Barriers do not complete "out-of-thin-air"; a barrier *wait* ``W`` cannot depend on a barrier operation
   ``X`` to complete if ``W -> X`` in *barrier-executes-before*.
-* It is undefined behavior to operate on an uninitialized barrier.
+* It is undefined behavior to operate on an uninitialized barrier object.
 * It is undefined behavior for a barrier *wait* to never complete.
+* It is not mandatory to *drop* a barrier after *joining* it. The operations are not opposites; *drop*
+  affects future barrier operations by decrementing the *expected count* of the barrier *object*, which
+  can only be undone by re-*initializing* the barrier.
+* A thread may not *arrive* at and then *drop* a barrier *object* unless the barrier completes before the
+  barrier *drop*. Incrementing the *arrive count* and decrementing the *expected count* directly
+  after may cause undefined behavior.
+* *Joining* a barrier is only useful if the thread will *wait* on that same barrier *object* later.
 
 Execution Barrier GFX6-11
 +++++++++++++++++++++++++
 
 Targets from GFX6 through GFX11 included do not have the split barrier feature.
-The barrier *signal* and barrier *wait* operations cannot be performed independently.
+The barrier *arrive* and barrier *wait* operations **cannot** be performed independently.
 
 There is only one *workgroup barrier* object of ``workgroup`` scope that is implicitly used
 by all barrier operations.
 
-  .. table:: AMDHSA Execution Barriers Code Sequences GFX6-GFX11
-     :name: amdgpu-amdhsa-execution-barriers-code-sequences-gfx6-gfx11-table
-
-     ===================== ====================== ===========================================================
-     Barrier Operation(s)  Barrier *Object*       AMDGPU Machine Code
-     ===================== ====================== ===========================================================
-     **Init, Join and Leave**
-     --------------------------------------------------------------------------------------------------------
-     *init*                - *Workgroup barrier*  See barrier *init* in
-                                                  :ref:`amdgpu-amdhsa-execution-barriers-workgroup-barriers`.
-
-     *join*                - *Workgroup barrier*  See barrier *join* in
-                                                  :ref:`amdgpu-amdhsa-execution-barriers-workgroup-barriers`.
-
-     *leave*               - *Workgroup barrier*  See barrier *leave* in
-                                                  :ref:`amdgpu-amdhsa-execution-barriers-workgroup-barriers`.
-
-     **Signal and Wait**
-     --------------------------------------------------------------------------------------------------------
-     *signal* then *wait*  - *Workgroup barrier*  | **BackOffBarrier**
-                                                  | ``s_barrier``
-                                                  | **No BackOffBarrier**
-                                                  | ``s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)``
-                                                  | ``s_waitcnt_vscnt null, 0x0``
-                                                  | ``s_barrier``
-
-                                                  - If the target does not have the BackOffBarrier feature,
-                                                    then there cannot be any outstanding memory operations
-                                                    before issuing the ``s_barrier`` instruction.
-                                                  - The waitcnts can independently be moved earlier, or
-                                                    removed entirely as long as the associated
-                                                    counter remains at zero before issuing the
-                                                    ``s_barrier`` instruction.
-
-     *signal*              - *Workgroup barrier*  Not available separately, see *signal* then *wait*
-
-     *wait*                - *Workgroup barrier*  Not available separately, see *signal* then *wait*
-     ===================== ====================== ===========================================================
+The following code sequences can be used to implement the barrier operations described by the above specification:
+
+.. table:: AMDHSA Execution Barriers Code Sequences GFX6-GFX11
+    :name: amdgpu-amdhsa-execution-barriers-code-sequences-gfx6-gfx11-table
+    :widths: 15 15 70
+
+    ===================== ====================== ===========================================================
+    Barrier Operation(s)  Barrier *Object*       AMDGPU Machine Code
+    ===================== ====================== ===========================================================
+    **Init, Join and Drop**
+    --------------------------------------------------------------------------------------------------------
+    *init*                - *Workgroup barrier*  Automatically initialized by the hardware when a workgroup
+                                                 is launched. The *expected count* of this barrier is set
+                                                 to the number of waves in the workgroup.
+
+    *join*                - *Workgroup barrier*  Any thread launched within a workgroup automatically *joins*
+                                                 this barrier *object*.
+
+    *drop*                - *Workgroup barrier*  When a thread ends, it automatically *drops* this barrier
+                                                 *object* if it had previously *joined* it.
+
+    **Arrive and Wait**
+    --------------------------------------------------------------------------------------------------------
+    *arrive* then *wait*  - *Workgroup barrier*  | **BackOffBarrier**
+                                                 | ``s_barrier``
+                                                 | **No BackOffBarrier**
+                                                 | ``s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)``
+                                                 | ``s_waitcnt_vscnt null, 0x0``
+                                                 | ``s_barrier``
+
+                                                 - If the target does not have the BackOffBarrier feature,
+                                                   then there cannot be any outstanding memory operations
+                                                   before issuing the ``s_barrier`` instruction.
+                                                 - The waitcnts can independently be moved earlier, or
+                                                   removed entirely as long as the associated
+                                                   counter remains at zero before issuing the
+                                                   ``s_barrier`` instruction.
+                                                 - The ``s_barrier`` instruction cannot complete
+                                                   before all waves of the workgroup have launched.
+
+    *arrive*              - *Workgroup barrier*  Not available separately, see *arrive* then *wait*
+
+    *wait*                - *Workgroup barrier*  Not available separately, see *arrive* then *wait*
+    ===================== ====================== ===========================================================
 
 Execution Barrier GFX12
 +++++++++++++++++++++++
 
-.. note::
-
-  This is incomplete for GFX12.5.
-
 GFX12 targets have the split-barrier feature, and also offer multiple barrier *objects* per workgroup
-(see :ref:`amdgpu-amdhsa-execution-barriers-ids-gfx12-table`).
-
-  .. table:: AMDHSA Execution Barriers Code Sequences GFX12
-     :name: amdgpu-amdhsa-execution-barriers-code-sequences-gfx12-table
-
-     ===================== =========================== ===========================================================
-     Barrier Operation(s)  Barrier *Object*            AMDGPU Machine Code
-     ===================== =========================== ===========================================================
-     **Init, Join and Leave**
-     -------------------------------------------------------------------------------------------------------------
-     *init*                - *Workgroup barrier*       See barrier *init* in
-                           - *Workgroup trap barrier*  :ref:`amdgpu-amdhsa-execution-barriers-workgroup-barriers`.
-
-     *join*                - *Workgroup barrier*       See barrier *join* in
-                           - *Workgroup trap barrier*  :ref:`amdgpu-amdhsa-execution-barriers-workgroup-barriers`.
-
-     *leave*               - *Workgroup barrier*       See barrier *leave* in
-                           - *Workgroup trap barrier*  :ref:`amdgpu-amdhsa-execution-barriers-workgroup-barriers`.
-
-     **Signal and Wait**
-     -------------------------------------------------------------------------------------------------------------
-
-     *signal*              - *Workgroup barrier*       | ``s_barrier_signal -1``
-                                                       | Or
-                                                       | ``s_barrier_signal_isfirst -1``
-
+(see :ref:`amdgpu-amdhsa-execution-barriers-ids-gfx12-table`). Each barrier *object* has a unique barrier ID that
+instructions use to operate on it.
 
-     *wait*                - *Workgroup barrier*       ``s_barrier_wait -1``.
+GFX12.5 additionally introduces new barrier *objects* that offer more flexibility for synchronizing the execution
+of a subset of waves of a workgroup, or synchronizing execution across workgroups within a workgroup cluster.
 
-     *signal*              - *Workgroup trap barrier*  Not available to the shader.
-
-     *wait*                - *Workgroup trap barrier*  Not available to the shader.
-     ===================== =========================== ===========================================================
-
-
-  .. table:: AMDHSA Execution Barriers IDs GFX12
-     :name: amdgpu-amdhsa-execution-barriers-ids-gfx12-table
-
-     =========== ============== ==============================================================
-     Barrier ID  Scope          Description
-     =========== ============== ==============================================================
-     ``-2``      ``workgroup``  *Workgroup trap barrier*, dedicated for the trap handler and
-                                only available in privileged execution mode
-                                (not accessible by the shader).
+.. note::
 
-     ``-1``      ``workgroup``  *Workgroup barrier*.
-     =========== ============== ==============================================================
+  Check :ref:`the table below<amdgpu-amdhsa-execution-barriers-ids-gfx12-table>` to determine which barrier IDs are
+  available to the shader on a given target.
+
+
+The following code sequences can be used to implement the barrier operations described by the above specification:
+
+.. table:: AMDHSA Execution Barriers Code Sequences GFX12
+    :name: amdgpu-amdhsa-execution-barriers-code-sequences-gfx12-table
+    :widths: 15 15 70
+
+    ===================== =========================== ===========================================================
+    Barrier Operation(s)  Barrier ID                  AMDGPU Machine Code
+    ===================== =========================== ===========================================================
+    **Init, Join and Drop**
+    -------------------------------------------------------------------------------------------------------------
+    *init*                - ``-2``, ``-1``            Automatically initialized by the hardware when a workgroup
+                                                      is launched. The *expected count* of this barrier is set
+                                                      to the number of waves in the workgroup.
+
+    *init*                - ``-4``, ``-3``            Automatically initialized by the hardware when a workgroup
+                                                      is launched as part of a workgroup cluster.
+                                                      The *expected count* of this barrier is set to the number
+                                                      of workgroups in the workgroup cluster.
+
+    *init*                - ``0``                     Automatically initialized by the hardware and always
+                                                      available. This barrier *object* is opaque and immutable
+                                                      as all operations other than barrier *join* are no-ops.
+
+    *init*                - ``[1, 16]``               | ``s_barrier_init <N>``
+
+                                                      - ``<N>`` is an immediate constant, or stored in the lower
+                                                        half of ``m0``.
+                                                      - The value to set as the *expected count* of the barrier
+                                                        is stored in the upper half of ``m0``.
+
+    *join*                - ``-2``, ``-1``            Any thread launched within a workgroup automatically *joins*
+                                                      this barrier *object*.
+
+    *join*                - ``-4``, ``-3``            Any thread launched within a workgroup cluster
+                                                      automatically *joins* this barrier *object*.
+
+    *join*                - ``0``                     | ``s_barrier_join <N>``
+                          - ``[1, 16]``
+                                                      - ``<N>`` is an immediate constant, or stored in the lower
+                                                        half of ``m0``.
+
+    *drop*                - ``0``                     | ``s_barrier_leave``
+                          - ``[1, 16]``
+                                                      - ``s_barrier_leave`` takes no operand. It can only be used
+                                                        to *drop* a barrier *object* ``BO`` if ``BO`` was
+                                                        previously *joined* using ``s_barrier_join``.
+                                                      - *Drops* the barrier *object* ``BO`` if and only if
+                                                        there is a barrier *join* ``J`` such that ``J`` is
+                                                        *barrier-joined-before* this barrier
+                                                        *drop* operation.
+
+    *drop*                - ``-2``, ``-1``            When a thread ends, it automatically *drops* this barrier
+                          - ``-4``, ``-3``            *object* if it had previously *joined* it.
+
+    **Arrive and Wait**
+    -------------------------------------------------------------------------------------------------------------
+
+    *arrive*              - ``-4``, ``-3``            | ``s_barrier_signal <N>``
+                          - ``-2``, ``-1``            | Or
+                          - ``0``                     | ``s_barrier_signal_isfirst <N>``
+                          - ``[1, 16]``
+                                                      - ``<N>`` is an immediate constant, or stored in bits ``[4:0]`` of ``m0``.
+                                                      - The ``_isfirst`` variant sets ``SCC=1`` if this wave is the first
+                                                        to signal the barrier, otherwise ``SCC=0``.
+                                                      - For barrier *objects* ``[1, 16]``: When using ``m0`` as an operand,
+                                                        if there is a non-zero value contained in the bits ``[22:16]`` of ``m0``,
+                                                        the *expected count* of the barrier *object* is set to that value before
+                                                        the *arrive count* of the barrier *object* is incremented.
+                                                        The new *expected count* value must be greater than or equal to the old
+                                                        value, otherwise the behavior is undefined.
+                                                      - For barrier *objects* ``-4`` and ``-3``
+                                                        (``cluster`` barriers): only one wave
+                                                        per workgroup may arrive at the barrier on behalf of
+                                                        its entire workgroup. However, any wave within the workgroup
+                                                        cluster can then *wait* on this barrier *object*.
+                                                      - This is a no-op on the *NULL named barrier object*
+                                                        (barrier *object* ``0``).
+
+    *wait*                - ``-4``, ``-3``            ``s_barrier_wait <N>``
+                          - ``-2``, ``-1``
+                          - ``0``                     - ``<N>`` is an immediate constant.
+                          - ``[1, 16]``               - For barrier *objects* ``-2`` and ``-1``: This instruction
+                                                        cannot complete before all waves of the
+                                                        workgroup have launched.
+                                                      - For barrier *objects* ``-4`` and ``-3`` (``cluster`` barriers):
+                                                        This instruction cannot complete before all waves of the
+                                                        workgroup cluster have launched.
+                                                      - This is a no-op on the *NULL named barrier object*
+                                                        (barrier *object* ``0``).
+                                                      - For *named barrier objects*, this instruction always waits on the
+                                                        last *named barrier object* that the thread has *joined*, even
+                                                        if it is different from the *barrier object* passed to the
+                                                        instruction.
+    ===================== =========================== ===========================================================
+
+
+The following barrier IDs are available:
+
+.. table:: AMDHSA Execution Barriers IDs GFX12
+    :name: amdgpu-amdhsa-execution-barriers-ids-gfx12-table
+    :widths: 15 15 15 55
+
+    =============== ============== ============ ==============================================================
+    Barrier ID      Scope          Availability Description
+    =============== ============== ============ ==============================================================
+    ``-4``          ``cluster``    GFX12.5      *Cluster trap barrier*; *cluster barrier object* for use by
+                                                all workgroups of a workgroup cluster. Dedicated for the trap
+                                                handler and only available in privileged execution mode
+                                                (not accessible by the shader).
+
+    ``-3``          ``cluster``    GFX12.5      *Cluster user barrier*; *cluster barrier object* for use by
+                                                all workgroups of a workgroup cluster.
+
+    ``-2``          ``workgroup``  GFX12 (all)  *Workgroup trap barrier*, dedicated for the trap handler and
+                                                only available in privileged execution mode
+                                                (not accessible by the shader).
+
+    ``-1``          ``workgroup``  GFX12 (all)  *Workgroup barrier*.
+
+    ``0``           ``workgroup``  GFX12.5      *NULL named barrier object*. *Barrier-mutually-exclusive* with
+                                                barriers ``[1, 16]``.
+
+    ``[1, 16]``     ``workgroup``  GFX12.5      *Named barrier object*. All barrier *objects* in this range are
+                                                *barrier-mutually-exclusive* with barriers ``[0, 16]``.
+    =============== ============== ============ ==============================================================
+
+
+
+Informally, we can note that:
+
+* All operations other than *join* on the *NULL named barrier object* are no-ops.
+
+  * As the *NULL named barrier object* (barrier ID ``0``) is *barrier-mutually-exclusive* with all other
+    *named barrier objects* (barrier IDs ``[1, 16]``), a thread can use a *join* on the *NULL*
+    barrier as a way to "unjoin" a *named barrier* (break *barrier-joined-before*) without
+    having to use a *drop* operation.
+
+* When a thread ends, it does not implicitly *drop* any *named barrier objects*
+  (barrier IDs ``[0, 16]``) it has *joined*.
 
 .. _amdgpu-amdhsa-memory-model:
 
@@ -7206,8 +7351,8 @@ Execution Barriers
   See :ref:`amdgpu-amdhsa-execution-barriers` for definitions of the terminology used
   in this section.
 
-* A barrier *signal* operation ``S`` can pair with a release fence program-ordered before it
-  to form a ``barrier-signal-release`` ``BR``. The synchronization scope and the set of address
+* A barrier *arrive* operation ``A`` can pair with a release fence program-ordered before it
+  to form a ``barrier-arrive-release`` ``BR``. The synchronization scope and the set of address
   spaces affected are determined by the release fence.
 * A barrier *wait* operation ``W`` can pair with an acquire fence program-ordered after it to
   form a ``barrier-wait-acquire`` ``BA``. The synchronization scope and the set of address
@@ -7215,12 +7360,16 @@ Execution Barriers
 
 A ``BR`` *synchronizes-with* ``BA`` in an address space *AS* if and only if:
 
-* ``S`` *barrier-executes-before* ``W``.
+* ``A`` *barrier-executes-before* ``W``.
 * *BA* and *BR*'s :ref:`synchronization scope<amdgpu-memory-scopes>` overlap.
 * *BA* and *BR*'s :ref:`synchronization scope<amdgpu-memory-scopes>`
   allow cross address space synchronization (they cannot have ``one-as``) :sup:`1`.
 * *BA* and *BR*'s address spaces both include *AS*.
 
+Informally, we can deduce from the above rules that:
+
+* *Happens-before* is consistent with *barrier-executes-before*.
+
 :sup:`1`: This is a requirement due to how current hardware implements barrier operations.
 This limitation may be lifted in the future.
 
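Illustrative note on the m0 operand layouts quoted in the patch: the
s_barrier_init and s_barrier_signal rows of the GFX12 code sequence table
describe where the barrier ID and the *expected count* live inside m0. The
Python sketch below only restates those bit positions. The helper names are
hypothetical, the 16-bit width of each "half" assumes the usual 32-bit m0, and
negative barrier IDs are rejected because the quoted text does not say how
they would be encoded in m0.

```python
def pack_m0_barrier_init(barrier_id: int, expected_count: int) -> int:
    """Hypothetical helper: m0 value for `s_barrier_init` with an m0 operand.

    Per the table: barrier ID in the lower half of m0, expected count in the
    upper half. Assumes a 32-bit m0, so each half is 16 bits wide.
    """
    if not 1 <= barrier_id <= 16:
        raise ValueError("s_barrier_init applies to named barriers 1..16")
    if not 0 <= expected_count <= 0xFFFF:
        raise ValueError("expected count must fit in the upper half of m0")
    return (expected_count << 16) | barrier_id


def pack_m0_barrier_signal(barrier_id: int, new_expected_count: int = 0) -> int:
    """Hypothetical helper: m0 value for `s_barrier_signal` with an m0 operand.

    Per the table: barrier ID in bits [4:0]; a non-zero value in bits [22:16]
    updates the expected count of a named barrier (IDs 1..16) before the
    arrive count is incremented. Zero means "leave the expected count alone".
    """
    if not 0 <= barrier_id <= 16:
        # Encoding of negative IDs in m0 is not given in the quoted text.
        raise ValueError("only barrier IDs 0..16 are handled by this sketch")
    if not 0 <= new_expected_count <= 0x7F:
        raise ValueError("expected count must fit in bits [22:16]")
    if barrier_id == 0 and new_expected_count != 0:
        raise ValueError("expected count update only applies to barriers 1..16")
    return (new_expected_count << 16) | (barrier_id & 0x1F)
```

The caller remains responsible for the rule stated in the table that a new
*expected count* must be greater than or equal to the old one; the sketch does
not track barrier state and therefore cannot check it.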


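Illustrative note on the *synchronizes-with* rule in the last hunk: the four
bullet conditions can be read as a plain conjunction. The Python sketch below
encodes them as a predicate. All names are hypothetical, and the
*barrier-executes-before* relation and scope overlap are passed in as
precomputed booleans because deriving them requires the full execution model,
which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FencedBarrierOp:
    """Hypothetical record for a barrier-arrive-release (BR) or a
    barrier-wait-acquire (BA): the properties of the paired fence."""
    one_as: bool                 # True if the fence's scope is a one-as scope
    address_spaces: frozenset    # address spaces affected by the fence


def synchronizes_with(br: FencedBarrierOp,
                      ba: FencedBarrierOp,
                      address_space: str,
                      arrive_executes_before_wait: bool,
                      scopes_overlap: bool) -> bool:
    """Return True if BR synchronizes-with BA in `address_space`,
    following the four conditions listed in the patch."""
    return (
        arrive_executes_before_wait               # A barrier-executes-before W
        and scopes_overlap                        # BA and BR scopes overlap
        and not br.one_as and not ba.one_as       # cross address space sync allowed
        and address_space in br.address_spaces    # both address space sets
        and address_space in ba.address_spaces    # include AS
    )
```

For example, a BR covering {"global", "local"} and a BA covering only
{"global"} can synchronize in "global" but not in "local", and neither can
synchronize at all if either fence uses a one-as scope.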
