[llvm] [Doc][AMDGPU] Add barrier execution & memory model (PR #170447)

Fri Dec 5 12:50:39 PST 2025

================
@@ -6553,6 +6567,297 @@ The Private Segment Buffer is always requested, but the Private Segment
 Wavefront Offset is only requested if it is used (see
 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 
+.. _amdgpu-amdhsa-execution-barriers:
+
+Execution Barriers
+~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+  This specification is a work-in-progress (see lines annotated with :sup:`WIP`), and is not complete for GFX12.5.
+
+Threads can synchronize execution by performing barrier operations on barrier *objects* as described below:
+
+* Barrier *objects* have the following state:
+
+  * An unsigned non-zero positive integer *expected count*: counts the number of *signal* operations
+    expected for this barrier *object*.
+  * An unsigned positive integer *signal count*: counts the number of *signal* operations
+    already performed on this barrier *object*.
+
+      * The initial value of *signal count* is zero.
+      * When an operation causes *signal count* to be equal to *expected count*, the barrier is completed,
+        and the *signal count* is reset to zero.
+
+* Barrier operations are performed on barrier *objects*. A barrier operation is a dynamic instance
+  of one of the following:
+
+  * Barrier *init*.
+  * Barrier *join*.
+  * Barrier *leave*: decrements *expected count* of the barrier *object* by one.
+  * Barrier *signal*: increments *signal count* of the barrier *object* by one.
+  * Barrier *wait*.
+
+* Barrier modification operations are barrier operations that modify the barrier *object* state:
+
+  * Barrier *init*.
+  * Barrier *leave*.
+  * Barrier *signal*.
+
+* For a given barrier *object* ``BO``:
+
+  * There is exactly one barrier *init* for ``BO``. :sup:`WIP`
+  * *Thread-barrier-order<BO>* is the subset of *program-order* that only
+    relates barrier operations performed on ``BO``.
+  * Let ``S`` be the set of barrier modification operations on ``BO``, then
+    *barrier-modification-order<BO>* is a strict total order over ``S``. It is the order
+    in which ``BO`` observes barrier operations that change its state.
+
+    * *Barrier-modification-order<BO>* is consistent with *happens-before*.
+    * The first element in *barrier-modification-order<BO>* is a barrier *init*.
+      There is only one barrier *init* in *barrier-modification-order<BO>*
+
+  * *Barrier-joined-before<BO>* is a strict partial order over barrier operations on ``BO``.
+    A barrier *join* ``J`` is *barrier-joined-before<BO>* a barrier operation ``X`` if and only if all
+    of the following is true:
+
+    * ``J -> X`` in *thread-barrier-order<BO>*.
+    * There is no barrier *leave* ``L`` where ``J -> L -> X`` in *thread-barrier-order<BO>*.
+
+  * *Barrier-participates-in<BO>* is a partial order that relates barrier operations to barrier *waits*.
+    A barrier operation ``X`` may *barrier-participates-in<BO>* a barrier *wait* ``W`` if all of the following
+    is true:
+
+    * ``X`` and ``W`` are both performed on ``BO``.
+    * ``X`` is a barrier *signal* or *leave* operation.
+    * ``X`` does not *barrier-participates-in<BO>* another barrier *wait* ``W'`` in the same thread as ``W``.
+    * ``W -> X`` **not** in *thread-barrier-order<BO>*.
+
+* Let ``S`` be the set of barrier operations that *barrier-participates-in<BO>* a barrier *wait* ``W`` for some
+  barrier *object* ``BO``, then all of the following is true:
+
+  * ``S`` cannot be empty. :sup:`WIP`
+  * The elements of ``S`` are all ordered by a continuous interval of *barrier-modification-order<BO>*.
+  * Let ``A`` be the first operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* of ``BO``
+    is zero before ``A`` is performed.
+  * Let ``B`` be the last operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* and
+    *expected count* of ``BO`` are equal after ``B`` is performed. ``B`` is the only barrier operation in ``S``
+    that causes the *signal count* and *expected count* of ``BO`` to be equal.
+
+* For every barrier *signal* ``S`` performed on a barrier *object* ``BO``:
+
+  * The immediate successor of ``S`` in *thread-barrier-order<BO>* is a barrier *wait*. :sup:`WIP`
+
+* For every barrier *wait* ``W`` performed on a barrier *object* ``BO``:
+
+  * There is a barrier *join* ``J`` such that ``J -> W`` in *barrier-joined-before<BO>*. :sup:`WIP`
+
+* For every barrier *join* ``J`` performed on a barrier *object* ``BO``:
+
+  * There is no other barrier operation *thread-barrier-ordered<BO>* before ``J``. :sup:`WIP`
+  * ``J`` is not *barrier-joined-before<BO>* another barrier *join*.
+
+* For every barrier *leave* ``L`` performed on a barrier *object* ``BO``:
+
+  * There is no other barrier operation *thread-barrier-ordered<BO>* after ``L``. :sup:`WIP`
+
+* *Barrier-executes-before* is a strict partial order of all barrier operations. It is the transitive closure of all
+  the following orders:
+
+  * *Thread-barrier-order<BO>* for every barrier object ``BO``.
+  * *Barrier-participates-in<BO>* for every barrier object ``BO``.
+
+* *Barrier-executes-before* is consistent with *program-order*.
+
+*Barrier-executes-before* represents the order in which barrier operations will complete by relating operations
+from different threads together.
+For example, if ``A -> B`` in *barrier-executes-before*, then the execution of ``A`` must complete
+before the execution of ``B`` can complete.
+
+When a barrier *signal* ``S`` *barrier-executes-before* a barrier *wait* ``W``, ``S`` executes before ``W``
+**as-if** ``S`` is *program-ordered* before ``W``. Thus, all *dynamic instances* *program-ordered* before ``S``
+are known to have been executed before the *dynamic instances* *program-ordered* after ``W``.
+
+  .. note::
+
+    Barriers only synchronize execution, not memory: ``S -> W`` in *barrier-executes-before* does not imply
+    ``S`` *happens-before* ``W``. Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
+    to also synchronize memory.
+
+Target-Specific Properties
+++++++++++++++++++++++++++
+
+This section covers properties of barrier operation and *objects* that are specific to the implementation of
+barriers in AMDGPU hardware.
+
+Barrier operations have the following additional target-specific properties:
+
+* Barrier operations are convergent within a wave. All threads of a wavefront use the same barrier *object* when
+  performing any barrier operation.
+
+All barrier *objects* have the following additional target-specific properties:
+
+* Barrier *join* does not increment the *expected count* of a barrier *object*. The *expected count* is set
+  during initialization of the barrier by the hardware. :sup:`WIP`
+* Barrier *objects* are allocated and managed by the hardware.
+
+  * Barrier *objects* are stored in an inaccessible memory location.
+
+* Barrier *objects* exist within a *scope* (see :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`),
+  and can only be accessed by threads in that *scope*.
+
+See :ref:`amdgpu-llvm-ir-intrinsics-table` for more information on how to perform barrier operations using
+LLVM IR intrinsic calls, or see the sections below to perform barrier operations using machine code.
+
+.. _amdgpu-amdhsa-execution-barriers-workgroup-barriers:
+
+Workgroup Barrier Operations
+############################
+
+.. note::
+
+  This section only applies when it is referenced by one of the target code sequence table below.
+
+This section covers properties of barrier operation on *workgroup barrier objects* implemented in AMDGPU
+hardware.
+
+The following barrier operations can never be performed by the shader on *workgroup barrier objects*.
+The hardware will instead perform them automatically under certain conditions.
+
+* Barrier *init*:
+
+  * The hardware automatically initializes *workgroup barrier objects* when a workgroup is launched:
+    The *expected count* of the barrier object is set to the number of waves in the workgroup
+  * The number of waves in the workgroup is not known until all waves of the workgroup have launched.
+    Thus, the *expected count* of *workgrou barrier object* can never be equal to its *signal count*
+    until all wavefronts of the workgroup launched.
----------------
nhaehnle wrote:

Typo: workgrou

I'd remove the first sentence because (1) it is too much of a hardware detail and (2) it kind of contradicts what the previous bullet says.

And is the second sentence really true? The question is whether one wave can signal the barrier twice in succession.

On the one hand, I'd say yes because of the rule that:

> * For every barrier *signal* ``S`` performed on a barrier *object* ``BO``:
>
>    * The immediate successor of ``S`` in *thread-barrier-order<BO>* is a barrier *wait*. :sup:`WIP`

On the other hand, what if barrier operations are performed in non-uniform control flow? I don't know or recall whether that came up in internal discussions. Having thought about this stuff quite extensively in the past, one approach is to declare that:

* barrier operations can only appear in wave-uniform control flow, and
* there is only *one* operation per wave and not one per thread. (This is also closer to the ground truth of what happens in the hardware)

As a consequence, we'd have wave-barrier-order<BO> instead of thread-barrier-order<BO>.

On the other hand, there is then the difficulty of tying the wave-barrier-order back to the memory model orders that are specified in a per-thread manner. And there is also the question of what happens to the generic wording if we ever do add a per-thread barrier.

How about calling it an actor(?)-barrier-order in the generic wording? The per-thread barrier operations only really come into play when linking this into the memory model.

https://github.com/llvm/llvm-project/pull/170447