[llvm] [Doc][AMDGPU] Add barrier execution & memory model (PR #170447)

Pierre van Houtryve via llvm-commits llvm-commits at lists.llvm.org
Tue Dec 9 01:51:28 PST 2025


================
@@ -6553,6 +6567,297 @@ The Private Segment Buffer is always requested, but the Private Segment
 Wavefront Offset is only requested if it is used (see
 :ref:`amdgpu-amdhsa-initial-kernel-execution-state`).
 
+.. _amdgpu-amdhsa-execution-barriers:
+
+Execution Barriers
+~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+  This specification is a work-in-progress (see lines annotated with :sup:`WIP`), and is not complete for GFX12.5.
+
+Threads can synchronize execution by performing barrier operations on barrier *objects* as described below:
+
+* Barrier *objects* have the following state:
+
+  * A non-zero unsigned integer *expected count*: counts the number of *signal* operations
+    expected for this barrier *object*.
+  * An unsigned integer *signal count*: counts the number of *signal* operations
+    already performed on this barrier *object*.
+
+    * The initial value of *signal count* is zero.
+    * When an operation causes *signal count* to become equal to *expected count*, the barrier completes
+      and the *signal count* is reset to zero.
+
+* Barrier operations are performed on barrier *objects*. A barrier operation is a dynamic instance
+  of one of the following:
+
+  * Barrier *init*.
+  * Barrier *join*.
+  * Barrier *leave*: decrements *expected count* of the barrier *object* by one.
+  * Barrier *signal*: increments *signal count* of the barrier *object* by one.
+  * Barrier *wait*.
+
+* Barrier modification operations are barrier operations that modify the barrier *object* state:
+
+  * Barrier *init*.
+  * Barrier *leave*.
+  * Barrier *signal*.
+
+* For a given barrier *object* ``BO``:
+
+  * There is exactly one barrier *init* for ``BO``. :sup:`WIP`
+  * *Thread-barrier-order<BO>* is the subset of *program-order* that only
+    relates barrier operations performed on ``BO``.
+  * Let ``S`` be the set of barrier modification operations on ``BO``, then
+    *barrier-modification-order<BO>* is a strict total order over ``S``. It is the order
+    in which ``BO`` observes barrier operations that change its state.
+
+    * *Barrier-modification-order<BO>* is consistent with *happens-before*.
+    * The first element in *barrier-modification-order<BO>* is a barrier *init*.
+      There is only one barrier *init* in *barrier-modification-order<BO>*.
+
+  * *Barrier-joined-before<BO>* is a strict partial order over barrier operations on ``BO``.
+    A barrier *join* ``J`` is *barrier-joined-before<BO>* a barrier operation ``X`` if and only if all
+    of the following are true:
+
+    * ``J -> X`` in *thread-barrier-order<BO>*.
+    * There is no barrier *leave* ``L`` where ``J -> L -> X`` in *thread-barrier-order<BO>*.
+
+  * *Barrier-participates-in<BO>* is a partial order that relates barrier operations to barrier *waits*.
+    A barrier operation ``X`` may *barrier-participates-in<BO>* a barrier *wait* ``W`` only if all of the
+    following are true:
+
+    * ``X`` and ``W`` are both performed on ``BO``.
+    * ``X`` is a barrier *signal* or *leave* operation.
+    * ``X`` does not *barrier-participates-in<BO>* another barrier *wait* ``W'`` in the same thread as ``W``.
+    * ``W -> X`` **not** in *thread-barrier-order<BO>*.
+
+* Let ``S`` be the set of barrier operations that *barrier-participates-in<BO>* a barrier *wait* ``W`` for some
+  barrier *object* ``BO``, then all of the following are true:
+
+  * ``S`` cannot be empty. :sup:`WIP`
+  * The elements of ``S`` form a contiguous interval of *barrier-modification-order<BO>*.
+  * Let ``A`` be the first operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* of ``BO``
+    is zero before ``A`` is performed.
+  * Let ``B`` be the last operation of ``S`` in *barrier-modification-order<BO>*, then the *signal count* and
+    *expected count* of ``BO`` are equal after ``B`` is performed. ``B`` is the only barrier operation in ``S``
+    that causes the *signal count* and *expected count* of ``BO`` to be equal.
+
+* For every barrier *signal* ``S`` performed on a barrier *object* ``BO``:
+
+  * The immediate successor of ``S`` in *thread-barrier-order<BO>* is a barrier *wait*. :sup:`WIP`
+
+* For every barrier *wait* ``W`` performed on a barrier *object* ``BO``:
+
+  * There is a barrier *join* ``J`` such that ``J -> W`` in *barrier-joined-before<BO>*. :sup:`WIP`
+
+* For every barrier *join* ``J`` performed on a barrier *object* ``BO``:
+
+  * There is no other barrier operation *thread-barrier-ordered<BO>* before ``J``. :sup:`WIP`
+  * ``J`` is not *barrier-joined-before<BO>* another barrier *join*.
+
+* For every barrier *leave* ``L`` performed on a barrier *object* ``BO``:
+
+  * There is no other barrier operation *thread-barrier-ordered<BO>* after ``L``. :sup:`WIP`
+
+* *Barrier-executes-before* is a strict partial order of all barrier operations. It is the transitive closure of all
+  the following orders:
+
+  * *Thread-barrier-order<BO>* for every barrier object ``BO``.
+  * *Barrier-participates-in<BO>* for every barrier object ``BO``.
+
+* *Barrier-executes-before* is consistent with *program-order*.
+
+*Barrier-executes-before* represents the order in which barrier operations complete by relating operations
+from different threads.
+For example, if ``A -> B`` in *barrier-executes-before*, then the execution of ``A`` must complete
+before the execution of ``B`` can complete.
+
+When a barrier *signal* ``S`` *barrier-executes-before* a barrier *wait* ``W``, ``S`` executes before ``W``
+**as-if** ``S`` is *program-ordered* before ``W``. Thus, all *dynamic instances* *program-ordered* before ``S``
+are known to have been executed before the *dynamic instances* *program-ordered* after ``W``.
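As a loose illustration of the rules above (not part of the specification), the barrier *object* state machine can be sketched in Python. The ``BarrierObject`` class, its method names, and the ``token`` returned by ``signal`` are inventions of this sketch (the token only serves to let a thread wait on the phase its own *signal* contributed to); they do not correspond to any hardware or LLVM interface:

```python
import threading

class BarrierObject:
    """Sketch of a barrier *object*: an *expected count* and a
    *signal count* that resets to zero when the barrier completes."""

    def __init__(self, expected_count):    # barrier *init*
        assert expected_count > 0          # *expected count* is non-zero
        self._cv = threading.Condition()
        self._expected = expected_count
        self._signals = 0                  # initial *signal count* is zero
        self._phase = 0                    # number of completed barriers

    def signal(self):                      # barrier *signal*
        with self._cv:
            token = self._phase
            self._signals += 1
            if self._signals == self._expected:
                self._signals = 0          # completion resets *signal count*
                self._phase += 1
                self._cv.notify_all()
            return token

    def wait(self, token):                 # barrier *wait*
        with self._cv:
            # Block until the phase this thread signaled in has completed.
            self._cv.wait_for(lambda: self._phase > token)

results = []
bar = BarrierObject(expected_count=4)

def worker(i):
    token = bar.signal()   # every *signal* is immediately followed by a *wait*
    bar.wait(token)
    results.append(i)      # only reached once all four threads have signaled

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(sorted(results))     # [0, 1, 2, 3]
```

Every thread's append runs only after the fourth *signal* completed the barrier, so all four indices are present regardless of scheduling.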
+
+.. note::
+
+  Barriers only synchronize execution, not memory: ``S -> W`` in *barrier-executes-before* does not imply
+  ``S`` *happens-before* ``W``. Refer to the :ref:`execution barriers memory model<amdgpu-amdhsa-execution-barriers-memory-model>`
+  to also synchronize memory.
+
+Target-Specific Properties
+++++++++++++++++++++++++++
+
+This section covers properties of barrier operations and *objects* that are specific to the implementation of
+barriers in AMDGPU hardware.
+
+Barrier operations have the following additional target-specific properties:
+
+* Barrier operations are convergent within a wave. All threads of a wavefront use the same barrier *object* when
+  performing any barrier operation.
+
+All barrier *objects* have the following additional target-specific properties:
+
+* Barrier *join* does not increment the *expected count* of a barrier *object*. The *expected count* is set
+  during initialization of the barrier by the hardware. :sup:`WIP`
+* Barrier *objects* are allocated and managed by the hardware.
+
+  * Barrier *objects* are stored in an inaccessible memory location.
+
+* Barrier *objects* exist within a *scope* (see :ref:`amdgpu-amdhsa-llvm-sync-scopes-table`),
+  and can only be accessed by threads in that *scope*.
+
+See :ref:`amdgpu-llvm-ir-intrinsics-table` for more information on how to perform barrier operations using
+LLVM IR intrinsic calls, or see the sections below to perform barrier operations using machine code.
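The interaction between *leave* and barrier completion can also be sketched in Python (purely illustrative; the class and method names are inventions of this sketch, not a hardware or LLVM API). A *leave* decrements the *expected count* and, being a barrier modification operation, may itself be the operation that makes the *signal count* equal the *expected count*:

```python
import threading

class BarrierObject:
    """Sketch of *leave* semantics: leaving decrements the
    *expected count* and may itself complete the barrier."""

    def __init__(self, expected_count):
        self._cv = threading.Condition()
        self._expected = expected_count
        self._signals = 0
        self._phase = 0

    def _maybe_complete(self):
        if self._expected > 0 and self._signals == self._expected:
            self._signals = 0      # completion resets *signal count*
            self._phase += 1
            self._cv.notify_all()

    def signal(self):              # barrier *signal*
        with self._cv:
            token = self._phase
            self._signals += 1
            self._maybe_complete()
            return token

    def leave(self):               # barrier *leave*
        with self._cv:
            self._expected -= 1
            # A modification operation: the *leave* itself may be the
            # operation that makes the counts equal.
            self._maybe_complete()

    def wait(self, token):         # barrier *wait*
        with self._cv:
            self._cv.wait_for(lambda: self._phase > token)

# Three participants initially; one leaves instead of signaling, so the
# barrier completes once the remaining two have signaled.
bar = BarrierObject(expected_count=3)
done = []

def participant(i):
    token = bar.signal()
    bar.wait(token)
    done.append(i)

t1 = threading.Thread(target=participant, args=(1,))
t2 = threading.Thread(target=participant, args=(2,))
t1.start(); t2.start()
bar.leave()                        # main thread leaves the barrier
t1.join(); t2.join()
print(sorted(done))                # [1, 2]
```

Whichever of the two *signals* and the *leave* is observed last in *barrier-modification-order* completes the barrier, so the two remaining participants always wake.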
+
+.. _amdgpu-amdhsa-execution-barriers-workgroup-barriers:
+
+Workgroup Barrier Operations
+############################
+
+.. note::
+
+  This section only applies when it is referenced by one of the target code sequence tables below.
+
+This section covers properties of barrier operations on *workgroup barrier objects* implemented in AMDGPU
+hardware.
+
+The following barrier operations can never be performed by the shader on *workgroup barrier objects*.
+The hardware will instead perform them automatically under certain conditions.
+
+* Barrier *init*:
+
+  * The hardware automatically initializes *workgroup barrier objects* when a workgroup is launched:
+    the *expected count* of the barrier object is set to the number of waves in the workgroup.
+  * The number of waves in the workgroup is not known until all waves of the workgroup have launched.
+    Thus, the *expected count* of a *workgroup barrier object* can never be equal to its *signal count*
+    until all wavefronts of the workgroup have launched.
----------------
Pierre-vh wrote:

I believe this property is important to mention, but I see how it's confusing. I removed the bullet point and put it at the bottom of the section as an additional note about how barrier *wait* behaves. 

> The question is whether one wave can signal the barrier twice in succession.

No, it can't, because of the rule you mentioned. We always need a wait after a signal.

But this is really not important; the gist of it is that a barrier wait cannot wake up until all waves have launched. I reframed this as a property of the barrier wait at the bottom of the section, which I hope clears up the confusion.

> As a consequence, we'd have wave-barrier-order instead of thread-barrier-order.

This is exactly how the specification started; I thought the same. After numerous discussions within our working group, we moved towards a thread-oriented specification. I can't remember the exact reasons, though. I think it had to do with keeping in line with other models, which are all thread-centric (convergence spec, memory models, etc.).
Making this spec wave-centric makes it unnecessarily hard to link everything together. e.g. I can't say "program-order" anymore unless I define what "program-order" means within a wave (and how it works with divergent control flow, and so on).

> barrier operations can only appear in wave-uniform control flow, and

I believe the statement about `Barrier operations are convergent within a wave` is already enough to guarantee this. @ssahasra is that correct, or do I need another statement to explicitly forbid using a barrier within non-uniform control flow?

> And there is also the question of what happens to the generic wording if we ever do add a per-thread barrier.

The generic wording/formal model isn't fixed in place. It can and will evolve over time. As long as old behavior is still supported, we can introduce new relations, decompose existing relations into multiple components, and so on.

https://github.com/llvm/llvm-project/pull/170447

