[llvm] f81cc8b - [AMDGPU] Update gfx1250 documentation. NFC (#160457)
via llvm-commits
llvm-commits at lists.llvm.org
Wed Sep 24 10:04:51 PDT 2025
Author: Stanislav Mekhanoshin
Date: 2025-09-24T10:04:47-07:00
New Revision: f81cc8bddcbc3561dbf9baa9ba48ffdae2443f3b
URL: https://github.com/llvm/llvm-project/commit/f81cc8bddcbc3561dbf9baa9ba48ffdae2443f3b
DIFF: https://github.com/llvm/llvm-project/commit/f81cc8bddcbc3561dbf9baa9ba48ffdae2443f3b.diff
LOG: [AMDGPU] Update gfx1250 documentation. NFC (#160457)
Added:
Modified:
llvm/docs/AMDGPUUsage.rst
Removed:
################################################################################
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index edabdc595a1f0..74b7604fda56d 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -979,11 +979,13 @@ supported for the ``amdgcn`` target.
access is not supported except by flat and scratch instructions in
GFX9-GFX11.
- Code that manipulates the stack values in other lanes of a wavefront,
- such as by ``addrspacecast``-ing stack pointers to generic ones and taking offsets
- that reach other lanes or by explicitly constructing the scratch buffer descriptor,
- triggers undefined behavior when it modifies the scratch values of other lanes.
- The compiler may assume that such modifications do not occur.
+ On targets without "Globally Accessible Scratch" (introduced in GFX125x), code that
+ manipulates the stack values in other lanes of a wavefront, such as by
+ ``addrspacecast``-ing stack pointers to generic ones and taking offsets that reach other
+ lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined
+ behavior when it modifies the scratch values of other lanes. The compiler may assume
+ that such modifications do not occur for such targets.
+
When using code object V5 ``LIBOMPTARGET_STACK_SIZE`` may be used to provide the
private segment size in bytes, for cases where a dynamic stack is used.
@@ -1515,6 +1517,88 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
List AMDGPU intrinsics.
+'``llvm.amdgcn.cooperative.atomic``' Intrinsics
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``llvm.amdgcn.cooperative.atomic`` :ref:`family of intrinsics<amdgpu-cooperative-atomic-intrinsics-table>`
+provides atomic load and store operations on a naturally-aligned, contiguous memory region.
+Memory is accessed cooperatively by a collection of convergent threads, with each thread accessing
+a fraction of the contiguous memory region.
+
+ .. TODO::
+
+ The memory model described here is imprecise; see SWDEV-536264.
+
+These intrinsics have a memory ordering and may be used to synchronize-with another cooperative atomic.
+If the memory ordering is relaxed, the operation may instead pair with a fence, provided that the same
+fence is executed by all participating threads with the same synchronization scope and set of address
+spaces.
+
+In both cases, a synchronize-with relation can only be established between cooperative atomics with the
+same total access size.
+
+Each target may have additional restrictions on how the intrinsic may be used; see
+:ref:`the table below<amdgpu-llvm-ir-cooperative-atomic-intrinsics-availability>`.
+Targets not covered in the table do not support these intrinsics.
+
+ .. table:: AMDGPU Cooperative Atomic Intrinsics Availability
+ :name: amdgpu-llvm-ir-cooperative-atomic-intrinsics-availability
+
+ =============== =============================================================
+ GFX Version Target Restrictions
+ =============== =============================================================
+ GFX 12.5 :ref:`amdgpu-amdhsa-memory-model-gfx125x-cooperative-atomics`
+ =============== =============================================================
+
+Using the intrinsic without meeting all of the above conditions, as well as any applicable
+target-specific conditions, causes undefined behavior.
+
+ .. table:: AMDGPU Cooperative Atomic Intrinsics
+ :name: amdgpu-cooperative-atomic-intrinsics-table
+
+ ======================================================= =========== ============ ==========
+ LLVM Intrinsic Number of Access Size Total Size
+ Threads Per Thread
+ Used
+ ======================================================= =========== ============ ==========
+ ``llvm.amdgcn.cooperative.atomic.store.32x4B`` 32 4B 128B
+
+ ``llvm.amdgcn.cooperative.atomic.load.32x4B`` 32 4B 128B
+
+ ``llvm.amdgcn.cooperative.atomic.store.16x8B`` 16 8B 128B
+
+ ``llvm.amdgcn.cooperative.atomic.load.16x8B`` 16 8B 128B
+
+ ``llvm.amdgcn.cooperative.atomic.store.8x16B`` 8 16B 128B
+
+ ``llvm.amdgcn.cooperative.atomic.load.8x16B`` 8 16B 128B
+
+ ======================================================= =========== ============ ==========
+
+The intrinsics are available for the global (``.p1`` suffix) and generic (``.p0`` suffix) address spaces.
+
+The atomic ordering operand (3rd operand for ``.store``, 2nd for ``.load``) is an integer that follows the
+C ABI encoding of atomic memory orderings. The supported values are in
+:ref:`the table below<amdgpu-cooperative-atomic-intrinsics-atomic-memory-orderings-table>`.
+
+ .. table:: AMDGPU Cooperative Atomic Intrinsics Atomic Memory Orderings
+ :name: amdgpu-cooperative-atomic-intrinsics-atomic-memory-orderings-table
+
+ ====== ================ =================================
+ Value Atomic Memory Notes
+ Ordering
+ ====== ================ =================================
+ ``0`` ``relaxed`` The default for unsupported values.
+
+ ``2`` ``acquire`` Only for ``.load``
+
+ ``3`` ``release`` Only for ``.store``
+
+ ``5`` ``seq_cst``
+ ====== ================ =================================
+
+The last argument of the intrinsic is the synchronization scope
+as a metadata string, which must be one of the supported :ref:`memory scopes<amdgpu-memory-scopes>`.
+
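+The following is a minimal, hypothetical sketch of how these intrinsics could appear in LLVM IR.
+The operand and return types shown are assumptions for illustration only (an ``i32`` per-thread
+value for the ``32x4B`` variants, and placeholder value names); only the operand order, the
+ordering encoding and the scope string follow the description above.
+
+.. code-block:: llvm
+
+   ; Assumed signature, for illustration only. Each of the 32 convergent lanes
+   ; stores its own 4B value with release ordering (3) at agent scope...
+   call void @llvm.amdgcn.cooperative.atomic.store.32x4B.p1(ptr addrspace(1) %dst, i32 %val, i32 3, metadata !"agent")
+   ; ...and later cooperatively loads a 128B region back with acquire ordering (2).
+   %v = call i32 @llvm.amdgcn.cooperative.atomic.load.32x4B.p1(ptr addrspace(1) %src, i32 2, metadata !"agent")
+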
.. _amdgpu_metadata:
LLVM IR Metadata
@@ -1843,6 +1927,7 @@ The AMDGPU backend supports the following LLVM IR attributes.
This is only relevant on targets with cluster support.
+
================================================ ==========================================================
Calling Conventions
@@ -5261,6 +5346,9 @@ The fields used by CP for code objects before V3 also match those specified in
GFX10-GFX12 (wavefront size 32)
- max_vgpr 1..256
- max(0, ceil(vgprs_used / 8) - 1)
+ GFX125X (wavefront size 32)
+ - max_vgpr 1..1024
+ - max(0, ceil(vgprs_used / 16) - 1)
Where vgprs_used is defined
as the highest VGPR number
@@ -6491,6 +6579,7 @@ following sections:
* :ref:`amdgpu-amdhsa-memory-model-gfx942`
* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
* :ref:`amdgpu-amdhsa-memory-model-gfx12`
+* :ref:`amdgpu-amdhsa-memory-model-gfx125x`
.. _amdgpu-fence-as:
@@ -16617,6 +16706,2022 @@ the instruction in the code sequence that references the table.
- system for OpenCL.*
============ ============ ============== ========== ================================
+.. _amdgpu-amdhsa-memory-model-gfx125x:
+
+Memory Model GFX125x
+++++++++++++++++++++++++
+
+For GFX125x:
+
+**Device Structure:**
+
+* Each agent has multiple shader engines (SE).
+* Each SE has multiple shader arrays (SA).
+* Each SA has multiple work-group processors (WGP).
+* Each WGP has 4 SIMD32 (2 SIMD32-pairs) that execute wavefronts.
+* The wavefronts for a single work-group are executed in the same
+ WGP.
+
+**Device Memory:**
+
+* Each WGP has a single write-through WGP cache (WGP$) shared by the wavefronts of the
+ work-groups executing on it. The WGP$ is divided between LDS and vector L0 memory.
+
+ * Vector L0 memory holds clean data only.
+
+* Each WGP$ has two request queues; one per SIMD32-pair.
+ Each queue can handle both LDS and vector L0 requests. Requests in one queue
+ are executed serially and in-order, but are not kept in order with the other queue.
+* The scalar memory operations access a scalar L0 cache shared by all wavefronts
+ on a WGP. The scalar and vector L0 caches are not kept coherent by hardware. However, scalar
+ operations are used in a restricted way and so do not impact the memory model. See
+ :ref:`amdgpu-amdhsa-memory-spaces`.
+* The vector and scalar memory L0 caches are both clients of an L1 buffer shared by
+ all WGPs on the same SE.
+* L1 buffers have a separate request queue for each WGP$ they serve. Requests in one queue
+ are executed serially and in-order, but are not kept in order with other queues.
+* L1 buffers are clients of the L2 cache.
+* There may be multiple L2 caches per agent. Ranges of virtual addresses can be set up to:
+
+ * Be non-hardware-coherent; copies of the data are not coherent between multiple L2s.
+ * Be read-write hardware-coherent with other L2 caches on the same or other agents.
+ * Bypass L2 entirely to ensure system coherence.
+
+* L2 caches have multiple memory channels to service disjoint ranges of virtual
+ addresses.
+
+**Memory Model:**
+
+.. note::
+
+ This section is currently incomplete as work on the compiler is still ongoing.
+ The following is a non-exhaustive list of unimplemented/undocumented features:
+ non-volatile bit code sequences, monitor and wait, globally accessing scratch atomics,
+ multicast loads, barriers (including split barriers) and cooperative atomics.
+ The memory model for scalar operations also needs further elaboration.
+
+* Vector memory operations are performed as wavefront wide operations, with the
+ ``EXEC`` mask predicating which lanes execute.
+* Consecutive vector memory operations from the same wavefront are issued in program order.
+ Vector memory operations are issued (and executed) in no particular order between wavefronts.
+* Wave execution of a vector memory operation instruction issues (initiates) the operation,
+ but completion occurs an unspecified amount of time later.
+ The ``s_wait_*cnt`` instructions must be used to determine if the operation has completed.
+* The types of vector memory operations (and their associated ``s_wait_*cnt`` instructions) are:
+
+ * Load (global, scratch, flat, buffer): ``s_wait_loadcnt``
+ * Store (global, scratch, flat, buffer): ``s_wait_storecnt``
+ * non-ASYNC LDS: ``s_wait_dscnt``
+ * ASYNC LDS: ``s_wait_asynccnt``
+ * Tensor: ``s_wait_tensorcnt``
+
+* ``s_wait_xcnt`` is a counter that is incremented when a memory operation is issued, and
+ decremented when memory address translation for that instruction is completed.
+ Waiting on a memory counter ``s_wait_*cnt N`` also waits on ``s_wait_xcnt N``.
+
+ * ``s_wait_xcnt 0x0`` is required before flat and global atomic stores/read-modify-write
+ operations to guarantee atomicity during a xnack replay.
+
+* Within a wavefront, vector memory operation completion (``s_wait_*cnt`` decrement) is
+ reported in order of issue within a type, but in no particular order between types.
+* Within a wavefront, the order in which data is returned to registers by a vector memory
+ operation can be different from the order in which the vector memory operations were issued.
+
+ * Thus, a ``s_wait_*cnt`` instruction must be used to prevent multiple vector memory operations
+ that return results to the same register from executing concurrently as they may not return
+ their results in instruction issue order, even though they will be reported as completed in
+ instruction issue order by the decrementing of the counter.
+
+* Within a wavefront, consecutive loads and stores to the same address will be processed in program
+ order by the memory subsystem. Loads and stores to different addresses may be processed out of
+ order with respect to each other.
+* All non-ASYNC LDS vector memory operations of a WGP are performed as wavefront wide
+ operations in a global order and involve no caching. Completion is reported to a wavefront in
+ execution order.
+* ASYNC LDS and tensor vector memory operations are not covered by the memory model implemented
+ by the AMDGPU backend. Neither ``s_wait_asynccnt`` nor ``s_wait_tensorcnt`` is inserted
+ automatically. They must be emitted using compiler built-in calls.
+* Some vector memory operations contain a ``SCOPE`` field with values
+ corresponding to each cache level. The ``SCOPE`` determines whether a cache
+ can complete an operation locally or whether it needs to forward the operation
+ to the next cache level. The ``SCOPE`` values are:
+
+ * ``SCOPE_CU``: WGP
+ * ``SCOPE_SE``: Shader Engine
+ * ``SCOPE_DEV``: Device/Agent
+ * ``SCOPE_SYS``: System
+
+* Each cache is assigned a ``SCOPE`` by the hardware depending on the agent's
+ configuration.
+
+ * This ensures that ``SCOPE_DEV`` can always be used to implement agent coherence,
+ even in the presence of multiple non-coherent L2 caches on the same agent.
+
+* When a vector memory operation with a given ``SCOPE`` reaches a cache with a smaller
+ ``SCOPE`` value, it is forwarded to the next level of cache.
+* When a vector memory operation with a given ``SCOPE`` reaches a cache with a ``SCOPE``
+ value greater than or equal to its own, the operation can proceed:
+
+ * Reads can hit into the cache.
+ * Writes can happen in this cache and completion (``s_wait`` decrement) can be
+ reported.
+ * RMW operations can be done locally.
+
+* Some memory operations contain an ``nv`` bit, for "non-volatile", which indicates
+ memory that is not expected to change during a kernel's execution.
+ This information is propagated to the cache lines for that address
+ (referred to as ``$nv``).
+
+ * When ``nv=0`` reads hit dirty ``$nv=1`` data in cache, the hardware will
+ writeback the data to the next level in the hierarchy and then subsequently read
+ it again, updating the cache line with a clean ``$nv=0`` copy of the data.
+
+* ``global_inv``, ``global_wb`` and ``global_wbinv`` are cache control instructions.
+ The affected cache(s) are controlled by the ``SCOPE`` of the instruction.
+ Only caches whose scope is strictly smaller than the instruction's are affected.
+
+ * ``global_inv`` invalidates the data in affected caches so that subsequent reads
+ will re-read from the next level in the cache hierarchy.
+ The invalidation requests cannot be reordered with pending or upcoming
+ memory operations. Instruction completion is reported using ``s_wait_loadcnt``.
+ * ``global_wb`` flushes the dirty data in affected caches to the next level in
+ the cache hierarchy. This instruction additionally ensures that previous
+ memory operations done at a lower scope level have reached the desired
+ ``SCOPE``. Instruction completion is reported using ``s_wait_storecnt`` once
+ all data has been acknowledged by the next level in the cache hierarchy.
+ * ``global_wbinv`` performs a ``global_wb`` then a ``global_inv``.
+ Instruction completion is reported using ``s_wait_storecnt``.
+ * ``global_inv``, ``global_wb`` and ``global_wbinv`` with ``nv=0`` can only
+ affect ``$nv=0`` cache lines, whereas ``nv=1`` can affect all cache lines.
+ * ``global_inv``, ``global_wb`` and ``global_wbinv`` behave like memory operations
+ issued to every address at the same time. They are kept in order with other
+ memory operations from the same wave.
+
+Scalar memory operations are only used to access memory that is proven to not
+change during the execution of the kernel dispatch. This includes constant
+address space and global address space for program scope ``const`` variables.
+Therefore, the kernel machine code does not have to maintain the scalar cache to
+ensure it is coherent with the vector caches. The scalar and vector caches are
+invalidated between kernel dispatches by CP since constant address space data
+may change between kernel dispatch executions. See
+:ref:`amdgpu-amdhsa-memory-spaces`.
+
+Atomics in the scratch address space are handled as follows:
+
+* Data types <= 32 bits: The instruction is converted into an atomic in the
+ generic (``flat``) address space. All properties of the atomic
+ (atomic ordering, volatility, alignment, etc.) are preserved.
+ Refer to the generic address space code sequences for further information.
+* Data types >32 bits: unsupported and an error is emitted.
+
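+As an illustration of the supported case, the following minimal LLVM IR sketch (value and
+pointer names are placeholders) shows a 32-bit scratch atomic that is handled by converting
+it into the equivalent generic address space atomic, preserving its ordering and scope:
+
+.. code-block:: llvm
+
+   ; 32-bit atomicrmw on a private (scratch, addrspace(5)) pointer: lowered as if it
+   ; were the corresponding generic (flat) address space atomic.
+   %old = atomicrmw add ptr addrspace(5) %slot, i32 1 syncscope("workgroup") monotonic, align 4
+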
+The code sequences used to implement the memory model for GFX125x are defined in
+table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table`.
+
+The mapping of LLVM IR syncscope to GFX125x instruction ``scope`` operands is
+defined in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+The table applies if and only if it is directly referenced by an entry in
+:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table`, and it only applies to
+the instruction in the code sequence that references the table.
+
+ .. table:: AMDHSA Memory Model Code Sequences GFX125x - Instruction Scopes
+ :name: amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table
+
+ ================================= =======================
+ LLVM syncscope ISA
+ ================================= =======================
+ *none*, one-as ``scope:SCOPE_SYS``
+ system, system-one-as ``scope:SCOPE_SYS``
+ agent, agent-one-as ``scope:SCOPE_DEV``
+ cluster, cluster-one-as ``scope:SCOPE_SE``
+ workgroup, workgroup-one-as ``scope:SCOPE_CU`` [1]_
+ wavefront, wavefront-one-as ``scope:SCOPE_CU`` [1]_
+ singlethread, singlethread-one-as ``scope:SCOPE_CU`` [1]_
+ ================================= =======================
+
+ .. [1] ``SCOPE_CU`` is the default ``scope:`` emitted by the compiler.
+ It will be omitted when instructions are emitted in textual form by the compiler.
+
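+As an example of how the scope mapping above combines with the code sequences below, consider
+the following minimal LLVM IR sketch (value and pointer names are placeholders). The ``agent``
+syncscope maps to ``scope:SCOPE_DEV``, and the store follows the
+``store atomic release - agent - global`` code sequence in the table below:
+
+.. code-block:: llvm
+
+   ; Lowered per the "store atomic release - agent - global" row: global_wb scope:SCOPE_DEV,
+   ; the release waits, s_wait_xcnt 0x0, then the store itself with scope:SCOPE_DEV.
+   store atomic i32 %flag, ptr addrspace(1) %p syncscope("agent") release, align 4
+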
+ .. table:: AMDHSA Memory Model Code Sequences GFX125x
+ :name: amdgpu-amdhsa-memory-model-code-sequences-gfx125x-table
+
+ ============ ============ ============== ========== ================================
+ LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
+ Ordering Sync Scope Address GFX125x
+ Space
+ ============ ============ ============== ========== ================================
+ **Non-Atomic**
+ ------------------------------------------------------------------------------------
+ load *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_load
+ - constant
+ - !volatile & nontemporal
+
+ 1. buffer/global/flat_load
+ ``th:TH_LOAD_NT``
+
+ - volatile
+
+ 1. buffer/global/flat_load
+ ``scope:SCOPE_SYS``
+
+ 2. ``s_wait_loadcnt 0x0``
+
+ - Must happen before
+ any following volatile
+ global/generic
+ load/store.
+ - Ensures that
+ volatile
+ operations to
+ different
+ addresses will not
+ be reordered by
+ hardware.
+
+ load *none* *none* - local 1. ds_load
+ store *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_store
+ - constant
+ - !volatile & nontemporal
+
+ 1. buffer/global/flat_store
+ ``th:TH_STORE_NT``
+
+ - volatile
+
+ 1. buffer/global/flat_store
+ ``scope:SCOPE_SYS``
+
+ 2. ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ any following volatile
+ global/generic
+ load/store.
+ - Ensures that
+ volatile
+ operations to
+ different
+ addresses will not
+ be reordered by
+ hardware.
+
+ store *none* *none* - local 1. ds_store
+ **Unordered Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic unordered *any* *any* *Same as non-atomic*.
+ store atomic unordered *any* *any* *Same as non-atomic*.
+ atomicrmw unordered *any* *any* *Same as monotonic atomic*.
+ **Monotonic Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic monotonic - singlethread - global 1. buffer/global/flat_load
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - cluster
+ - agent
+ - system
+ load atomic monotonic - singlethread - local 1. ds_load
+ - wavefront
+ - workgroup
+ store atomic monotonic - singlethread - global 1. ``s_wait_xcnt 0x0``
+ - wavefront - generic
+ - workgroup - Ensure operation remains atomic even during a xnack replay.
+ - cluster - Only needed for ``flat`` and ``global`` operations.
+ - agent
+ - system 2. buffer/global/flat_store
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ store atomic monotonic - singlethread - local 1. ds_store
+ - wavefront
+ - workgroup
+ atomicrmw monotonic - singlethread - global 1. ``s_wait_xcnt 0x0``
+ - wavefront - generic
+ - workgroup - Ensure operation remains atomic even during a xnack replay.
+ - cluster - Only needed for ``flat`` and ``global`` operations.
+ - agent
+ - system 2. buffer/global/flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ atomicrmw monotonic - singlethread - local 1. ds_atomic
+ - wavefront
+ - workgroup
+ **Acquire Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
+ - wavefront - local
+ - generic
+ load atomic acquire - workgroup - global 1. buffer/global_load
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. ``s_wait_loadcnt 0x0``
+
+ - Must happen before any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+
+
+ load atomic acquire - workgroup - local 1. ds_load
+ 2. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Must happen before any following
+ global/generic load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than the local load
+ atomic value being
+ acquired.
+
+
+ load atomic acquire - workgroup - generic 1. flat_load
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Must happen before any
+ following global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than a local load
+ atomic value being
+ acquired.
+
+ load atomic acquire - cluster - global 1. buffer/global_load
+ - agent
+ - system - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. ``s_wait_loadcnt 0x0``
+
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the load
+ has completed
+ before invalidating
+ the caches.
+
+ 3. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following
+ loads will not see
+ stale global data.
+
+ load atomic acquire - cluster - generic 1. flat_load
+ - agent
+ - system - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the flat_load
+ has completed
+ before invalidating
+ the caches.
+
+ 3. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ atomicrmw acquire - singlethread - global 1. ``s_wait_xcnt 0x0``
+ - wavefront - local
+ - generic - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 2. buffer/global/ds/flat_atomic
+
+ atomicrmw acquire - workgroup - global 1. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 2. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 3. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - Must happen before any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+
+ atomicrmw acquire - workgroup - local 1. ds_atomic
+ 2. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Ensures any
+ following global
+ data read is no
+ older than the local
+ atomicrmw value
+ being acquired.
+
+
+ atomicrmw acquire - workgroup - generic 1. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+
+ 2. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 3. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Ensures any
+ following global
+ data read is no
+ older than the local
+ atomicrmw value
+ being acquired.
+
+ atomicrmw acquire - cluster - global 1. ``s_wait_xcnt 0x0``
+ - agent
+ - system - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``global`` operations.
+
+ 2. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 3. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ following ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 4. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ atomicrmw acquire - cluster - generic 1. ``s_wait_xcnt 0x0``
+ - agent
+ - system - Ensure operation remains atomic even during a xnack replay.
+
+ 2. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 3. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 4. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ fence acquire - singlethread *none* *none*
+ - wavefront
+ fence acquire - workgroup *none* 1. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - If OpenCL and address space is local,
+ omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ atomicrmw-no-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Ensures that the
+ fence-paired atomic
+ has completed
+ before invalidating
+ the
+ cache. Therefore
+ any following
+ locations read must
+ be no older than
+ the value read by
+ the
+ fence-paired-atomic.
+
+
+ fence acquire - cluster *none* 1. | ``s_wait_storecnt 0x0``
+ - agent | ``s_wait_loadcnt 0x0``
+ - system | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and address space is
+ local, omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ atomicrmw-no-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Must happen before
+ the following
+ ``global_inv``
+ - Ensures that the
+ fence-paired atomic
+ has completed
+ before invalidating the
+ caches. Therefore
+ any following
+ locations read must
+ be no older than
+ the value read by
+ the
+ fence-paired-atomic.
+
+ 2. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ **Release Atomic**
+ ------------------------------------------------------------------------------------
+ store atomic release - singlethread - global 1. ``s_wait_xcnt 0x0``
+ - wavefront - local
+ - generic - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 2. buffer/global/ds/flat_store
+
+ store atomic release - workgroup - global 1. | ``s_wait_storecnt 0x0``
+ - cluster - generic | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following store.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 2. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 3. buffer/global/flat_store
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ store atomic release - workgroup - local 1. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - Must happen before the
+ following store.
+ - Ensures that all
+ global memory
+ operations have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 2. ds_store
+ store atomic release - agent - global 1. ``global_wb``
+ - system - generic
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb`` or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following store.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 3. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 4. buffer/global/flat_store
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ atomicrmw release - singlethread - global 1. ``s_wait_xcnt 0x0``
+ - wavefront - local
+ - generic - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 2. buffer/global/ds/flat_atomic
+ atomicrmw release - workgroup - global 1. | ``s_wait_storecnt 0x0``
+ - cluster - generic | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following atomic.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 2. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 3. buffer/global/flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ atomicrmw release - workgroup - local 1. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit all.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - Must happen before the
+ following atomic.
+ - Ensures that all
+ global memory
+ operations have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 2. ds_atomic
+ atomicrmw release - agent - global 1. ``global_wb``
+ - system - generic
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb`` or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following atomic.
+ - Ensures that all
+ memory operations
+ to global and local
+ have completed
+ before performing
+ the atomicrmw that
+ is being released.
+
+ 3. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 4. buffer/global/flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ fence release - singlethread *none* *none*
+ - wavefront
+ fence release - workgroup *none* 1. | ``s_wait_storecnt 0x0``
+ - cluster | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and
+ address space is
+ local, omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store atomic/
+ atomicrmw.
+ - Must happen before
+ any following store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ following
+ fence-paired-atomic.
+
+ fence release - agent *none* 1. ``global_wb``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **OpenCL:**
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and address space is local,
+ omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb`` or
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ any following store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ following
+ fence-paired-atomic.
+
+ **Acquire-Release Atomic**
+ ------------------------------------------------------------------------------------
+ atomicrmw acq_rel - singlethread - global 1. ``s_wait_xcnt 0x0``
+ - wavefront - local
+ - generic - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 2. buffer/global/ds/flat_atomic
+ atomicrmw acq_rel - workgroup - global 1. | ``s_wait_storecnt 0x0``
+ - cluster | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - Must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 2. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``flat`` and ``global`` operations.
+
+ 3. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return, use
+ ``th:TH_ATOMIC_RETURN``.
+
+ 4. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - Ensures any
+ following global
+ data read is no
+ older than the
+ atomicrmw value
+ being acquired.
+
+ atomicrmw acq_rel - workgroup - local 1. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - Must happen before
+ the following
+ store.
+ - Ensures that all
+ global memory
+ operations have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 2. ds_atomic
+ 3. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Ensures any
+ following global
+ data read is no
+ older than the local load
+ atomic value being
+ acquired.
+
+ atomicrmw acq_rel - workgroup - generic 1. | ``s_wait_storecnt 0x0``
+ - cluster | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store
+ atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 2. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+
+ 3. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``.
+
+ 4. | **Atomic without return:**
+ | ``s_wait_dscnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Ensures any
+ following global
+ data read is no
+ older than the load
+ atomic value being
+ acquired.
+
+
+ atomicrmw acq_rel - agent - global 1. ``global_wb``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ to global have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+ - Only needed for ``global`` operations.
+
+ 4. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return, use
+ ``th:TH_ATOMIC_RETURN``.
+
+ 5. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 6. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ atomicrmw acq_rel - agent - generic 1. ``global_wb``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. ``s_wait_xcnt 0x0``
+
+ - Ensure operation remains atomic even during a xnack replay.
+
+ 4. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - If atomic with return, use
+ ``th:TH_ATOMIC_RETURN``.
+
+ 5. | **Atomic with return:**
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``.
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 6. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ fence acq_rel - singlethread *none* *none*
+ - wavefront
+ fence acq_rel - workgroup *none* 1. | ``s_wait_storecnt 0x0``
+ - cluster | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL and
+ address space is
+ not generic, omit
+ ``s_wait_dscnt 0x0``
+ - If OpenCL and
+ address space is
+ local, omit
+ all but ``s_wait_dscnt 0x0``.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ store/store atomic/
+ atomicrmw-no-return-value.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store atomic/
+ atomicrmw.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing any
+ following global
+ memory operations.
+ - Ensures that the
+ preceding
+ local/generic load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ acquire-fence-paired-atomic)
+ has completed
+ before following
+ global memory
+ operations. This
+ satisfies the
+ requirements of
+ acquire.
+ - Ensures that all
+ previous memory
+ operations have
+ completed before a
+ following
+ local/generic store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ release-fence-paired-atomic).
+ This satisfies the
+ requirements of
+ release.
+ - Ensures that the
+ acquire-fence-paired
+ atomic has completed
+ before invalidating
+ the
+ cache. Therefore
+ any following
+ locations read must
+ be no older than
+ the value read by
+ the
+ acquire-fence-paired-atomic.
+
+ fence acq_rel - agent *none* 1. ``global_wb``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+
+ 2. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL and
+ address space is
+ not generic, omit
+ ``s_wait_dscnt 0x0``
+ - If OpenCL and
+ address space is
+ local, omit
+ all but ``s_wait_dscnt 0x0``.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ ``global_inv``
+ - Ensures that the
+ preceding
+ global/local/generic
+ load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ acquire-fence-paired-atomic)
+ has completed
+ before invalidating
+ the caches. This
+ satisfies the
+ requirements of
+ acquire.
+ - Ensures that all
+ previous memory
+ operations have
+ completed before a
+ following
+ global/local/generic
+ store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ release-fence-paired-atomic).
+ This satisfies the
+ requirements of
+ release.
+
+ 3. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx125x-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data. This
+ satisfies the
+ requirements of
+ acquire.
+
+ **Sequential Consistent Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic seq_cst - singlethread - global *Same as corresponding
+ - wavefront - local load atomic acquire,
+ - generic except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - global 1. | ``s_wait_storecnt 0x0``
+ - generic | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_dscnt 0x0`` must
+ happen after
+ preceding
+ local/generic load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_dscnt 0x0``
+ and so do not need to be
+ considered.)
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own waits and so do
+ not need to be
+ considered.)
+ - ``s_wait_storecnt 0x0``
+ Must happen after
+ preceding
+ global/generic store
+ atomic/
+ atomicrmw-no-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_storecnt 0x0``
+ and so do not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent global/local
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ ``s_wait``\s of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+ order. The ``s_wait``\s
+ could be placed after
+ seq_store or before
+ the seq_load. We
+ choose the load to
+ make the ``s_wait``\s be
+ as late as possible
+ so that the store
+ may have already
+ completed.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - local 1. | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit all.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait``\s and so do
+ not need to be
+ considered.)
+ - ``s_wait_storecnt 0x0``
+ Must happen after
+ preceding
+ global/generic store
+ atomic/
+ atomicrmw-no-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_storecnt 0x0``
+ and so do
+ not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent global
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ ``s_wait``\s of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+ order. The ``s_wait``\s
+ could be placed after
+ seq_store or before
+ the seq_load. We
+ choose the load to
+ make the ``s_wait``\s be
+ as late as possible
+ so that the store
+ may have already
+ completed.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+
+ load atomic seq_cst - cluster - global 1. | ``s_wait_storecnt 0x0``
+ - agent - generic | ``s_wait_loadcnt 0x0``
+ - system | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ preceding
+ local load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_dscnt 0x0``
+ and so do
+ not need to be
+ considered.)
+ - ``s_wait_loadcnt 0x0``
+ must happen after
+ preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait``\s and so do
+ not need to be
+ considered.)
+ - ``s_wait_storecnt 0x0``
+ Must happen after
+ preceding
+ global/generic store
+ atomic/
+ atomicrmw-no-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own
+ ``s_wait_storecnt 0x0`` and so do
+ not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent global
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ ``s_wait``\s of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+ order. The ``s_wait``\s
+ could be placed after
+ seq_store or before
+ the seq_load. We
+ choose the load to
+ make the ``s_wait``\s be
+ as late as possible
+ so that the store
+ may have already
+ completed.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+ store atomic seq_cst - singlethread - global *Same as corresponding
+ - wavefront - local store atomic release,
+ - workgroup - generic except must generate
+ - cluster all instructions even
+ - agent for OpenCL.*
+ - system
+ atomicrmw seq_cst - singlethread - global *Same as corresponding
+ - wavefront - local atomicrmw acq_rel,
+ - workgroup - generic except must generate
+ - cluster all instructions even
+ - agent for OpenCL.*
+ - system
+ fence seq_cst - singlethread *none* *Same as corresponding
+ - wavefront fence acq_rel,
+ - workgroup except must generate
+ - cluster all instructions even
+ - agent for OpenCL.*
+ - system
+ ============ ============ ============== ========== ================================
+
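+As a usage illustration of the table above, the following minimal LLVM IR sketch (names are
+placeholders) shows an acquire fence at agent scope pairing with a prior relaxed atomic load;
+it is lowered per the ``fence acquire - agent`` row, i.e. the ``s_wait_*cnt 0x0`` waits followed
+by ``global_inv`` with ``scope:SCOPE_DEV``:
+
+.. code-block:: llvm
+
+   ; Relaxed (monotonic) load of a flag, then an acquire fence at agent scope.
+   ; The fence lowers to the s_wait_*cnt 0x0 waits and a global_inv scope:SCOPE_DEV.
+   %f = load atomic i32, ptr addrspace(1) %flag syncscope("agent") monotonic, align 4
+   fence syncscope("agent") acquire
+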
+.. _amdgpu-amdhsa-memory-model-gfx125x-cooperative-atomics:
+
+'``llvm.amdgcn.cooperative.atomic``' Intrinsics
+###############################################
+
+The collection of convergent threads participating in a cooperative atomic must belong
+to the same wave32.
+
+Only naturally-aligned, contiguous groups of lanes may be used;
+see :ref:`the table below<gfx125x-cooperative-atomic-intrinsics-table>` for the set of
+possible lane groups.
+Cooperative atomics may be executed by more than one group per wave.
+Using an unsupported lane group, or using more lane groups per wave than the maximum,
+causes undefined behavior.
+
+Using the intrinsic also causes undefined behavior if it loads or stores to addresses that:
+
+* Are not in the global address space (e.g.: private and local address spaces).
+* Are only reachable through a bus that does not support 128B/256B requests
+ (e.g.: host memory over PCIe).
+* Are otherwise unsupported (TBD, needs refinement).
+
+.. TODO::
+
+ Enumerate all cases where UB is invoked when using this intrinsic instead of hand-waving
+ "specific global memory locations".
+
+.. table:: GFX125x Cooperative Atomic Intrinsics
+ :name: gfx125x-cooperative-atomic-intrinsics-table
+
+ ======================================================= =======================================
+ LLVM Intrinsic Lane Groups
+ ======================================================= =======================================
+ ``llvm.amdgcn.cooperative.atomic.store.32x4B`` ``0-31``
+
+ ``llvm.amdgcn.cooperative.atomic.load.32x4B`` ``0-31``
+
+ ``llvm.amdgcn.cooperative.atomic.store.16x8B`` ``0-15``, ``16-31``
+
+ ``llvm.amdgcn.cooperative.atomic.load.16x8B`` ``0-15``, ``16-31``
+
+ ``llvm.amdgcn.cooperative.atomic.store.8x16B`` ``0-7``, ``8-15``, ``16-23``, ``24-31``
+
+ ``llvm.amdgcn.cooperative.atomic.load.8x16B`` ``0-7``, ``8-15``, ``16-23``, ``24-31``
+
+ ======================================================= =======================================
+
.. _amdgpu-amdhsa-trap-handler-abi:
Trap Handler ABI