[llvm] [AMDGPU] Document GFX12 Memory Model (PR #98599)
Pierre van Houtryve via llvm-commits
llvm-commits at lists.llvm.org
Mon Aug 19 03:10:58 PDT 2024
https://github.com/Pierre-vh updated https://github.com/llvm/llvm-project/pull/98599
>From 65095565bdf729a0afe1ecdfdd793b7e1d71e8a9 Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Fri, 12 Jul 2024 09:37:08 +0200
Subject: [PATCH 1/6] [AMDGPU] Document GFX12 Memory Model
Document the memory model implemented as of #98591
---
llvm/docs/AMDGPUUsage.rst | 2261 +++++++++++++++++++++++++++++++++++++
1 file changed, 2261 insertions(+)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 97fca32d4ece66..816f7fb00ee98e 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -6098,6 +6098,7 @@ following sections:
* :ref:`amdgpu-amdhsa-memory-model-gfx90a`
* :ref:`amdgpu-amdhsa-memory-model-gfx942`
* :ref:`amdgpu-amdhsa-memory-model-gfx10-gfx11`
+* :ref:`amdgpu-amdhsa-memory-model-gfx12`
.. _amdgpu-fence-as:
@@ -14078,6 +14079,2266 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx10-gfx11-table`.
- system for OpenCL.*
============ ============ ============== ========== ================================
+
+.. _amdgpu-amdhsa-memory-model-gfx12:
+
+Memory Model GFX12
+++++++++++++++++++++++++
+
+For GFX12:
+
+* Each agent has multiple shader arrays (SA).
+* Each SA has multiple work-group processors (WGP).
+* Each WGP has multiple compute units (CU).
+* Each CU has multiple SIMDs that execute wavefronts.
+* The wavefronts for a single work-group are executed in the same
+ WGP.
+
+ * In CU wavefront execution mode the wavefronts may be executed by different SIMDs
+ in the same CU.
+ * In WGP wavefront execution mode the wavefronts may be executed by different SIMDs
+ in different CUs in the same WGP.
+
+* Each WGP has a single LDS memory shared by the wavefronts of the work-groups
+ executing on it.
+* All LDS operations of a WGP are performed as wavefront wide operations in a
+ global order and involve no caching. Completion is reported to a wavefront in
+ execution order.
+* The LDS memory has multiple request queues shared by the SIMDs of a
+ WGP. Therefore, the LDS operations performed by different wavefronts of a
+ work-group can be reordered relative to each other, which can result in
+ reordering the visibility of vector memory operations with respect to LDS
+  operations of other wavefronts in the same work-group. An ``s_wait_dscnt 0x0``
+  is required to ensure synchronization between the LDS operations and vector
+  memory operations of different wavefronts in a work-group, but not between
+  operations performed by the same wavefront.
+* The vector memory operations are performed as wavefront wide operations.
+  Vector memory operations are divided into different types. Completion of a
+ vector memory operation is reported to a wavefront in-order within a type,
+ but may be out of order between types. The types of vector memory operations
+ (and their associated ``s_wait`` instructions) are:
+
+ * LDS: ``s_wait_dscnt``
+ * Load (global, scratch, flat, buffer and image): ``s_wait_loadcnt``
+ * Store (global, scratch, flat, buffer and image): ``s_wait_storecnt``
+ * Sample and Gather4: ``s_wait_samplecnt``
+ * BVH: ``s_wait_bvhcnt``
+
+* Vector and scalar memory instructions contain a ``SCOPE`` field with values
+ corresponding to each cache level. The ``SCOPE`` determines whether a cache
+ can complete an operation locally or whether it needs to forward the operation
+ to the next cache level. The ``SCOPE`` values are:
+
+ * ``SCOPE_CU``: Compute Unit (NOTE: not affected by CU/WGP mode)
+ * ``SCOPE_SE``: Shader Engine
+ * ``SCOPE_DEV``: Device/Agent
+ * ``SCOPE_SYS``: System
+
+* When a memory operation with a given ``SCOPE`` reaches a cache with a smaller
+ ``SCOPE`` value, it is forwarded to the next level of cache.
+* When a memory operation with a given ``SCOPE`` reaches a cache with a ``SCOPE``
+ value greater than or equal to its own, the operation can proceed:
+
+  * Reads can hit in the cache.
+ * Writes can happen in this cache and the transaction is acknowledged
+ from this level of cache.
+ * RMW operations can be done locally.
+
+* ``global_inv``, ``global_wb`` and ``global_wbinv`` instructions are used to
+ invalidate, write-back and write-back+invalidate caches. The affected
+ cache(s) are controlled by the ``SCOPE:`` of the instruction.
+* ``global_inv`` invalidates caches whose scope is strictly smaller than the
+ instruction's. The invalidation requests cannot be reordered with pending or
+ upcoming memory operations.
+* ``global_wb`` additionally ensures that previous memory operations performed
+  at a lower scope have reached the ``SCOPE:`` of the ``global_wb``. A sketch
+  showing how scopes, waits and invalidates interact follows this list.
+* The vector memory operations access a vector L0 cache. There is a single L0
+ cache per CU. Each SIMD of a CU accesses the same L0 cache. Therefore, no
+ special action is required for coherence between the lanes of a single
+ wavefront. To achieve coherence between wavefronts executing in the same
+ work-group:
+
+ * In CU wavefront execution mode, no special action is required.
+ * In WGP wavefront execution mode, a ``global_inv scope:SCOPE_CU`` is required
+ as wavefronts may be executing on SIMDs of different CUs that access different L0s.
+
+* The scalar memory operations access a scalar L0 cache shared by all wavefronts
+ on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
+  operations are used in a restricted way so they do not impact the memory model. See
+ :ref:`amdgpu-amdhsa-memory-spaces`.
+* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
+ the same SA. Therefore, no special action is required for coherence between
+ the wavefronts of a single work-group. However, a ``global_inv scope:SCOPE_DEV`` is
+ required for coherence between wavefronts executing in different work-groups
+ as they may be executing on different SAs that access different L1s.
+* The L1 caches have independent quadrants to service disjoint ranges of virtual
+ addresses.
+* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
+ vector and scalar memory operations performed by different wavefronts, whether
+ executing in the same or different work-groups (which may be executing on
+ different CUs accessing different L0s), can be reordered relative to each
+ other. Some or all of the wait instructions below are required to ensure
+  synchronization between vector memory operations of different wavefronts. They
+  ensure a previous vector memory operation has completed before executing a
+ subsequent vector memory or LDS operation and so can be used to meet the
+ requirements of acquire, release and sequential consistency.
+
+ * ``s_wait_loadcnt 0x0``
+ * ``s_wait_samplecnt 0x0``
+ * ``s_wait_bvhcnt 0x0``
+ * ``s_wait_storecnt 0x0``
+
+* The L1 caches use an L2 cache shared by all SAs on the same agent.
+* The L2 cache has independent channels to service disjoint ranges of virtual
+ addresses.
+* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
+ quadrant has a separate request queue per L2 channel. Therefore, the vector
+ and scalar memory operations performed by wavefronts executing in different
+ work-groups (which may be executing on different SAs) of an agent can be
+ reordered relative to each other. Some or all of the wait instructions below are
+ required to ensure synchronization between vector memory operations of
+  different SAs. They ensure a previous vector memory operation has completed
+  before executing a subsequent vector memory operation and so can be used to
+  meet the requirements of acquire, release and sequential consistency.
+
+ * ``s_wait_loadcnt 0x0``
+ * ``s_wait_samplecnt 0x0``
+ * ``s_wait_bvhcnt 0x0``
+ * ``s_wait_storecnt 0x0``
+
+* The L2 cache can be kept coherent with other agents, or ranges
+ of virtual addresses can be set up to bypass it to ensure system coherence.
+* A memory attached last level (MALL) cache exists for GPU memory.
+ The MALL cache is fully coherent with GPU memory and has no impact on system
+ coherence. All agents (GPU and CPU) access GPU memory through the MALL cache.
+
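+As a minimal illustration of how the scopes, waits and invalidates described
+above interact, consider a hand-written agent-scope acquire load from global
+memory (a sketch, not compiler output; register operands are illustrative)::
+
+  global_load_b32 v0, v[0:1], off scope:SCOPE_DEV ; forwarded to the L2 cache
+  s_wait_loadcnt 0x0                              ; the load has completed
+  global_inv scope:SCOPE_DEV                      ; invalidate caches below device scope
+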
+Scalar memory operations are only used to access memory that is proven to not
+change during the execution of the kernel dispatch. This includes constant
+address space and global address space for program scope ``const`` variables.
+Therefore, the kernel machine code does not have to maintain the scalar cache to
+ensure it is coherent with the vector caches. The scalar and vector caches are
+invalidated between kernel dispatches by CP since constant address space data
+may change between kernel dispatch executions. See
+:ref:`amdgpu-amdhsa-memory-spaces`.
+
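+For example, a load from the constant address space may be selected to a scalar
+load precisely because the memory is known not to change during the dispatch (a
+sketch; register operands are illustrative)::
+
+  ; LLVM IR: %v = load i32, ptr addrspace(4) %p, align 4
+  s_load_b32 s0, s[0:1], 0x0 ; serviced by the scalar L0 cache
+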
+For kernarg backing memory:
+
+* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
+* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
+ needing to invalidate the L2 cache.
+* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
+ so the L2 cache will be coherent with the CPU and other agents.
+
+Scratch backing memory (which is used for the private address space) is accessed
+with MTYPE NC (non-coherent). Since the private address space is only accessed
+by a single thread, and is always write-before-read, there is never a need to
+invalidate these entries from the L0 or L1 caches.
+
+Wavefronts can be executed in WGP or CU wavefront execution mode:
+
+* In WGP wavefront execution mode the wavefronts of a work-group are executed
+ on the SIMDs of both CUs of the WGP. Therefore, explicit management of the per
+ CU L0 caches is required for work-group synchronization. Also accesses to L1
+ at work-group scope need to be explicitly ordered as the accesses from
+ different CUs are not ordered.
+* In CU wavefront execution mode the wavefronts of a work-group are executed on
+  the SIMDs of a single CU of the WGP. Therefore, all global memory accesses by
+  the work-group access the same L0, which in turn ensures L1 accesses are
+ ordered and so do not require explicit management of the caches for
+ work-group synchronization.
+
+See the ``WGP_MODE`` field in
+:ref:`amdgpu-amdhsa-compute_pgm_rsrc1-gfx6-gfx12-table` and
+:ref:`amdgpu-target-features`.
+
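+For example, assuming the standalone ``llc`` tool, CU wavefront execution mode
+can be requested with the ``cumode`` target feature (whether CU or WGP mode is
+the default depends on the target and driver)::
+
+  llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200 -mattr=+cumode kernel.ll
+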
+The code sequences used to implement the memory model for GFX12 are defined in
+table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
+
+ .. table:: AMDHSA Memory Model Code Sequences GFX12 - Instruction Scopes
+ :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table
+
+ =================== =================== ===================
+ LLVM syncscope CU wavefront WGP wavefront
+ execution execution
+ mode mode
+ =================== =================== ===================
+ *none* ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ system ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ agent ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
+ workgroup *none* ``scope:SCOPE_SE``
+ wavefront *none* *none*
+ singlethread *none* *none*
+ one-as ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ system-one-as ``scope:SCOPE_SYS`` ``scope:SCOPE_SYS``
+ agent-one-as ``scope:SCOPE_DEV`` ``scope:SCOPE_DEV``
+ workgroup-one-as *none* ``scope:SCOPE_SE``
+ wavefront-one-as *none* *none*
+ singlethread-one-as *none* *none*
+ =================== =================== ===================
+
+NOTE: The table above applies if and only if it is explicitly referenced by
+a code sequence in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
+
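+For instance, a workgroup-scope monotonic load from global memory picks up its
+scope modifier from the table above (a sketch, assuming the monotonic load row
+of the table below; register operands are illustrative)::
+
+  ; LLVM IR: %v = load atomic i32, ptr addrspace(1) %p syncscope("workgroup") monotonic, align 4
+  global_load_b32 v0, v[0:1], off scope:SCOPE_SE ; WGP wavefront execution mode
+  global_load_b32 v0, v[0:1], off                ; CU wavefront execution mode: *none*
+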
+ .. table:: AMDHSA Memory Model Code Sequences GFX12
+ :name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-table
+
+ ============ ============ ============== ========== ================================
+ LLVM Instr LLVM Memory LLVM Memory AMDGPU AMDGPU Machine Code
+ Ordering Sync Scope Address GFX12
+ Space
+ ============ ============ ============== ========== ================================
+ **Non-Atomic**
+ ------------------------------------------------------------------------------------
+ load *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_load
+ - constant
+ - !volatile & nontemporal
+
+ 1. buffer/global/flat_load
+ ``th:TH_LOAD_NT``
+
+ - volatile
+
+ 1. buffer/global/flat_load
+ ``scope:SCOPE_SYS``
+
+ 2. ``s_wait_bvhcnt 0x0``
+ ``s_wait_samplecnt 0x0``
+ ``s_wait_loadcnt 0x0``
+
+ - Must happen before
+ any following volatile
+ global/generic
+ load/store.
+ - Ensures that
+ volatile
+ operations to
+ different
+ addresses will not
+ be reordered by
+ hardware.
+
+ load *none* *none* - local 1. ds_load
+ store *none* *none* - global - !volatile & !nontemporal
+ - generic
+ - private 1. buffer/global/flat_store
+ - constant
+ - !volatile & nontemporal
+
+ 1. buffer/global/flat_store
+ ``th:TH_STORE_NT``
+
+ - volatile
+
+ 1. buffer/global/flat_store
+ ``scope:SCOPE_SYS``
+
+ 2. ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ any following volatile
+ global/generic
+ load/store.
+ - Ensures that
+ volatile
+ operations to
+ different
+ addresses will not
+ be reordered by
+ hardware.
+
+ store *none* *none* - local 1. ds_store
+ **Unordered Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic unordered *any* *any* *Same as non-atomic*.
+ store atomic unordered *any* *any* *Same as non-atomic*.
+ atomicrmw unordered *any* *any* *Same as monotonic atomic*.
+ **Monotonic Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic monotonic - singlethread - global 1. buffer/global/flat_load
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - agent
+ - system
+ load atomic monotonic - singlethread - local 1. ds_load
+ - wavefront
+ - workgroup
+ store atomic monotonic - singlethread - global 1. buffer/global/flat_store
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - agent
+ - system
+ store atomic monotonic - singlethread - local 1. ds_store
+ - wavefront
+ - workgroup
+ atomicrmw monotonic - singlethread - global 1. buffer/global/flat_atomic
+ - wavefront - generic
+ - workgroup - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - agent
+ - system
+ atomicrmw monotonic - singlethread - local 1. ds_atomic
+ - wavefront
+ - workgroup
+ **Acquire Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic acquire - singlethread - global 1. buffer/global/ds/flat_load
+ - wavefront - local
+ - generic
+ load atomic acquire - workgroup - global 1. buffer/global_load ``scope:SCOPE_SE``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Must happen before
+ the following ``global_inv``
+ and before any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ load atomic acquire - workgroup - local 1. ds_load
+ 2. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Must happen before
+ the following ``global_inv``
+ and before any following
+ global/generic load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than the local load
+ atomic value being
+ acquired.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If OpenCL or CU wavefront
+ execution mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ load atomic acquire - workgroup - generic 1. flat_load ``scope:SCOPE_SE``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Must happen before
+ the following
+ ``global_inv`` and any
+ following global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures any
+ following global
+ data read is no
+ older than a local load
+ atomic value being
+ acquired.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ load atomic acquire - agent - global 1. buffer/global_load
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the load
+ has completed
+ before invalidating
+ the caches.
+
+ 3. ``global_inv scope:``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following
+ loads will not see
+ stale global data.
+
+ load atomic acquire - agent - generic 1. flat_load
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the flat_load
+ has completed
+ before invalidating
+ the caches.
+
+ 3. ``global_inv scope:``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ atomicrmw acquire - singlethread - global 1. buffer/global/ds/flat_atomic
+ - wavefront - local
+ - generic
+ atomicrmw acquire - workgroup - global 1. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 2. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Must happen before
+ the following ``global_inv``
+ and before any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ atomicrmw acquire - workgroup - local 1. ds_atomic
+ 2. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures any
+ following global
+ data read is no
+ older than the local
+ atomicrmw value
+ being acquired.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+                                                                   - If OpenCL, omit.
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ atomicrmw acquire - workgroup - generic 1. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 2. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - If CU wavefront execution mode,
+ omit all for atomics without
+ return, and only emit
+ ``s_wait_dscnt 0x0`` for atomics
+ with return.
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures any
+ following global
+ data read is no
+ older than a local
+ atomicrmw value
+ being acquired.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ atomicrmw acquire - agent - global 1. buffer/global_atomic
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 2. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ following ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 3. ``global_inv scope:``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ atomicrmw acquire - agent - generic 1. flat_atomic
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``
+
+ 2. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+                                                                   - If OpenCL, omit
+                                                                     ``s_wait_dscnt 0x0``.
+                                                                   - Must happen before
+                                                                     following
+                                                                     ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 3. ``global_inv scope:``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ fence acquire - singlethread *none* *none*
+ - wavefront
+ fence acquire - workgroup *none* 1. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - If OpenCL and address space is local,
+ omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ atomicrmw-no-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures that the
+ fence-paired atomic
+ has completed
+ before invalidating
+ the
+ cache. Therefore
+ any following
+ locations read must
+ be no older than
+ the value read by
+ the
+ fence-paired-atomic.
+
+ 2. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ fence acquire - agent *none* 1. | ``s_wait_bvhcnt 0x0``
+ - system | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and address space is
+ local, omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ atomicrmw-no-return-value
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Must happen before
+ the following
+ ``global_inv``
+ - Ensures that the
+ fence-paired atomic
+ has completed
+ before invalidating the
+ caches. Therefore
+ any following
+ locations read must
+ be no older than
+ the value read by
+ the
+ fence-paired-atomic.
+
+ 2. ``global_inv``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ **Release Atomic**
+ ------------------------------------------------------------------------------------
+ store atomic release - singlethread - global 1. buffer/global/ds/flat_store
+ - wavefront - local
+ - generic
+ store atomic release - workgroup - global 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ store that is being
+ released.
+
+                                                                 3. buffer/global/flat_store
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ store atomic release - workgroup - local 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode or OpenCL, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - Must happen before the
+ following store.
+ - Ensures that all
+ global memory
+ operations have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 3. ds_store
+ store atomic release - agent - global 1. ``global_wb scope:``
+ - system - generic
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at agent or system
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following store.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 3. buffer/global/flat_store
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ atomicrmw release - singlethread - global 1. buffer/global/ds/flat_atomic
+ - wavefront - local
+ - generic
+ atomicrmw release - workgroup - global 1. ``global_wb scope:SCOPE_SE``
+ - generic
+ - If CU wavefront execution
+ mode, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and CU wavefront
+ execution mode, omit all.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following atomic.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. buffer/global/flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ atomicrmw release - workgroup - local 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode or OpenCL, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit all.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - Must happen before the
+ following atomic.
+ - Ensures that all
+ global memory
+ operations have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 3. ds_atomic
+ atomicrmw release - agent - global 1. ``global_wb scope:``
+ - system - generic
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at agent or system
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before the
+ following atomic.
+ - Ensures that all
+ memory operations
+ to global and local
+ have completed
+ before performing
+ the atomicrmw that
+ is being released.
+
+ 3. buffer/global/flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+ fence release - singlethread *none* *none*
+ - wavefront
+ fence release - workgroup *none* 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and
+ address space is
+ local, omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store atomic/
+ atomicrmw.
+ - Must happen before
+ any following store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ following
+ fence-paired-atomic.
+
+ fence release - agent *none* 1. ``global_wb scope:``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at agent or system
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **OpenCL:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+
+                                                                   - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - If OpenCL and address space is local,
+ omit all.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ any following store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ fence-paired-atomic).
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ following
+ fence-paired-atomic.
+
+ **Acquire-Release Atomic**
+ ------------------------------------------------------------------------------------
+ atomicrmw acq_rel - singlethread - global 1. buffer/global/ds/flat_atomic
+ - wavefront - local
+ - generic
+ atomicrmw acq_rel - workgroup - global 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - Must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``.
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return, use
+ ``th:TH_ATOMIC_RETURN``.
+
+ 4. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures any
+ following global
+ data read is no
+ older than the
+ atomicrmw value
+ being acquired.
+
+ 5. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ atomicrmw acq_rel - workgroup - local 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode or OpenCL, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - Must happen before
+ the following
+ store.
+ - Ensures that all
+ global memory
+ operations have
+ completed before
+ performing the
+ store that is being
+ released.
+
+ 3. ds_atomic
+ 4. ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit.
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures any
+ following global
+ data read is no
+ older than the local load
+ atomic value being
+ acquired.
+
+ 5. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+                                                                   - If OpenCL, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ atomicrmw acq_rel - workgroup - generic 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode or OpenCL, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+                                                                   - If OpenCL, omit ``s_wait_dscnt 0x0``.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return,
+ use ``th:TH_ATOMIC_RETURN``.
+
+ 4. | **Atomic without return:**
+ | ``s_wait_dscnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit ``s_wait_dscnt 0x0``
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures any
+ following global
+ data read is no
+ older than the load
+ atomic value being
+ acquired.
+
+ 5. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ atomicrmw acq_rel - agent - global 1. ``global_wb scope:``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at agent or system
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ to global have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. buffer/global_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return, use
+ ``th:TH_ATOMIC_RETURN``.
+
+ 4. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 5. ``global_inv scope:``
+
+ - If agent scope, ``scope:SCOPE_DEV``
+ - If system scope, ``scope:SCOPE_SYS``
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ atomicrmw acq_rel - agent - generic 1. ``global_wb scope:``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at agent or system
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+                                                                     load/load atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing the
+ atomicrmw that is
+ being released.
+
+ 3. flat_atomic
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - If atomic with return, use
+ ``th:TH_ATOMIC_RETURN``.
+
+ 4. | **Atomic with return:**
+ | ``s_wait_bvhcnt 0x0``
+                                                                    | ``s_wait_samplecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **Atomic without return:**
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``.
+ - Must happen before
+ following
+ ``global_inv``.
+ - Ensures the
+ atomicrmw has
+ completed before
+ invalidating the
+ caches.
+
+ 5. ``global_inv scope:``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data.
+
+ fence acq_rel - singlethread *none* *none*
+ - wavefront
+ fence acq_rel - workgroup *none* 1. ``global_wb scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at workgroup
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL and
+ address space is
+ not generic, omit
+ ``s_wait_dscnt 0x0``
+ - If OpenCL and
+ address space is
+ local, omit
+ all but ``s_wait_dscnt 0x0``.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store atomic/
+ atomicrmw.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures that all
+ memory operations
+ have
+ completed before
+ performing any
+ following global
+ memory operations.
+ - Ensures that the
+ preceding
+ local/generic load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ acquire-fence-paired-atomic)
+ has completed
+ before following
+ global memory
+ operations. This
+ satisfies the
+ requirements of
+ acquire.
+ - Ensures that all
+ previous memory
+ operations have
+ completed before a
+ following
+ local/generic store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ release-fence-paired-atomic).
+ This satisfies the
+ requirements of
+ release.
+ - Must happen before
+ the following
+ ``global_inv``.
+ - Ensures that the
+ acquire-fence-paired
+ atomic has completed
+ before invalidating
+ the
+ cache. Therefore
+ any following
+ locations read must
+ be no older than
+ the value read by
+ the
+ acquire-fence-paired-atomic.
+
+ 3. ``global_inv scope:SCOPE_SE``
+
+ - If CU wavefront execution
+ mode, omit.
+ - Ensures that
+ following
+ loads will not see
+ stale data.
+
+ fence acq_rel - agent *none* 1. ``global_wb scope:``
+ - system
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - In combination with the waits
+ below, ensures that all
+ memory operations
+ have completed at agent or system
+ scope before performing the
+ store that is being
+ released.
+
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL and
+ address space is
+ not generic, omit
+ ``s_wait_dscnt 0x0``
+ - If OpenCL and
+ address space is
+ local, omit
+ all but ``s_wait_dscnt 0x0``.
+ - See :ref:`amdgpu-fence-as` for
+ more details on fencing specific
+ address spaces.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ any preceding
+ global/generic
+ load/load
+ atomic/
+ atomicrmw-with-return-value.
+ - ``s_wait_storecnt 0x0``
+ must happen after
+ ``global_wb``
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ any preceding
+ local/generic
+ load/store/load
+ atomic/store
+ atomic/atomicrmw.
+ - Must happen before
+ the following
+ ``global_inv``
+ - Ensures that the
+ preceding
+ global/local/generic
+ load
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ acquire-fence-paired-atomic)
+ has completed
+ before invalidating
+ the caches. This
+ satisfies the
+ requirements of
+ acquire.
+ - Ensures that all
+ previous memory
+ operations have
+ completed before a
+ following
+ global/local/generic
+ store
+ atomic/atomicrmw
+ with an equal or
+ wider sync scope
+ and memory ordering
+ stronger than
+ unordered (this is
+ termed the
+ release-fence-paired-atomic).
+ This satisfies the
+ requirements of
+ release.
+
+ 3. ``global_inv scope:``
+
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+ - Must happen before
+ any following
+ global/generic
+ load/load
+ atomic/store/store
+ atomic/atomicrmw.
+ - Ensures that
+ following loads
+ will not see stale
+ global data. This
+ satisfies the
+ requirements of
+ acquire.
+
+ **Sequential Consistent Atomic**
+ ------------------------------------------------------------------------------------
+ load atomic seq_cst - singlethread - global *Same as corresponding
+ - wavefront - local load atomic acquire,
+ - generic except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - global 1. | ``s_wait_bvhcnt 0x0``
+ - generic | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_dscnt 0x0`` must
+ happen after
+ preceding
+ local/generic load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_dscnt 0x0``
+ and so do not need to be
+ considered.)
+ - ``s_wait_loadcnt 0x0``\,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own waits and so do
+ not need to be
+ considered.)
+ - ``s_wait_storecnt 0x0``
+ Must happen after
+ preceding
+ global/generic store
+ atomic/
+ atomicrmw-no-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_storecnt 0x0``
+ and so do not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent global/local
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ ``s_wait``\s of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+ order. The ``s_wait``\s
+ could be placed after
+ seq_store or before
+ the seq_load. We
+ choose the load to
+ make the ``s_wait``\s be
+ as late as possible
+ so that the store
+ may have already
+ completed.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+ load atomic seq_cst - workgroup - local 1. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit all.
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_loadcnt 0x0``\,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ Must happen after
+ preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait``\s and so do
+ not need to be
+ considered.)
+ - ``s_wait_storecnt 0x0``
+ Must happen after
+ preceding
+ global/generic store
+ atomic/
+ atomicrmw-no-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_storecnt 0x0``
+ and so do
+ not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent global
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ ``s_wait``\s of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+                                                                     order. The ``s_wait``\s
+ could be placed after
+ seq_store or before
+ the seq_load. We
+ choose the load to
+ make the ``s_wait``\s be
+ as late as possible
+ so that the store
+ may have already
+ completed.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+
+ load atomic seq_cst - agent - global 1. | ``s_wait_bvhcnt 0x0``
+ - system - generic | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+
+ - If OpenCL, omit
+ ``s_wait_dscnt 0x0``
+ - The waits can be
+ independently moved
+ according to the
+ following rules:
+ - ``s_wait_dscnt 0x0``
+ must happen after
+ preceding
+ local load
+ atomic/store
+ atomic/atomicrmw
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait_dscnt 0x0``
+ and so do
+ not need to be
+ considered.)
+ - ``s_wait_loadcnt 0x0``\,
+ ``s_wait_samplecnt 0x0`` and
+ ``s_wait_bvhcnt 0x0``
+ must happen after
+ preceding
+ global/generic load
+ atomic/
+ atomicrmw-with-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own ``s_wait``\s and so do
+ not need to be
+ considered.)
+ - ``s_wait_storecnt 0x0``
+ Must happen after
+ preceding
+ global/generic store
+ atomic/
+ atomicrmw-no-return-value
+ with memory
+ ordering of seq_cst
+ and with equal or
+ wider sync scope.
+ (Note that seq_cst
+ fences have their
+ own
+ ``s_wait_storecnt 0x0`` and so do
+ not need to be
+ considered.)
+ - Ensures any
+ preceding
+ sequential
+ consistent global
+ memory instructions
+ have completed
+ before executing
+ this sequentially
+ consistent
+ instruction. This
+ prevents reordering
+ a seq_cst store
+ followed by a
+ seq_cst load. (Note
+ that seq_cst is
+ stronger than
+ acquire/release as
+ the reordering of
+ load acquire
+ followed by a store
+ release is
+ prevented by the
+ ``s_wait``\s of
+ the release, but
+ there is nothing
+ preventing a store
+ release followed by
+ load acquire from
+ completing out of
+ order. The ``s_wait``\s
+ could be placed after
+ seq_store or before
+ the seq_load. We
+ choose the load to
+ make the ``s_wait``\s be
+ as late as possible
+ so that the store
+ may have already
+ completed.)
+
+ 2. *Following
+ instructions same as
+ corresponding load
+ atomic acquire,
+ except must generate
+ all instructions even
+ for OpenCL.*
+ store atomic seq_cst - singlethread - global *Same as corresponding
+ - wavefront - local store atomic release,
+ - workgroup - generic except must generate
+ - agent all instructions even
+ - system for OpenCL.*
+ atomicrmw seq_cst - singlethread - global *Same as corresponding
+ - wavefront - local atomicrmw acq_rel,
+ - workgroup - generic except must generate
+ - agent all instructions even
+ - system for OpenCL.*
+ fence seq_cst - singlethread *none* *Same as corresponding
+ - wavefront fence acq_rel,
+ - workgroup except must generate
+ - agent all instructions even
+ - system for OpenCL.*
+ ============ ============ ============== ========== ================================
+
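+The following is a minimal, illustrative sketch (the kernel and its arguments
+are hypothetical, not taken from the table above) of a system scope seq_cst
+load in LLVM IR, together with the GFX12 sequence the table prescribes for it:
+
+.. code-block:: llvm
+
+  define amdgpu_kernel void @seq_cst_load_sketch(ptr addrspace(1) %in,
+                                                 ptr addrspace(1) %out) {
+  entry:
+    ; Expected expansion (non-OpenCL), following the table:
+    ;   s_wait_bvhcnt 0x0
+    ;   s_wait_samplecnt 0x0
+    ;   s_wait_storecnt 0x0
+    ;   s_wait_loadcnt 0x0
+    ;   s_wait_dscnt 0x0
+    ;   global_load_b32 ... scope:SCOPE_SYS
+    ;   s_wait_loadcnt 0x0
+    ;   global_inv scope:SCOPE_SYS
+    %v = load atomic i32, ptr addrspace(1) %in seq_cst, align 4
+    store i32 %v, ptr addrspace(1) %out, align 4
+    ret void
+  }
+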
.. _amdgpu-amdhsa-trap-handler-abi:
Trap Handler ABI
>From 3734fe2bf98b7b6d2bae0d44a3f7db81e6fe9439 Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Fri, 19 Jul 2024 09:18:45 +0200
Subject: [PATCH 2/6] L1 is now a buffer + other small fix
---
llvm/docs/AMDGPUUsage.rst | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 816f7fb00ee98e..90a23a7f59b007 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14159,19 +14159,16 @@ For GFX12:
work-group:
* In CU wavefront execution mode, no special action is required.
- * In WGP wavefront execution mode, a ``global_inv scope:SCOPE_CU`` is required
+ * In WGP wavefront execution mode, a ``global_inv scope:SCOPE_SE`` is required
as wavefronts may be executing on SIMDs of different CUs that access different L0s.
* The scalar memory operations access a scalar L0 cache shared by all wavefronts
on a WGP. The scalar and vector L0 caches are not coherent. However, scalar
operations are used in a restricted way so do not impact the memory model. See
:ref:`amdgpu-amdhsa-memory-spaces`.
-* The vector and scalar memory L0 caches use an L1 cache shared by all WGPs on
- the same SA. Therefore, no special action is required for coherence between
- the wavefronts of a single work-group. However, a ``global_inv scope:SCOPE_DEV`` is
- required for coherence between wavefronts executing in different work-groups
- as they may be executing on different SAs that access different L1s.
-* The L1 caches have independent quadrants to service disjoint ranges of virtual
+* The vector and scalar memory L0 caches use an L1 buffer shared by all WGPs on
+  the same SA. The L1 buffer acts as a bridge to L2 for clients within an SA.
+* The L1 buffers have independent quadrants to service disjoint ranges of virtual
addresses.
* Each L0 cache has a separate request queue per L1 quadrant. Therefore, the
vector and scalar memory operations performed by different wavefronts, whether
@@ -14188,7 +14185,7 @@ For GFX12:
* ``s_wait_bvhcnt 0x0``
* ``s_wait_storecnt 0x0``
-* The L1 caches use an L2 cache shared by all SAs on the same agent.
+* The L1 buffers use an L2 cache shared by all SAs on the same agent.
* The L2 cache has independent channels to service disjoint ranges of virtual
addresses.
* Each L1 quadrant of a single SA accesses a different L2 channel. Each L1
@@ -14223,7 +14220,7 @@ may change between kernel dispatch executions. See
For kernarg backing memory:
-* CP invalidates the L0 and L1 caches at the start of each kernel dispatch.
+* CP invalidates caches start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
@@ -14232,7 +14229,7 @@ For kernarg backing memory:
Scratch backing memory (which is used for the private address space) is accessed
with MTYPE NC (non-coherent). Since the private address space is only accessed
by a single thread, and is always write-before-read, there is never a need to
-invalidate these entries from the L0 or L1 caches.
+invalidate these entries from L0.
Wavefronts can be executed in WGP or CU wavefront execution mode:
>From ae1aa5b9f2d0692ca6b97d50c5b8ced5fed9b7de Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Fri, 2 Aug 2024 08:42:22 +0200
Subject: [PATCH 3/6] Try to make the synscope table clearer
---
llvm/docs/AMDGPUUsage.rst | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 90a23a7f59b007..5193744869c792 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14251,6 +14251,13 @@ See ``WGP_MODE`` field in
The code sequences used to implement the memory model for GFX12 are defined in
table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
+The mapping of LLVM IR syncscope to GFX12 instruction ``scope`` operands is
+defined in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
+
+The table applies if and only if it is directly referenced by an entry in
+:ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`, and it only applies to
+the instruction in the code sequence that references the table.
+
.. table:: AMDHSA Memory Model Code Sequences GFX12 - Instruction Scopes
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table
@@ -14273,9 +14280,6 @@ table :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
singlethread-one-as *none* *none*
=================== =================== ===================
-NOTE: The table above applies if and only if it is explicitly referenced by
-a code sequence in :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-table`.
-
.. table:: AMDHSA Memory Model Code Sequences GFX12
:name: amdgpu-amdhsa-memory-model-code-sequences-gfx12-table
>From 6ccf48a4e052d17b99a7b75ccd7432942726c717 Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Mon, 19 Aug 2024 10:11:24 +0200
Subject: [PATCH 4/6] Fix waitcnts
---
llvm/docs/AMDGPUUsage.rst | 83 ++++++++++++++-------------------------
1 file changed, 29 insertions(+), 54 deletions(-)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 5193744869c792..27b8afeae6bcea 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14304,9 +14304,7 @@ the instruction in the code sequence that references the table.
1. buffer/global/flat_load
``scope:SCOPE_SYS``
- 2. ``s_wait_bvhcnt 0x0``
- ``s_wait_samplecnt 0x0``
- ``s_wait_loadcnt 0x0``
+ 2. ``s_wait_loadcnt 0x0``
- Must happen before
any following volatile
@@ -14390,9 +14388,7 @@ the instruction in the code sequence that references the table.
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
- | ``s_wait_loadcnt 0x0``
+ 2. ``s_wait_loadcnt 0x0``
- If CU wavefront execution
mode, omit.
@@ -14439,13 +14435,11 @@ the instruction in the code sequence that references the table.
loads will not see
stale data.
- load atomic acquire - workgroup - generic 1. flat_load ``scope:SCOPE_SE``
+ load atomic acquire - workgroup - generic 1. flat_load
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
- | ``s_wait_loadcnt 0x0``
+ 2. | ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
| **CU wavefront execution mode:**
| ``s_wait_dscnt 0x0``
@@ -14478,9 +14472,7 @@ the instruction in the code sequence that references the table.
- system
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
- | ``s_wait_loadcnt 0x0``
+ 2. ``s_wait_loadcnt 0x0``
- Must happen before
following
@@ -14490,7 +14482,7 @@ the instruction in the code sequence that references the table.
before invalidating
the caches.
- 3. ``global_inv scope:``
+ 3. ``global_inv``
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- Must happen before
@@ -14507,9 +14499,7 @@ the instruction in the code sequence that references the table.
- system
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
- | ``s_wait_loadcnt 0x0``
+ 2. | ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_dscnt 0x0``
@@ -14521,7 +14511,7 @@ the instruction in the code sequence that references the table.
before invalidating
the caches.
- 3. ``global_inv scope:``
+ 3. ``global_inv``
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- Must happen before
@@ -14544,8 +14534,6 @@ the instruction in the code sequence that references the table.
use ``th:TH_ATOMIC_RETURN``
2. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| **Atomic without return:**
| ``s_wait_storecnt 0x0``
@@ -14600,12 +14588,11 @@ the instruction in the code sequence that references the table.
use ``th:TH_ATOMIC_RETURN``
2. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
| **Atomic without return:**
| ``s_wait_storecnt 0x0``
+ | ``s_wait_dscnt 0x0``
- If CU wavefront execution mode,
omit all for atomics without
@@ -14639,8 +14626,6 @@ the instruction in the code sequence that references the table.
use ``th:TH_ATOMIC_RETURN``
2. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| **Atomic without return:**
| ``s_wait_storecnt 0x0``
@@ -14653,7 +14638,7 @@ the instruction in the code sequence that references the table.
invalidating the
caches.
- 3. ``global_inv scope:``
+ 3. ``global_inv``
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- Must happen before
@@ -14673,8 +14658,6 @@ the instruction in the code sequence that references the table.
use ``th:TH_ATOMIC_RETURN``
2. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
| **Atomic without return:**
@@ -14691,7 +14674,7 @@ the instruction in the code sequence that references the table.
invalidating the
caches.
- 3. ``global_inv scope:``
+ 3. ``global_inv``
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- Must happen before
@@ -14883,13 +14866,13 @@ the instruction in the code sequence that references the table.
store that is being
released.
- 2. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
- | ``s_wait_storecnt 0x0``
- | ``s_wait_loadcnt 0x0``
- | ``s_wait_dscnt 0x0``
- | **CU wavefront execution mode:**
- | ``s_wait_dscnt 0x0``
+ 2. | ``s_wait_bvhcnt 0x0``
+ | ``s_wait_samplecnt 0x0``
+ | ``s_wait_storecnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
+ | **CU wavefront execution mode:**
+ | ``s_wait_dscnt 0x0``
- If OpenCL, omit ``s_wait_dscnt 0x0``.
- The waits can be
@@ -14922,7 +14905,7 @@ the instruction in the code sequence that references the table.
store that is being
released.
- 2. buffer/global/flat_store
+ 3. buffer/global/flat_store
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
@@ -14973,7 +14956,7 @@ the instruction in the code sequence that references the table.
released.
3. ds_store
- store atomic release - agent - global 1. ``global_wb scope:``
+ store atomic release - agent - global 1. ``global_wb``
- system - generic
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- In combination with the waits
@@ -15261,7 +15244,7 @@ the instruction in the code sequence that references the table.
following
fence-paired-atomic.
- fence release - agent *none* 1. ``global_wb scope:``
+ fence release - agent *none* 1. ``global_wb``
- system
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- In combination with the waits
@@ -15401,8 +15384,6 @@ the instruction in the code sequence that references the table.
``th:TH_ATOMIC_RETURN``.
4. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| **Atomic without return:**
| ``s_wait_storecnt 0x0``
@@ -15443,7 +15424,8 @@ the instruction in the code sequence that references the table.
2. | ``s_wait_bvhcnt 0x0``
| ``s_wait_samplecnt 0x0``
| ``s_wait_storecnt 0x0``
- | ``s_wait_loadcnt_dscnt 0x0``
+ | ``s_wait_loadcnt 0x0``
+ | ``s_wait_dscnt 0x0``
| **CU wavefront execution mode:**
| ``s_wait_dscnt 0x0``
@@ -15562,8 +15544,6 @@ the instruction in the code sequence that references the table.
| ``s_wait_dscnt 0x0``
| ``s_wait_storecnt 0x0``
| **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
| **CU wavefront execution mode:**
@@ -15589,7 +15569,7 @@ the instruction in the code sequence that references the table.
loads will not see
stale data.
- atomicrmw acq_rel - agent - global 1. ``global_wb scope:``
+ atomicrmw acq_rel - agent - global 1. ``global_wb``
- system
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- In combination with the waits
@@ -15648,8 +15628,6 @@ the instruction in the code sequence that references the table.
``th:TH_ATOMIC_RETURN``.
4. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
| ``s_wait_loadcnt 0x0``
| **Atomic without return:**
| ``s_wait_storecnt 0x0``
@@ -15663,10 +15641,9 @@ the instruction in the code sequence that references the table.
invalidating the
caches.
- 5. ``global_inv scope:``
+ 5. ``global_inv``
- - If agent scope, ``scope:SCOPE_DEV``
- - If system scope, ``scope:SCOPE_SYS``
+ - Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- Must happen before
any following
global/generic
@@ -15677,7 +15654,7 @@ the instruction in the code sequence that references the table.
will not see stale
global data.
- atomicrmw acq_rel - agent - generic 1. ``global_wb scope:``
+ atomicrmw acq_rel - agent - generic 1. ``global_wb``
- system
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- In combination with the waits
@@ -15736,8 +15713,6 @@ the instruction in the code sequence that references the table.
``th:TH_ATOMIC_RETURN``.
4. | **Atomic with return:**
- | ``s_wait_bvhcnt 0x0``
- | ``s_wait_sampleecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
| **Atomic without return:**
@@ -15756,7 +15731,7 @@ the instruction in the code sequence that references the table.
invalidating the
caches.
- 5. ``global_inv scope:``
+ 5. ``global_inv``
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- Must happen before
@@ -15898,7 +15873,7 @@ the instruction in the code sequence that references the table.
loads will not see
stale data.
- fence acq_rel - agent *none* 1. ``global_wb scope:``
+ fence acq_rel - agent *none* 1. ``global_wb``
- system
- Apply :ref:`amdgpu-amdhsa-memory-model-code-sequences-gfx12-scopes-table`.
- In combination with the waits
>From 0ade58aaae5fef9655f7f48e82bd4f3f7a30633a Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Mon, 19 Aug 2024 10:35:36 +0200
Subject: [PATCH 5/6] additional fixes
---
llvm/docs/AMDGPUUsage.rst | 30 +++++++++++++++++-------------
1 file changed, 17 insertions(+), 13 deletions(-)
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 27b8afeae6bcea..e35a81daa01b2e 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -14220,7 +14220,7 @@ may change between kernel dispatch executions. See
For kernarg backing memory:
-* CP invalidates caches start of each kernel dispatch.
+* CP invalidates caches at the start of each kernel dispatch.
* On dGPU the kernarg backing memory is accessed as MTYPE UC (uncached) to avoid
needing to invalidate the L2 cache.
* On APU the kernarg backing memory is accessed as MTYPE CC (cache coherent) and
@@ -14689,9 +14689,7 @@ the instruction in the code sequence that references the table.
fence acquire - singlethread *none* *none*
- wavefront
- fence acquire - workgroup *none* 1. | ``s_wait_bvhcnt 0x0``
- | ``s_wait_samplecnt 0x0``
- | ``s_wait_storecnt 0x0``
+ fence acquire - workgroup *none* 1. | ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
| **CU wavefront execution mode:**
@@ -14703,13 +14701,17 @@ the instruction in the code sequence that references the table.
- See :ref:`amdgpu-fence-as` for
more details on fencing specific
address spaces.
+ - Note: we don't have to use
+ ``s_wait_samplecnt 0x0`` or
+ ``s_wait_bvhcnt 0x0`` because
+ there are no atomic sample or
+ BVH instructions that the fence
+ could pair with.
- The waits can be
independently moved
according to the
following rules:
- - ``s_wait_loadcnt 0x0``,
- ``s_wait_samplecnt 0x0`` and
- ``s_wait_bvhcnt 0x0``
+ - ``s_wait_loadcnt 0x0``
must happen after
any preceding
global/generic load
@@ -14771,9 +14773,7 @@ the instruction in the code sequence that references the table.
loads will not see
stale data.
- fence acquire - agent *none* 1. | ``s_wait_bvhcnt 0x0``
- - system | ``s_wait_samplecnt 0x0``
- | ``s_wait_storecnt 0x0``
+ fence acquire - agent *none* 1. | ``s_wait_storecnt 0x0``
| ``s_wait_loadcnt 0x0``
| ``s_wait_dscnt 0x0``
@@ -14783,13 +14783,17 @@ the instruction in the code sequence that references the table.
- See :ref:`amdgpu-fence-as` for
more details on fencing specific
address spaces.
+ - Note: we don't have to use
+ ``s_wait_samplecnt 0x0`` or
+ ``s_wait_bvhcnt 0x0`` because
+ there are no atomic sample or
+ BVH instructions that the fence
+ could pair with.
- The waits can be
independently moved
according to the
following rules:
- - ``s_wait_loadcnt 0x0``,
- ``s_wait_samplecnt 0x0`` and
- ``s_wait_bvhcnt 0x0``
+ - ``s_wait_loadcnt 0x0``
must happen after
any preceding
global/generic load
>From d88044985670a81ed307ce8e3a4b622dea650d75 Mon Sep 17 00:00:00 2001
From: pvanhout <pierre.vanhoutryve at amd.com>
Date: Mon, 19 Aug 2024 12:10:43 +0200
Subject: [PATCH 6/6] Add code changes
---
llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp | 176 +++++++++---------
.../memory-legalizer-fence-mmra-global.ll | 20 --
.../CodeGen/AMDGPU/memory-legalizer-fence.ll | 20 --
.../AMDGPU/memory-legalizer-flat-agent.ll | 40 ----
.../AMDGPU/memory-legalizer-flat-system.ll | 40 ----
.../AMDGPU/memory-legalizer-flat-volatile.ll | 2 -
.../AMDGPU/memory-legalizer-flat-workgroup.ll | 20 --
.../AMDGPU/memory-legalizer-global-agent.ll | 40 ----
.../AMDGPU/memory-legalizer-global-system.ll | 40 ----
.../memory-legalizer-global-volatile.ll | 2 -
.../memory-legalizer-global-workgroup.ll | 20 --
11 files changed, 89 insertions(+), 331 deletions(-)
diff --git a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
index bd4203ccd6fe4e..1acc4cedd026a9 100644
--- a/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
+++ b/llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp
@@ -330,12 +330,10 @@ class SICacheControl {
/// observed by other memory instructions executing in memory scope \p Scope.
/// \p IsCrossAddrSpaceOrdering indicates if the memory ordering is between
/// address spaces. Returns true iff any instructions inserted.
- virtual bool insertWait(MachineBasicBlock::iterator &MI,
- SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
- bool IsCrossAddrSpaceOrdering,
- Position Pos) const = 0;
+ virtual bool insertWait(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
+ bool IsCrossAddrSpaceOrdering, Position Pos,
+ AtomicOrdering Order) const = 0;
/// Inserts any necessary instructions at position \p Pos relative to
/// instruction \p MI to ensure any subsequent memory instructions of this
@@ -404,12 +402,10 @@ class SIGfx6CacheControl : public SICacheControl {
bool IsVolatile, bool IsNonTemporal,
bool IsLastUse) const override;
- bool insertWait(MachineBasicBlock::iterator &MI,
- SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
- bool IsCrossAddrSpaceOrdering,
- Position Pos) const override;
+ bool insertWait(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
+ bool IsCrossAddrSpaceOrdering, Position Pos,
+ AtomicOrdering Order) const override;
bool insertAcquire(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
@@ -457,12 +453,10 @@ class SIGfx90ACacheControl : public SIGfx7CacheControl {
bool IsVolatile, bool IsNonTemporal,
bool IsLastUse) const override;
- bool insertWait(MachineBasicBlock::iterator &MI,
- SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
- bool IsCrossAddrSpaceOrdering,
- Position Pos) const override;
+ bool insertWait(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
+ bool IsCrossAddrSpaceOrdering, Position Pos,
+ AtomicOrdering Order) const override;
bool insertAcquire(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
@@ -562,12 +556,10 @@ class SIGfx10CacheControl : public SIGfx7CacheControl {
bool IsVolatile, bool IsNonTemporal,
bool IsLastUse) const override;
- bool insertWait(MachineBasicBlock::iterator &MI,
- SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
- bool IsCrossAddrSpaceOrdering,
- Position Pos) const override;
+ bool insertWait(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
+ bool IsCrossAddrSpaceOrdering, Position Pos,
+ AtomicOrdering Order) const override;
bool insertAcquire(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
@@ -617,7 +609,8 @@ class SIGfx12CacheControl : public SIGfx11CacheControl {
bool insertWait(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace, SIMemOp Op,
- bool IsCrossAddrSpaceOrdering, Position Pos) const override;
+ bool IsCrossAddrSpaceOrdering, Position Pos,
+ AtomicOrdering Order) const override;
bool insertAcquire(MachineBasicBlock::iterator &MI, SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace, Position Pos) const override;
@@ -1072,7 +1065,7 @@ bool SIGfx6CacheControl::enableVolatileAndOrNonTemporal(
// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
- Position::AFTER);
+ Position::AFTER, AtomicOrdering::Unordered);
return Changed;
}
@@ -1090,10 +1083,9 @@ bool SIGfx6CacheControl::enableVolatileAndOrNonTemporal(
bool SIGfx6CacheControl::insertWait(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
- bool IsCrossAddrSpaceOrdering,
- Position Pos) const {
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
+ bool IsCrossAddrSpaceOrdering, Position Pos,
+ AtomicOrdering Order) const {
bool Changed = false;
MachineBasicBlock &MBB = *MI->getParent();
@@ -1237,7 +1229,7 @@ bool SIGfx6CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
bool IsCrossAddrSpaceOrdering,
Position Pos) const {
return insertWait(MI, Scope, AddrSpace, SIMemOp::LOAD | SIMemOp::STORE,
- IsCrossAddrSpaceOrdering, Pos);
+ IsCrossAddrSpaceOrdering, Pos, AtomicOrdering::Release);
}
bool SIGfx7CacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1425,7 +1417,7 @@ bool SIGfx90ACacheControl::enableVolatileAndOrNonTemporal(
// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
- Position::AFTER);
+ Position::AFTER, AtomicOrdering::Unordered);
return Changed;
}
@@ -1443,10 +1435,10 @@ bool SIGfx90ACacheControl::enableVolatileAndOrNonTemporal(
bool SIGfx90ACacheControl::insertWait(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
bool IsCrossAddrSpaceOrdering,
- Position Pos) const {
+ Position Pos,
+ AtomicOrdering Order) const {
if (ST.isTgSplitEnabled()) {
// In threadgroup split mode the waves of a work-group can be executing on
// different CUs. Therefore need to wait for global or GDS memory operations
@@ -1466,7 +1458,7 @@ bool SIGfx90ACacheControl::insertWait(MachineBasicBlock::iterator &MI,
AddrSpace &= ~SIAtomicAddrSpace::LDS;
}
return SIGfx7CacheControl::insertWait(MI, Scope, AddrSpace, Op,
- IsCrossAddrSpaceOrdering, Pos);
+ IsCrossAddrSpaceOrdering, Pos, Order);
}
bool SIGfx90ACacheControl::insertAcquire(MachineBasicBlock::iterator &MI,
@@ -1725,7 +1717,7 @@ bool SIGfx940CacheControl::enableVolatileAndOrNonTemporal(
// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
- Position::AFTER);
+ Position::AFTER, AtomicOrdering::Unordered);
return Changed;
}
@@ -1882,7 +1874,7 @@ bool SIGfx940CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
// Ensure the necessary S_WAITCNT needed by any "BUFFER_WBL2" as well as other
// S_WAITCNT needed.
Changed |= insertWait(MI, Scope, AddrSpace, SIMemOp::LOAD | SIMemOp::STORE,
- IsCrossAddrSpaceOrdering, Pos);
+ IsCrossAddrSpaceOrdering, Pos, AtomicOrdering::Release);
return Changed;
}
@@ -1962,7 +1954,7 @@ bool SIGfx10CacheControl::enableVolatileAndOrNonTemporal(
// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
- Position::AFTER);
+ Position::AFTER, AtomicOrdering::Unordered);
return Changed;
}
@@ -1983,10 +1975,9 @@ bool SIGfx10CacheControl::enableVolatileAndOrNonTemporal(
bool SIGfx10CacheControl::insertWait(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
- SIAtomicAddrSpace AddrSpace,
- SIMemOp Op,
+ SIAtomicAddrSpace AddrSpace, SIMemOp Op,
bool IsCrossAddrSpaceOrdering,
- Position Pos) const {
+ Position Pos, AtomicOrdering Order) const {
bool Changed = false;
MachineBasicBlock &MBB = *MI->getParent();
@@ -2234,7 +2225,7 @@ bool SIGfx11CacheControl::enableVolatileAndOrNonTemporal(
// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
- Position::AFTER);
+ Position::AFTER, AtomicOrdering::Unordered);
return Changed;
}
@@ -2305,7 +2296,7 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
SIAtomicScope Scope,
SIAtomicAddrSpace AddrSpace, SIMemOp Op,
bool IsCrossAddrSpaceOrdering,
- Position Pos) const {
+ Position Pos, AtomicOrdering Order) const {
bool Changed = false;
MachineBasicBlock &MBB = *MI->getParent();
@@ -2375,8 +2366,21 @@ bool SIGfx12CacheControl::insertWait(MachineBasicBlock::iterator &MI,
}
if (LOADCnt) {
- BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_BVHCNT_soft)).addImm(0);
- BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
+ // Acquire sequences only need to wait on the previous atomic operation.
+ // e.g. a typical sequence looks like
+ // atomic load
+ // (wait)
+ // global_inv
+ //
+ // We do not have BVH or SAMPLE atomics, so the atomic load is always going
+ // to be tracked using loadcnt.
+ //
+    // This also applies to fences: a fence cannot pair with an instruction
+    // tracked by bvh/samplecnt, as no atomic operations use those counters.
+ if (Order != AtomicOrdering::Acquire) {
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_BVHCNT_soft)).addImm(0);
+ BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_SAMPLECNT_soft)).addImm(0);
+ }
BuildMI(MBB, MI, DL, TII->get(AMDGPU::S_WAIT_LOADCNT_soft)).addImm(0);
Changed = true;
}
@@ -2514,7 +2518,7 @@ bool SIGfx12CacheControl::insertRelease(MachineBasicBlock::iterator &MI,
// complete, whether we inserted a WB or not. If we inserted a WB (storecnt),
// we of course need to wait for that as well.
insertWait(MI, Scope, AddrSpace, SIMemOp::LOAD | SIMemOp::STORE,
- IsCrossAddrSpaceOrdering, Pos);
+ IsCrossAddrSpaceOrdering, Pos, AtomicOrdering::Release);
return true;
}
@@ -2554,7 +2558,7 @@ bool SIGfx12CacheControl::enableVolatileAndOrNonTemporal(
// observable outside the program, so no need to cause a waitcnt for LDS
// address space operations.
Changed |= insertWait(MI, SIAtomicScope::SYSTEM, AddrSpace, Op, false,
- Position::AFTER);
+ Position::AFTER, AtomicOrdering::Unordered);
}
return Changed;
@@ -2625,27 +2629,25 @@ bool SIMemoryLegalizer::expandLoad(const SIMemOpInfo &MOI,
bool Changed = false;
if (MOI.isAtomic()) {
- if (MOI.getOrdering() == AtomicOrdering::Monotonic ||
- MOI.getOrdering() == AtomicOrdering::Acquire ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent) {
+ const AtomicOrdering Order = MOI.getOrdering();
+ if (Order == AtomicOrdering::Monotonic ||
+ Order == AtomicOrdering::Acquire ||
+ Order == AtomicOrdering::SequentiallyConsistent) {
Changed |= CC->enableLoadCacheBypass(MI, MOI.getScope(),
MOI.getOrderingAddrSpace());
}
- if (MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent)
- Changed |= CC->insertWait(MI, MOI.getScope(),
- MOI.getOrderingAddrSpace(),
+ if (Order == AtomicOrdering::SequentiallyConsistent)
+ Changed |= CC->insertWait(MI, MOI.getScope(), MOI.getOrderingAddrSpace(),
SIMemOp::LOAD | SIMemOp::STORE,
MOI.getIsCrossAddressSpaceOrdering(),
- Position::BEFORE);
+ Position::BEFORE, Order);
- if (MOI.getOrdering() == AtomicOrdering::Acquire ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent) {
- Changed |= CC->insertWait(MI, MOI.getScope(),
- MOI.getInstrAddrSpace(),
- SIMemOp::LOAD,
- MOI.getIsCrossAddressSpaceOrdering(),
- Position::AFTER);
+ if (Order == AtomicOrdering::Acquire ||
+ Order == AtomicOrdering::SequentiallyConsistent) {
+ Changed |= CC->insertWait(
+ MI, MOI.getScope(), MOI.getInstrAddrSpace(), SIMemOp::LOAD,
+ MOI.getIsCrossAddressSpaceOrdering(), Position::AFTER, Order);
Changed |= CC->insertAcquire(MI, MOI.getScope(),
MOI.getOrderingAddrSpace(),
Position::AFTER);
@@ -2715,14 +2717,16 @@ bool SIMemoryLegalizer::expandAtomicFence(const SIMemOpInfo &MOI,
getFenceAddrSpaceMMRA(*MI, MOI.getOrderingAddrSpace());
if (MOI.isAtomic()) {
- if (MOI.getOrdering() == AtomicOrdering::Acquire)
+ const AtomicOrdering Order = MOI.getOrdering();
+ if (Order == AtomicOrdering::Acquire) {
Changed |= CC->insertWait(
MI, MOI.getScope(), OrderingAddrSpace, SIMemOp::LOAD | SIMemOp::STORE,
- MOI.getIsCrossAddressSpaceOrdering(), Position::BEFORE);
+ MOI.getIsCrossAddressSpaceOrdering(), Position::BEFORE, Order);
+ }
- if (MOI.getOrdering() == AtomicOrdering::Release ||
- MOI.getOrdering() == AtomicOrdering::AcquireRelease ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent)
+ if (Order == AtomicOrdering::Release ||
+ Order == AtomicOrdering::AcquireRelease ||
+ Order == AtomicOrdering::SequentiallyConsistent)
/// TODO: This relies on a barrier always generating a waitcnt
/// for LDS to ensure it is not reordered with the completion of
/// the proceeding LDS operations. If barrier had a memory
@@ -2739,9 +2743,9 @@ bool SIMemoryLegalizer::expandAtomicFence(const SIMemOpInfo &MOI,
// reorganizing this code or as part of optimizing SIInsertWaitcnt pass to
// track cache invalidate and write back instructions.
- if (MOI.getOrdering() == AtomicOrdering::Acquire ||
- MOI.getOrdering() == AtomicOrdering::AcquireRelease ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent)
+ if (Order == AtomicOrdering::Acquire ||
+ Order == AtomicOrdering::AcquireRelease ||
+ Order == AtomicOrdering::SequentiallyConsistent)
Changed |= CC->insertAcquire(MI, MOI.getScope(), OrderingAddrSpace,
Position::BEFORE);
@@ -2758,35 +2762,33 @@ bool SIMemoryLegalizer::expandAtomicCmpxchgOrRmw(const SIMemOpInfo &MOI,
bool Changed = false;
if (MOI.isAtomic()) {
- if (MOI.getOrdering() == AtomicOrdering::Monotonic ||
- MOI.getOrdering() == AtomicOrdering::Acquire ||
- MOI.getOrdering() == AtomicOrdering::Release ||
- MOI.getOrdering() == AtomicOrdering::AcquireRelease ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent) {
+ const AtomicOrdering Order = MOI.getOrdering();
+ if (Order == AtomicOrdering::Monotonic ||
+ Order == AtomicOrdering::Acquire || Order == AtomicOrdering::Release ||
+ Order == AtomicOrdering::AcquireRelease ||
+ Order == AtomicOrdering::SequentiallyConsistent) {
Changed |= CC->enableRMWCacheBypass(MI, MOI.getScope(),
MOI.getInstrAddrSpace());
}
- if (MOI.getOrdering() == AtomicOrdering::Release ||
- MOI.getOrdering() == AtomicOrdering::AcquireRelease ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent ||
+ if (Order == AtomicOrdering::Release ||
+ Order == AtomicOrdering::AcquireRelease ||
+ Order == AtomicOrdering::SequentiallyConsistent ||
MOI.getFailureOrdering() == AtomicOrdering::SequentiallyConsistent)
Changed |= CC->insertRelease(MI, MOI.getScope(),
MOI.getOrderingAddrSpace(),
MOI.getIsCrossAddressSpaceOrdering(),
Position::BEFORE);
- if (MOI.getOrdering() == AtomicOrdering::Acquire ||
- MOI.getOrdering() == AtomicOrdering::AcquireRelease ||
- MOI.getOrdering() == AtomicOrdering::SequentiallyConsistent ||
+ if (Order == AtomicOrdering::Acquire ||
+ Order == AtomicOrdering::AcquireRelease ||
+ Order == AtomicOrdering::SequentiallyConsistent ||
MOI.getFailureOrdering() == AtomicOrdering::Acquire ||
MOI.getFailureOrdering() == AtomicOrdering::SequentiallyConsistent) {
- Changed |= CC->insertWait(MI, MOI.getScope(),
- MOI.getInstrAddrSpace(),
- isAtomicRet(*MI) ? SIMemOp::LOAD :
- SIMemOp::STORE,
- MOI.getIsCrossAddressSpaceOrdering(),
- Position::AFTER);
+ Changed |= CC->insertWait(
+ MI, MOI.getScope(), MOI.getInstrAddrSpace(),
+ isAtomicRet(*MI) ? SIMemOp::LOAD : SIMemOp::STORE,
+ MOI.getIsCrossAddressSpaceOrdering(), Position::AFTER, Order);
Changed |= CC->insertAcquire(MI, MOI.getScope(),
MOI.getOrderingAddrSpace(),
Position::AFTER);
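A reduced sketch of the user-visible effect (the kernel is hypothetical; the
expected output mirrors the updated tests below): with AtomicOrdering threaded
through insertWait, an acquire fence no longer waits on bvhcnt/samplecnt:

  ; llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1200
  define amdgpu_kernel void @acquire_fence_sketch() {
  entry:
    ; GFX12-WGP output, per memory-legalizer-fence.ll below:
    ;   s_wait_storecnt 0x0
    ;   s_wait_loadcnt_dscnt 0x0
    ;   global_inv scope:SCOPE_DEV
    fence syncscope("agent") acquire
    ret void
  }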
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll
index 6b7a6fb27fadfa..b8fa35092baf8a 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence-mmra-global.ll
@@ -70,8 +70,6 @@ define amdgpu_kernel void @workgroup_acquire_fence() {
;
; GFX12-WGP-LABEL: workgroup_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
@@ -356,8 +354,6 @@ define amdgpu_kernel void @workgroup_one_as_acquire_fence() {
;
; GFX12-WGP-LABEL: workgroup_one_as_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
@@ -661,8 +657,6 @@ define amdgpu_kernel void @agent_acquire_fence() {
;
; GFX12-WGP-LABEL: agent_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
@@ -670,8 +664,6 @@ define amdgpu_kernel void @agent_acquire_fence() {
;
; GFX12-CU-LABEL: agent_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
@@ -1041,8 +1033,6 @@ define amdgpu_kernel void @agent_one_as_acquire_fence() {
;
; GFX12-WGP-LABEL: agent_one_as_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
@@ -1050,8 +1040,6 @@ define amdgpu_kernel void @agent_one_as_acquire_fence() {
;
; GFX12-CU-LABEL: agent_one_as_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
@@ -1423,8 +1411,6 @@ define amdgpu_kernel void @system_acquire_fence() {
;
; GFX12-WGP-LABEL: system_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
@@ -1432,8 +1418,6 @@ define amdgpu_kernel void @system_acquire_fence() {
;
; GFX12-CU-LABEL: system_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
@@ -1815,8 +1799,6 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
;
; GFX12-WGP-LABEL: system_one_as_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
@@ -1824,8 +1806,6 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
;
; GFX12-CU-LABEL: system_one_as_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
index af7c66a2bd2cd9..ea1b8ceb94f11a 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-fence.ll
@@ -989,8 +989,6 @@ define amdgpu_kernel void @workgroup_acquire_fence() {
;
; GFX12-WGP-LABEL: workgroup_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
@@ -1300,8 +1298,6 @@ define amdgpu_kernel void @workgroup_one_as_acquire_fence() {
;
; GFX12-WGP-LABEL: workgroup_one_as_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
@@ -1605,8 +1601,6 @@ define amdgpu_kernel void @agent_acquire_fence() {
;
; GFX12-WGP-LABEL: agent_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
@@ -1614,8 +1608,6 @@ define amdgpu_kernel void @agent_acquire_fence() {
;
; GFX12-CU-LABEL: agent_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
@@ -1985,8 +1977,6 @@ define amdgpu_kernel void @agent_one_as_acquire_fence() {
;
; GFX12-WGP-LABEL: agent_one_as_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
@@ -1994,8 +1984,6 @@ define amdgpu_kernel void @agent_one_as_acquire_fence() {
;
; GFX12-CU-LABEL: agent_one_as_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
@@ -2367,8 +2355,6 @@ define amdgpu_kernel void @system_acquire_fence() {
;
; GFX12-WGP-LABEL: system_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
@@ -2376,8 +2362,6 @@ define amdgpu_kernel void @system_acquire_fence() {
;
; GFX12-CU-LABEL: system_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
@@ -2759,8 +2743,6 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
;
; GFX12-WGP-LABEL: system_one_as_acquire_fence:
; GFX12-WGP: ; %bb.0: ; %entry
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
@@ -2768,8 +2750,6 @@ define amdgpu_kernel void @system_one_as_acquire_fence() {
;
; GFX12-CU-LABEL: system_one_as_acquire_fence:
; GFX12-CU: ; %bb.0: ; %entry
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll
index 45e8b3bcff13c5..fe214703cb0f10 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-agent.ll
@@ -553,8 +553,6 @@ define amdgpu_kernel void @flat_agent_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -571,8 +569,6 @@ define amdgpu_kernel void @flat_agent_acquire_load(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s2
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s3
; GFX12-CU-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -2526,8 +2522,6 @@ define amdgpu_kernel void @flat_agent_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, s2
; GFX12-WGP-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -2544,8 +2538,6 @@ define amdgpu_kernel void @flat_agent_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: v_mov_b32_e32 v2, s2
; GFX12-CU-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -7698,8 +7690,6 @@ define amdgpu_kernel void @flat_agent_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -7721,8 +7711,6 @@ define amdgpu_kernel void @flat_agent_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -9217,8 +9205,6 @@ define amdgpu_kernel void @flat_agent_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -9240,8 +9226,6 @@ define amdgpu_kernel void @flat_agent_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -10802,8 +10786,6 @@ define amdgpu_kernel void @flat_agent_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -10830,8 +10812,6 @@ define amdgpu_kernel void @flat_agent_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -12350,8 +12330,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -12369,8 +12347,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_load(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s2
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s3
; GFX12-CU-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -14331,8 +14307,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, s2
; GFX12-WGP-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -14350,8 +14324,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: v_mov_b32_e32 v2, s2
; GFX12-CU-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -19481,8 +19453,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -19505,8 +19475,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -21040,8 +21008,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -21064,8 +21030,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -22675,8 +22639,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -22704,8 +22666,6 @@ define amdgpu_kernel void @flat_agent_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
index e77f1432c1c9d0..22a1bb4e694724 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-system.ll
@@ -555,8 +555,6 @@ define amdgpu_kernel void @flat_system_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -573,8 +571,6 @@ define amdgpu_kernel void @flat_system_acquire_load(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s2
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s3
; GFX12-CU-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -2548,8 +2544,6 @@ define amdgpu_kernel void @flat_system_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, s2
; GFX12-WGP-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -2566,8 +2560,6 @@ define amdgpu_kernel void @flat_system_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: v_mov_b32_e32 v2, s2
; GFX12-CU-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -7778,8 +7770,6 @@ define amdgpu_kernel void @flat_system_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -7801,8 +7791,6 @@ define amdgpu_kernel void @flat_system_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -9311,8 +9299,6 @@ define amdgpu_kernel void @flat_system_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -9334,8 +9320,6 @@ define amdgpu_kernel void @flat_system_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -10916,8 +10900,6 @@ define amdgpu_kernel void @flat_system_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -10944,8 +10926,6 @@ define amdgpu_kernel void @flat_system_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -12478,8 +12458,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -12497,8 +12475,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_load(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s2
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s3
; GFX12-CU-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -14479,8 +14455,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, s2
; GFX12-WGP-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -14498,8 +14472,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: v_mov_b32_e32 v2, s2
; GFX12-CU-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -19687,8 +19659,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -19711,8 +19681,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -21260,8 +21228,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -21284,8 +21250,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s1
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
@@ -22915,8 +22879,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -22944,8 +22906,6 @@ define amdgpu_kernel void @flat_system_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: v_mov_b32_e32 v0, s0
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll
index 6bf54ccabc9dad..ab4eb0f3dce1fc 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-volatile.ll
@@ -902,8 +902,6 @@ define amdgpu_kernel void @flat_volatile_workgroup_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll
index 8949e4b782f630..953b35cb5002aa 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-flat-workgroup.ll
@@ -550,8 +550,6 @@ define amdgpu_kernel void @flat_workgroup_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -2432,8 +2430,6 @@ define amdgpu_kernel void @flat_workgroup_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, s2
; GFX12-WGP-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -6257,8 +6253,6 @@ define amdgpu_kernel void @flat_workgroup_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -7732,8 +7726,6 @@ define amdgpu_kernel void @flat_workgroup_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -9256,8 +9248,6 @@ define amdgpu_kernel void @flat_workgroup_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -10744,8 +10734,6 @@ define amdgpu_kernel void @flat_workgroup_one_as_acquire_load(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s2
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s3
; GFX12-WGP-NEXT: flat_load_b32 v2, v[0:1] scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -12549,8 +12537,6 @@ define amdgpu_kernel void @flat_workgroup_one_as_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, s2
; GFX12-WGP-NEXT: flat_atomic_swap_b32 v2, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -17226,8 +17212,6 @@ define amdgpu_kernel void @flat_workgroup_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -18664,8 +18648,6 @@ define amdgpu_kernel void @flat_workgroup_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s1
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
@@ -20134,8 +20116,6 @@ define amdgpu_kernel void @flat_workgroup_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: flat_atomic_cmpswap_b32 v2, v[0:1], v[2:3] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: v_mov_b32_e32 v0, s0
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
index b56860991b1948..6b36d130ba44b8 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-agent.ll
@@ -593,8 +593,6 @@ define amdgpu_kernel void @global_agent_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -608,8 +606,6 @@ define amdgpu_kernel void @global_agent_acquire_load(
; GFX12-CU-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -2710,8 +2706,6 @@ define amdgpu_kernel void @global_agent_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s2
; GFX12-WGP-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -2725,8 +2719,6 @@ define amdgpu_kernel void @global_agent_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s2
; GFX12-CU-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -7734,8 +7726,6 @@ define amdgpu_kernel void @global_agent_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -7754,8 +7744,6 @@ define amdgpu_kernel void @global_agent_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -9174,8 +9162,6 @@ define amdgpu_kernel void @global_agent_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -9194,8 +9180,6 @@ define amdgpu_kernel void @global_agent_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -10684,8 +10668,6 @@ define amdgpu_kernel void @global_agent_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -10709,8 +10691,6 @@ define amdgpu_kernel void @global_agent_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -12213,8 +12193,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -12228,8 +12206,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_load(
; GFX12-CU-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -14330,8 +14306,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s2
; GFX12-WGP-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -14345,8 +14319,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s2
; GFX12-CU-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -19354,8 +19326,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -19374,8 +19344,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -20512,8 +20480,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -20532,8 +20498,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -22022,8 +21986,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_DEV
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -22047,8 +22009,6 @@ define amdgpu_kernel void @global_agent_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_DEV
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_DEV
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll
index 62a4f3b43b2dcd..a8f7d8468449e5 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-system.ll
@@ -595,8 +595,6 @@ define amdgpu_kernel void @global_system_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -610,8 +608,6 @@ define amdgpu_kernel void @global_system_acquire_load(
; GFX12-CU-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -2732,8 +2728,6 @@ define amdgpu_kernel void @global_system_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s2
; GFX12-WGP-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -2747,8 +2741,6 @@ define amdgpu_kernel void @global_system_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s2
; GFX12-CU-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -6678,8 +6670,6 @@ define amdgpu_kernel void @global_system_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -6698,8 +6688,6 @@ define amdgpu_kernel void @global_system_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -7848,8 +7836,6 @@ define amdgpu_kernel void @global_system_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -7868,8 +7854,6 @@ define amdgpu_kernel void @global_system_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -9378,8 +9362,6 @@ define amdgpu_kernel void @global_system_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -9403,8 +9385,6 @@ define amdgpu_kernel void @global_system_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -10921,8 +10901,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -10936,8 +10914,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_load(
; GFX12-CU-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -13058,8 +13034,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s2
; GFX12-WGP-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -13073,8 +13047,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_ret_atomicrmw(
; GFX12-CU-NEXT: s_wait_kmcnt 0x0
; GFX12-CU-NEXT: v_mov_b32_e32 v1, s2
; GFX12-CU-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -18140,8 +18112,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -18160,8 +18130,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_monotonic_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -19594,8 +19562,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -19614,8 +19580,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-CU-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-CU-NEXT: v_mov_b32_e32 v2, v3
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -21124,8 +21088,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SYS
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -21149,8 +21111,6 @@ define amdgpu_kernel void @global_system_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: s_wait_storecnt 0x0
; GFX12-CU-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
-; GFX12-CU-NEXT: s_wait_bvhcnt 0x0
-; GFX12-CU-NEXT: s_wait_samplecnt 0x0
; GFX12-CU-NEXT: s_wait_loadcnt 0x0
; GFX12-CU-NEXT: global_inv scope:SCOPE_SYS
; GFX12-CU-NEXT: global_store_b32 v0, v1, s[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll
index a98efb49b4b72b..84ec44f7d00fa4 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-volatile.ll
@@ -838,8 +838,6 @@ define amdgpu_kernel void @global_volatile_workgroup_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
diff --git a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll
index 30bf4920715352..7ed5582db46c50 100644
--- a/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll
+++ b/llvm/test/CodeGen/AMDGPU/memory-legalizer-global-workgroup.ll
@@ -583,8 +583,6 @@ define amdgpu_kernel void @global_workgroup_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -2572,8 +2570,6 @@ define amdgpu_kernel void @global_workgroup_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s2
; GFX12-WGP-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -7205,8 +7201,6 @@ define amdgpu_kernel void @global_workgroup_acquire_monotonic_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -8565,8 +8559,6 @@ define amdgpu_kernel void @global_workgroup_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -9969,8 +9961,6 @@ define amdgpu_kernel void @global_workgroup_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt_dscnt 0x0
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -11415,8 +11405,6 @@ define amdgpu_kernel void @global_workgroup_one_as_acquire_load(
; GFX12-WGP-NEXT: s_load_b64 s[0:1], s[0:1], 0x8
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: global_load_b32 v1, v0, s[2:3] scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -13360,8 +13348,6 @@ define amdgpu_kernel void @global_workgroup_one_as_acquire_ret_atomicrmw(
; GFX12-WGP-NEXT: s_wait_kmcnt 0x0
; GFX12-WGP-NEXT: v_mov_b32_e32 v1, s2
; GFX12-WGP-NEXT: global_atomic_swap_b32 v1, v0, v1, s[0:1] th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -17889,8 +17875,6 @@ define amdgpu_kernel void @global_workgroup_one_as_acquire_monotonic_ret_cmpxchg
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -19225,8 +19209,6 @@ define amdgpu_kernel void @global_workgroup_one_as_acquire_acquire_ret_cmpxchg(
; GFX12-WGP-NEXT: ; kill: def $vgpr1 killed $vgpr1 def $vgpr1_vgpr2 killed $exec
; GFX12-WGP-NEXT: v_mov_b32_e32 v2, v3
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]
@@ -20590,8 +20572,6 @@ define amdgpu_kernel void @global_workgroup_one_as_acquire_seq_cst_ret_cmpxchg(
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: s_wait_storecnt 0x0
; GFX12-WGP-NEXT: global_atomic_cmpswap_b32 v1, v0, v[1:2], s[0:1] offset:16 th:TH_ATOMIC_RETURN scope:SCOPE_SE
-; GFX12-WGP-NEXT: s_wait_bvhcnt 0x0
-; GFX12-WGP-NEXT: s_wait_samplecnt 0x0
; GFX12-WGP-NEXT: s_wait_loadcnt 0x0
; GFX12-WGP-NEXT: global_inv scope:SCOPE_SE
; GFX12-WGP-NEXT: global_store_b32 v0, v1, s[0:1]