[llvm] [AMDGPU] Make globally-addressable-scratch opt-in (PR #189555)
via llvm-commits
llvm-commits at lists.llvm.org
Tue Mar 31 01:05:15 PDT 2026
llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-llvm-transforms
Author: Pierre van Houtryve (Pierre-vh)
<details>
<summary>Changes</summary>
This feature is meant to be opt-in for more advanced users, not enabled by default.
Enabling it may reduce performance, because the compiler can no longer assume the
private address space is thread-local.
- Add a `HasGloballyAddressableScratchSupport` feature that indicates whether a target's
  scratch addressing changes because the hardware supports globally addressable scratch.
- Use `EnableGloballyAddressableScratch` to check whether the user opted into
  globally addressable scratch. This affects whether scratch atomics are lowered as flat
  operations, and in the future will affect whether NV=1 can be set on scratch accesses.
---
Patch is 2.56 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/189555.diff
19 Files Affected:
- (modified) llvm/docs/AMDGPUUsage.rst (+72-72)
- (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+12-5)
- (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+4-4)
- (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+2-2)
- (modified) llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp (+3-3)
- (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5)
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+7-7)
- (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+1-1)
- (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+1-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-agent.ll (+4816-3612)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-cluster.ll (+4595-3391)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-lastuse.ll (+5-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-nontemporal.ll (+5-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-singlethread.ll (+4412-3193)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-system.ll (+4584-3436)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+5-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-wavefront.ll (+4412-3193)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-workgroup.ll (+4564-3345)
- (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-private-gas.ll (+152-69)
``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 1ede5ca2d4cf6..388e1b33fe12d 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -541,27 +541,21 @@ Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
work-item
IDs
- ``gfx1250`` ``amdgcn`` APU - Architected *TBA*
- flat
- scratch .. TODO::
+ ``gfx1250`` ``amdgcn`` APU - globally- - Architected *TBA*
+ addressable- flat
+ scratch scratch .. TODO::
- Packed
work-item Add product
IDs names.
- - Globally
- Accessible
- Scratch
- Workgroup
Clusters
- ``gfx1251`` ``amdgcn`` APU - Architected *TBA*
- flat
- scratch .. TODO::
+ ``gfx1251`` ``amdgcn`` APU - globally- - Architected *TBA*
+ addressable- flat
+ scratch scratch .. TODO::
- Packed
work-item Add product
IDs names.
- - Globally
- Accessible
- Scratch
- Workgroup
Clusters
@@ -753,64 +747,70 @@ For example:
.. table:: AMDGPU Target Features
:name: amdgpu-target-features-table
- =============== ============================ ==================================================
- Target Feature Clang Option to Control Description
+ ============================= ============================ ==================================================
+ Target Feature Clang Option to Control Description
Name
- =============== ============================ ==================================================
- cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
- when generating code for kernels. When disabled
- native WGP wavefront execution mode is used,
- when enabled CU wavefront execution mode is used
- (see :ref:`amdgpu-amdhsa-memory-model`).
-
- sramecc - ``-mcpu`` If specified, generate code that can only be
- - ``--offload-arch`` loaded and executed in a process that has a
- matching setting for SRAMECC.
-
- If not specified for code object V2 to V3, generate
- code that can be loaded and executed in a process
- with SRAMECC enabled.
-
- If not specified for code object V4 or above, generate
- code that can be loaded and executed in a process
- with either setting of SRAMECC.
-
- tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
- work-groups are launched in threadgroup split mode.
- When enabled the waves of a work-group may be
- launched in different CUs.
-
- wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
- generating code for kernels. When disabled
- native wavefront size 32 is used, when enabled
- wavefront size 64 is used.
-
- xnack - ``-mcpu`` If specified, generate code that can only be
- - ``--offload-arch`` loaded and executed in a process that has a
- matching setting for XNACK replay.
-
- If not specified for code object V2 to V3, generate
- code that can be loaded and executed in a process
- with XNACK replay enabled.
-
- If not specified for code object V4 or above, generate
- code that can be loaded and executed in a process
- with either setting of XNACK replay.
-
- XNACK replay can be used for demand paging and
- page migration. If enabled in the device, then if
- a page fault occurs the code may execute
- incorrectly unless generated with XNACK replay
- enabled, or generated for code object V4 or above without
- specifying XNACK replay. Executing code that was
- generated with XNACK replay enabled, or generated
- for code object V4 or above without specifying XNACK replay,
- on a device that does not have XNACK replay
- enabled will execute correctly but may be less
- performant than code generated for XNACK replay
- disabled.
-
- =============== ============================ ==================================================
+ ============================= ============================ ==================================================
+ cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
+ when generating code for kernels. When disabled
+ native WGP wavefront execution mode is used,
+ when enabled CU wavefront execution mode is used
+ (see :ref:`amdgpu-amdhsa-memory-model`).
+
+ sramecc - ``-mcpu`` If specified, generate code that can only be
+ - ``--offload-arch`` loaded and executed in a process that has a
+ matching setting for SRAMECC.
+
+ If not specified for code object V2 to V3, generate
+ code that can be loaded and executed in a process
+ with SRAMECC enabled.
+
+ If not specified for code object V4 or above, generate
+ code that can be loaded and executed in a process
+ with either setting of SRAMECC.
+
+ tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
+ work-groups are launched in threadgroup split mode.
+ When enabled the waves of a work-group may be
+ launched in different CUs.
+
+ wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
+ generating code for kernels. When disabled
+ native wavefront size 32 is used, when enabled
+ wavefront size 64 is used.
+
+ xnack - ``-mcpu`` If specified, generate code that can only be
+ - ``--offload-arch`` loaded and executed in a process that has a
+ matching setting for XNACK replay.
+
+ If not specified for code object V2 to V3, generate
+ code that can be loaded and executed in a process
+ with XNACK replay enabled.
+
+ If not specified for code object V4 or above, generate
+ code that can be loaded and executed in a process
+ with either setting of XNACK replay.
+
+ XNACK replay can be used for demand paging and
+ page migration. If enabled in the device, then if
+ a page fault occurs the code may execute
+ incorrectly unless generated with XNACK replay
+ enabled, or generated for code object V4 or above without
+ specifying XNACK replay. Executing code that was
+ generated with XNACK replay enabled, or generated
+ for code object V4 or above without specifying XNACK replay,
+ on a device that does not have XNACK replay
+ enabled will execute correctly but may be less
+ performant than code generated for XNACK replay
+ disabled.
+
+ globally-addressable-scratch - ``--offload-arch`` When enabled, scratch (private) memory can be shared
+ between threads without triggering undefined behavior.
+ Disabled by default as this may incur a performance penalty
+ because the compiler can no longer assume private memory is
+ thread-local when this is enabled.
+
+ ============================= ============================ ==================================================
.. _amdgpu-target-id:
@@ -1009,7 +1009,7 @@ supported for the ``amdgcn`` target.
access is not supported except by flat and scratch instructions in
GFX9-GFX11.
- On targets without "Globally Accessible Scratch" (introduced in GFX125x), code that
+ On targets without ``globally-addressable-scratch``, or if the feature is disabled, code that
manipulates the stack values in other lanes of a wavefront, such as by
``addrspacecast``-ing stack pointers to generic ones and taking offsets that reach other
lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined
@@ -17350,8 +17350,8 @@ For GFX125x:
This section is currently incomplete as work on the compiler is still ongoing.
The following is a non-exhaustive list of unimplemented/undocumented features:
- non-volatile bit code sequences, globally accessing scratch atomics,
- multicast loads, barriers (including split barriers) and cooperative atomics.
+ non-volatile bit code sequences, multicast loads, barriers (including split barriers)
+ and cooperative atomics.
Scalar operations memory model needs more elaboration as well.
* Vector memory operations are performed as wavefront wide operations, with the
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 4455f686205a6..9d191531fa9e6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1292,9 +1292,16 @@ defm XF32Insts : AMDGPUSubtargetFeature<"xf32-insts",
"v_mfma_f32_16x16x8_xf32 and v_mfma_f32_32x32x4_xf32"
>;
-defm GloballyAddressableScratch : AMDGPUSubtargetFeature<"globally-addressable-scratch",
- "FLAT instructions can access scratch memory for any thread in any wave",
- /*GenPredicate=*/0
+def FeatureGloballyAddressableScratchSupport : SubtargetFeature<"globally-addressable-scratch-support",
+ "HasGloballyAddressableScratchSupport",
+ "true",
+ "Hardware supports globally-addressable-scratch"
+>;
+
+def FeatureGloballyAddressableScratch : SubtargetFeature<"globally-addressable-scratch",
+ "EnableGloballyAddressableScratch",
+ "true",
+ "FLAT instructions can access scratch memory from any thread in any wave"
>;
// Enable the use of SCRATCH_STORE/LOAD_BLOCK instructions for saving and
@@ -2088,7 +2095,7 @@ def FeatureISAVersion12_50_Common : FeatureSet<
FeatureFlatBufferGlobalAtomicFaddF64Inst,
FeatureMemoryAtomicFAddF32DenormalSupport,
FeatureEmulatedSystemScopeAtomics,
- FeatureGloballyAddressableScratch,
+ FeatureGloballyAddressableScratchSupport,
FeatureKernargPreload,
FeatureVmemPrefInsts,
FeatureLshlAddU64Inst,
@@ -2190,7 +2197,7 @@ def FeatureISAVersion13 : FeatureSet<
FeatureAtomicFMinFMaxF64GlobalInsts,
FeatureAtomicFMinFMaxF64FlatInsts,
FeatureFmaMixBF16Insts,
- FeatureGloballyAddressableScratch,
+ FeatureGloballyAddressableScratchSupport,
FeatureCvtPkF16F32Inst,
FeatureF16BF16ToFP6BF6ConversionScaleInsts,
FeatureIEEEMinimumMaximumInsts,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index c78ef16b00983..4025c9d92bbad 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2356,7 +2356,7 @@ Register AMDGPULegalizerInfo::getSegmentAperture(
? AMDGPU::SRC_SHARED_BASE
: AMDGPU::SRC_PRIVATE_BASE;
assert((ApertureRegNo != AMDGPU::SRC_PRIVATE_BASE ||
- !ST.hasGloballyAddressableScratch()) &&
+ !ST.hasGloballyAddressableScratchSupport()) &&
"Cannot use src_private_base with globally addressable scratch!");
Register Dst = MRI.createGenericVirtualRegister(S64);
MRI.setRegClass(Dst, &AMDGPU::SReg_64RegClass);
@@ -2481,7 +2481,7 @@ bool AMDGPULegalizerInfo::legalizeAddrSpaceCast(
DestAS == AMDGPUAS::PRIVATE_ADDRESS)) {
auto castFlatToLocalOrPrivate = [&](const DstOp &Dst) -> Register {
if (DestAS == AMDGPUAS::PRIVATE_ADDRESS &&
- ST.hasGloballyAddressableScratch()) {
+ ST.hasGloballyAddressableScratchSupport()) {
// flat -> private with globally addressable scratch: subtract
// src_flat_scratch_base_lo.
const LLT S32 = LLT::scalar(32);
@@ -2532,7 +2532,7 @@ bool AMDGPULegalizerInfo::legalizeAddrSpaceCast(
Register SrcAsInt = B.buildPtrToInt(S32, Src).getReg(0);
if (SrcAS == AMDGPUAS::PRIVATE_ADDRESS &&
- ST.hasGloballyAddressableScratch()) {
+ ST.hasGloballyAddressableScratchSupport()) {
// For wave32: Addr = (TID[4:0] << 52) + FLAT_SCRATCH_BASE + privateAddr
// For wave64: Addr = (TID[5:0] << 51) + FLAT_SCRATCH_BASE + privateAddr
Register AllOnes = B.buildConstant(S32, -1).getReg(0);
@@ -6370,7 +6370,7 @@ bool AMDGPULegalizerInfo::legalizeIsAddrSpace(MachineInstr &MI,
Register Hi32 = Unmerge.getReg(1);
if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS &&
- ST.hasGloballyAddressableScratch()) {
+ ST.hasGloballyAddressableScratchSupport()) {
Register FlatScratchBaseHi =
B.buildInstr(AMDGPU::S_MOV_B32, {S32},
{Register(AMDGPU::SRC_FLAT_SCRATCH_BASE_HI)})
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index d02bc45bc14f6..6750be3031da9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -1061,7 +1061,7 @@ bool GCNTTIImpl::isSourceOfDivergence(const Value *V) const {
unsigned DstAS = Intrinsic->getType()->getPointerAddressSpace();
return SrcAS == AMDGPUAS::PRIVATE_ADDRESS &&
DstAS == AMDGPUAS::FLAT_ADDRESS &&
- ST->hasGloballyAddressableScratch();
+ ST->hasGloballyAddressableScratchSupport();
}
case Intrinsic::amdgcn_workitem_id_y:
case Intrinsic::amdgcn_workitem_id_z: {
@@ -1094,7 +1094,7 @@ bool GCNTTIImpl::isSourceOfDivergence(const Value *V) const {
if (auto *CastI = dyn_cast<AddrSpaceCastInst>(V)) {
return CastI->getSrcAddressSpace() == AMDGPUAS::PRIVATE_ADDRESS &&
CastI->getDestAddressSpace() == AMDGPUAS::FLAT_ADDRESS &&
- ST->hasGloballyAddressableScratch();
+ ST->hasGloballyAddressableScratchSupport();
}
return false;
diff --git a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
index 2236a98c58330..06311cad96efa 100644
--- a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
+++ b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
@@ -1612,8 +1612,8 @@ class AMDGPUAsmParser : public MCTargetAsmParser {
return getFeatureBits()[AMDGPU::FeaturePartialNSAEncoding];
}
- bool hasGloballyAddressableScratch() const {
- return getFeatureBits()[AMDGPU::FeatureGloballyAddressableScratch];
+ bool hasGloballyAddressableScratchSupport() const {
+ return getFeatureBits()[AMDGPU::FeatureGloballyAddressableScratchSupport]...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/189555
More information about the llvm-commits
mailing list