[llvm] [AMDGPU] Make globally-addressable-scratch opt-in (PR #189555)
via llvm-commits
llvm-commits at lists.llvm.org
Tue Mar 31 01:05:15 PDT 2026
llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-llvm-transforms
Author: Pierre van Houtryve (Pierre-vh)
<details>
<summary>Changes</summary>
This feature is meant to be opt-in for more advanced users, not enabled by default.
Enabling it may reduce performance, because the compiler can no longer assume the
private address space is thread-local.
- Add a `HasGloballyAddressableScratchSupport` feature that indicates whether a target's
  scratch addressing changes because the hardware supports globally addressable scratch.
- Use `EnableGloballyAddressableScratch` to check whether the user opted into
  globally addressable scratch. This affects whether scratch atomics are lowered as flat
  operations, and in the future will affect whether NV=1 can be set on scratch accesses.
---
Patch is 2.56 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/189555.diff
19 Files Affected:
- (modified) llvm/docs/AMDGPUUsage.rst (+72-72)
- (modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+12-5)
- (modified) llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp (+4-4)
- (modified) llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp (+2-2)
- (modified) llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp (+3-3)
- (modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5)
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+7-7)
- (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+1-1)
- (modified) llvm/lib/Target/AMDGPU/SIMemoryLegalizer.cpp (+1-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-agent.ll (+4816-3612)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-cluster.ll (+4595-3391)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-lastuse.ll (+5-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-nontemporal.ll (+5-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-singlethread.ll (+4412-3193)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-system.ll (+4584-3436)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-volatile.ll (+5-1)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-wavefront.ll (+4412-3193)
- (modified) llvm/test/CodeGen/AMDGPU/memory-legalizer-private-workgroup.ll (+4564-3345)
- (modified) llvm/test/Transforms/AtomicExpand/AMDGPU/expand-atomic-private-gas.ll (+152-69)
``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 1ede5ca2d4cf6..388e1b33fe12d 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -541,27 +541,21 @@ Every processor supports every OS ABI (see :ref:`amdgpu-os`) with the following
work-item
IDs
- ``gfx1250`` ``amdgcn`` APU - Architected *TBA*
- flat
- scratch .. TODO::
+ ``gfx1250`` ``amdgcn`` APU - globally- - Architected *TBA*
+ addressable- flat
+ scratch scratch .. TODO::
- Packed
work-item Add product
IDs names.
- - Globally
- Accessible
- Scratch
- Workgroup
Clusters
- ``gfx1251`` ``amdgcn`` APU - Architected *TBA*
- flat
- scratch .. TODO::
+ ``gfx1251`` ``amdgcn`` APU - globally- - Architected *TBA*
+ addressable- flat
+ scratch scratch .. TODO::
- Packed
work-item Add product
IDs names.
- - Globally
- Accessible
- Scratch
- Workgroup
Clusters
@@ -753,64 +747,70 @@ For example:
.. table:: AMDGPU Target Features
:name: amdgpu-target-features-table
- =============== ============================ ==================================================
- Target Feature Clang Option to Control Description
+ ============================= ============================ ==================================================
+ Target Feature Clang Option to Control Description
Name
- =============== ============================ ==================================================
- cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
- when generating code for kernels. When disabled
- native WGP wavefront execution mode is used,
- when enabled CU wavefront execution mode is used
- (see :ref:`amdgpu-amdhsa-memory-model`).
-
- sramecc - ``-mcpu`` If specified, generate code that can only be
- - ``--offload-arch`` loaded and executed in a process that has a
- matching setting for SRAMECC.
-
- If not specified for code object V2 to V3, generate
- code that can be loaded and executed in a process
- with SRAMECC enabled.
-
- If not specified for code object V4 or above, generate
- code that can be loaded and executed in a process
- with either setting of SRAMECC.
-
- tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
- work-groups are launched in threadgroup split mode.
- When enabled the waves of a work-group may be
- launched in different CUs.
-
- wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
- generating code for kernels. When disabled
- native wavefront size 32 is used, when enabled
- wavefront size 64 is used.
-
- xnack - ``-mcpu`` If specified, generate code that can only be
- - ``--offload-arch`` loaded and executed in a process that has a
- matching setting for XNACK replay.
-
- If not specified for code object V2 to V3, generate
- code that can be loaded and executed in a process
- with XNACK replay enabled.
-
- If not specified for code object V4 or above, generate
- code that can be loaded and executed in a process
- with either setting of XNACK replay.
-
- XNACK replay can be used for demand paging and
- page migration. If enabled in the device, then if
- a page fault occurs the code may execute
- incorrectly unless generated with XNACK replay
- enabled, or generated for code object V4 or above without
- specifying XNACK replay. Executing code that was
- generated with XNACK replay enabled, or generated
- for code object V4 or above without specifying XNACK replay,
- on a device that does not have XNACK replay
- enabled will execute correctly but may be less
- performant than code generated for XNACK replay
- disabled.
-
- =============== ============================ ==================================================
+ ============================= ============================ ==================================================
+ cumode - ``-m[no-]cumode`` Control the wavefront execution mode used
+ when generating code for kernels. When disabled
+ native WGP wavefront execution mode is used,
+ when enabled CU wavefront execution mode is used
+ (see :ref:`amdgpu-amdhsa-memory-model`).
+
+ sramecc - ``-mcpu`` If specified, generate code that can only be
+ - ``--offload-arch`` loaded and executed in a process that has a
+ matching setting for SRAMECC.
+
+ If not specified for code object V2 to V3, generate
+ code that can be loaded and executed in a process
+ with SRAMECC enabled.
+
+ If not specified for code object V4 or above, generate
+ code that can be loaded and executed in a process
+ with either setting of SRAMECC.
+
+ tgsplit ``-m[no-]tgsplit`` Enable/disable generating code that assumes
+ work-groups are launched in threadgroup split mode.
+ When enabled the waves of a work-group may be
+ launched in different CUs.
+
+ wavefrontsize64 - ``-m[no-]wavefrontsize64`` Control the wavefront size used when
+ generating code for kernels. When disabled
+ native wavefront size 32 is used, when enabled
+ wavefront size 64 is used.
+
+ xnack - ``-mcpu`` If specified, generate code that can only be
+ - ``--offload-arch`` loaded and executed in a process that has a
+ matching setting for XNACK replay.
+
+ If not specified for code object V2 to V3, generate
+ code that can be loaded and executed in a process
+ with XNACK replay enabled.
+
+ If not specified for code object V4 or above, generate
+ code that can be loaded and executed in a process
+ with either setting of XNACK replay.
+
+ XNACK replay can be used for demand paging and
+ page migration. If enabled in the device, then if
+ a page fault occurs the code may execute
+ incorrectly unless generated with XNACK replay
+ enabled, or generated for code object V4 or above without
+ specifying XNACK replay. Executing code that was
+ generated with XNACK replay enabled, or generated
+ for code object V4 or above without specifying XNACK replay,
+ on a device that does not have XNACK replay
+ enabled will execute correctly but may be less
+ performant than code generated for XNACK replay
+ disabled.
+
+ globally-addressable-scratch - ``--offload-arch`` When enabled, scratch (private) memory can be shared
+ between threads without triggering undefined behavior.
+ Disabled by default as this may incur a performance penalty
+ because the compiler can no longer assume private memory is
+ thread-local when this is enabled.
+
+ ============================= ============================ ==================================================
.. _amdgpu-target-id:
@@ -1009,7 +1009,7 @@ supported for the ``amdgcn`` target.
access is not supported except by flat and scratch instructions in
GFX9-GFX11.
- On targets without "Globally Accessible Scratch" (introduced in GFX125x), code that
+ On targets without ``globally-addressable-scratch``, or if the feature is disabled, code that
manipulates the stack values in other lanes of a wavefront, such as by
``addrspacecast``-ing stack pointers to generic ones and taking offsets that reach other
lanes or by explicitly constructing the scratch buffer descriptor, triggers undefined
@@ -17350,8 +17350,8 @@ For GFX125x:
This section is currently incomplete as work on the compiler is still ongoing.
The following is a non-exhaustive list of unimplemented/undocumented features:
- non-volatile bit code sequences, globally accessing scratch atomics,
- multicast loads, barriers (including split barriers) and cooperative atomics.
+ non-volatile bit code sequences, multicast loads, barriers (including split barriers)
+ and cooperative atomics.
Scalar operations memory model needs more elaboration as well.
* Vector memory operations are performed as wavefront wide operations, with the
diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index 4455f686205a6..9d191531fa9e6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -1292,9 +1292,16 @@ defm XF32Insts : AMDGPUSubtargetFeature<"xf32-insts",
"v_mfma_f32_16x16x8_xf32 and v_mfma_f32_32x32x4_xf32"
>;
-defm GloballyAddressableScratch : AMDGPUSubtargetFeature<"globally-addressable-scratch",
- "FLAT instructions can access scratch memory for any thread in any wave",
- /*GenPredicate=*/0
+def FeatureGloballyAddressableScratchSupport : SubtargetFeature<"globally-addressable-scratch-support",
+ "HasGloballyAddressableScratchSupport",
+ "true",
+ "Hardware supports globally-addressable-scratch"
+>;
+
+def FeatureGloballyAddressableScratch : SubtargetFeature<"globally-addressable-scratch",
+ "EnableGloballyAddressableScratch",
+ "true",
+ "FLAT instructions can access scratch memory from any thread in any wave"
>;
// Enable the use of SCRATCH_STORE/LOAD_BLOCK instructions for saving and
@@ -2088,7 +2095,7 @@ def FeatureISAVersion12_50_Common : FeatureSet<
FeatureFlatBufferGlobalAtomicFaddF64Inst,
FeatureMemoryAtomicFAddF32DenormalSupport,
FeatureEmulatedSystemScopeAtomics,
- FeatureGloballyAddressableScratch,
+ FeatureGloballyAddressableScratchSupport,
FeatureKernargPreload,
FeatureVmemPrefInsts,
FeatureLshlAddU64Inst,
@@ -2190,7 +2197,7 @@ def FeatureISAVersion13 : FeatureSet<
FeatureAtomicFMinFMaxF64GlobalInsts,
FeatureAtomicFMinFMaxF64FlatInsts,
FeatureFmaMixBF16Insts,
- FeatureGloballyAddressableScratch,
+ FeatureGloballyAddressableScratchSupport,
FeatureCvtPkF16F32Inst,
FeatureF16BF16ToFP6BF6ConversionScaleInsts,
FeatureIEEEMinimumMaximumInsts,
diff --git a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
index c78ef16b00983..4025c9d92bbad 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
@@ -2356,7 +2356,7 @@ Register AMDGPULegalizerInfo::getSegmentAperture(
? AMDGPU::SRC_SHARED_BASE
: AMDGPU::SRC_PRIVATE_BASE;
assert((ApertureRegNo != AMDGPU::SRC_PRIVATE_BASE ||
- !ST.hasGloballyAddressableScratch()) &&
+ !ST.hasGloballyAddressableScratchSupport()) &&
"Cannot use src_private_base with globally addressable scratch!");
Register Dst = MRI.createGenericVirtualRegister(S64);
MRI.setRegClass(Dst, &AMDGPU::SReg_64RegClass);
@@ -2481,7 +2481,7 @@ bool AMDGPULegalizerInfo::legalizeAddrSpaceCast(
DestAS == AMDGPUAS::PRIVATE_ADDRESS)) {
auto castFlatToLocalOrPrivate = [&](const DstOp &Dst) -> Register {
if (DestAS == AMDGPUAS::PRIVATE_ADDRESS &&
- ST.hasGloballyAddressableScratch()) {
+ ST.hasGloballyAddressableScratchSupport()) {
// flat -> private with globally addressable scratch: subtract
// src_flat_scratch_base_lo.
const LLT S32 = LLT::scalar(32);
@@ -2532,7 +2532,7 @@ bool AMDGPULegalizerInfo::legalizeAddrSpaceCast(
Register SrcAsInt = B.buildPtrToInt(S32, Src).getReg(0);
if (SrcAS == AMDGPUAS::PRIVATE_ADDRESS &&
- ST.hasGloballyAddressableScratch()) {
+ ST.hasGloballyAddressableScratchSupport()) {
// For wave32: Addr = (TID[4:0] << 52) + FLAT_SCRATCH_BASE + privateAddr
// For wave64: Addr = (TID[5:0] << 51) + FLAT_SCRATCH_BASE + privateAddr
Register AllOnes = B.buildConstant(S32, -1).getReg(0);
@@ -6370,7 +6370,7 @@ bool AMDGPULegalizerInfo::legalizeIsAddrSpace(MachineInstr &MI,
Register Hi32 = Unmerge.getReg(1);
if (AddrSpace == AMDGPUAS::PRIVATE_ADDRESS &&
- ST.hasGloballyAddressableScratch()) {
+ ST.hasGloballyAddressableScratchSupport()) {
Register FlatScratchBaseHi =
B.buildInstr(AMDGPU::S_MOV_B32, {S32},
{Register(AMDGPU::SRC_FLAT_SCRATCH_BASE_HI)})
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
index d02bc45bc14f6..6750be3031da9 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
@@ -1061,7 +1061,7 @@ bool GCNTTIImpl::isSourceOfDivergence(const Value *V) const {
unsigned DstAS = Intrinsic->getType()->getPointerAddressSpace();
return SrcAS == AMDGPUAS::PRIVATE_ADDRESS &&
DstAS == AMDGPUAS::FLAT_ADDRESS &&
- ST->hasGloballyAddressableScratch();
+ ST->hasGloballyAddressableScratchSupport();
}
case Intrinsic::amdgcn_workitem_id_y:
case Intrinsic::amdgcn_workitem_id_z: {
@@ -1094,7 +1094,7 @@ bool GCNTTIImpl::isSourceOfDivergence(const Value *V) const {
if (auto *CastI = dyn_cast<AddrSpaceCastInst>(V)) {
return CastI->getSrcAddressSpace() == AMDGPUAS::PRIVATE_ADDRESS &&
CastI->getDestAddressSpace() == AMDGPUAS::FLAT_ADDRESS &&
- ST->hasGloballyAddressableScratch();
+ ST->hasGloballyAddressableScratchSupport();
}
return false;
diff --git a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
index 2236a98c58330..06311cad96efa 100644
--- a/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
+++ b/llvm/lib/Target/AMDGPU/AsmParser/AMDGPUAsmParser.cpp
@@ -1612,8 +1612,8 @@ class AMDGPUAsmParser : public MCTargetAsmParser {
return getFeatureBits()[AMDGPU::FeaturePartialNSAEncoding];
}
- bool hasGloballyAddressableScratch() const {
- return getFeatureBits()[AMDGPU::FeatureGloballyAddressableScratch];
+ bool hasGloballyAddressableScratchSupport() const {
+ return getFeatureBits()[AMDGPU::FeatureGloballyAddressableScratchSupport]...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/189555
More information about the llvm-commits
mailing list