[llvm-branch-commits] [llvm] [AMDGPU] Add amdgcn.av.global.(load|store).b128 intrinsics (PR #191390)
via llvm-branch-commits
llvm-branch-commits at lists.llvm.org
Fri Apr 10 03:58:45 PDT 2026
llvmbot wrote:
@llvm/pr-subscribers-llvm-ir
Author: Sameer Sahasrabuddhe (ssahasra)
<details>
<summary>Changes</summary>
The two new intrinsics, llvm.amdgcn.av.global.load.b128 and llvm.amdgcn.av.global.store.b128, have availability and visibility semantics as described in #191246. Each takes a scope operand that is translated to target-specific cache policy bits. This lets the user control how the side effects of these loads and stores are made visible to other threads.
This patch was extracted from #172090.
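
As a rough illustration of the scope-to-cache-policy translation, the sketch below encodes two rows of the tables in the AMDGPUUsage.rst hunk further down: gfx942/gfx950 and gfx120*. These are the only rows with an entry in all four scope columns, and they are identical in the load and store tables. The names `Target` and `modifiersFor` are hypothetical and this is not the backend's code. It also assumes that the "(wave)" and "(CU)" cells mean that no modifier is emitted.

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum covering only the two table rows modelled here.
enum class Target { Gfx942Or950, Gfx120 };

// Returns the instruction modifier text that the documentation table pairs
// with a scope string. An empty string_view means "no modifier" (assumed
// reading of the "(wave)" and "(CU)" cells). std::nullopt means the scope
// string is not one of the table's column headings.
std::optional<std::string_view> modifiersFor(Target T, std::string_view Scope) {
  const bool Gfx12 = (T == Target::Gfx120);
  if (Scope == "wavefront")
    return std::string_view();
  if (Scope == "workgroup")
    return std::string_view(Gfx12 ? "scope:SCOPE_SE" : "sc0");
  if (Scope == "agent")
    return std::string_view(Gfx12 ? "scope:SCOPE_DEV" : "sc1");
  if (Scope.empty()) // "" selects system scope in the table
    return std::string_view(Gfx12 ? "scope:SCOPE_SYS" : "sc0 sc1");
  return std::nullopt;
}
```

For example, `modifiersFor(Target::Gfx942Or950, "agent")` yields `sc1`. A string that is not a column heading, such as the `"wavegroup"` used in the signature comments, yields `std::nullopt` in this sketch; how the real lowering treats such a string is not visible in the truncated patch.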
Co-authored-by: macurtis-amd <macurtis@amd.com>
Assisted-by: Claude Opus 4.6
---
Patch is 1.76 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/191390.diff
12 Files Affected:
- (modified) llvm/docs/AMDGPUUsage.rst (+105)
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+25)
- (modified) llvm/lib/IR/Verifier.cpp (+11)
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
- (modified) llvm/lib/Target/AMDGPU/FLATInstructions.td (+20-7)
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+23)
- (added) llvm/test/CodeGen/AMDGPU/amdgcn-av-scopes.ll (+741)
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.av.global.load.b128.ll (+30869)
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.av.global.store.b128.ll (+3888)
- (added) llvm/test/CodeGen/AMDGPU/unsupported-av-global-load.ll (+22)
- (added) llvm/test/CodeGen/AMDGPU/unsupported-av-global-store.ll (+22)
- (added) llvm/test/Verifier/AMDGPU/intrinsics-av.ll (+15)
``````````diff
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 84db071445f53..bb7b43401f954 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1775,6 +1775,111 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
* :ref:`Synchronization Scope<amdgpu-intrinsics-syncscope-metadata-operand>`.
Note that the scope used must ensure that the L2 cache will be hit.
+ llvm.amdgcn.av.global.load.b128 This intrinsic is supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+ Signature:
+
+ .. code-block:: llvm
+
+ <4 x i32> @llvm.amdgcn.av.global.load.b128(
+ ptr addrspace(1), ; source
+ metadata) ; scope - e.g. '!0' where '!0 = !{!"wavegroup"}'
+
+ Reads the value from the source address with cache behavior specified by the scope.
+
+ The following table shows the mapping between valid scope values and target
+ instruction flags or field values.
+
+ ============== ========================== ========================== ========================== ========================== ==========================
+ targets instruction ``"wavefront"`` ``"workgroup"`` ``"agent"`` ``""`` (empty string)
+ ============== ========================== ========================== ========================== ========================== ==========================
+ gfx90* ``global_load_dwordx4`` ``glc`` ``glc``
+
+ gfx942, gfx950 ``global_load_dwordx4`` (wave) ``sc0`` (group) ``sc1`` (device) ``sc0 sc1`` (system)
+
+ gfx10* ``global_load_dwordx4`` ``glc`` ``glc dlc`` ``glc dlc``
+
+ gfx11* ``global_load_dwordx4`` ``glc`` ``glc`` ``glc``
+
+ gfx120* ``av_global_load_b128`` (CU) ``scope:SCOPE_SE`` (SE) ``scope:SCOPE_DEV`` (DEV) ``scope:SCOPE_SYS`` (SYS)
+
+ gfx125* ``av_global_load_b128`` (CU) ``scope:SCOPE_DEV`` (DEV) ``scope:SCOPE_SYS`` (SYS)
+ ============== ========================== ========================== ========================== ========================== ==========================
+
+ For gfx90*, see "GLC Bit Explained" in the appropriate instruction set reference
+ (e.g. Chapter 9.1.10 in "AMD Instinct MI100" Instruction Set Architecture Reference
+ Guide).
+
+ For gfx942 and gfx950 targets, see "Memory Scope and Temporal Controls" in the
+ appropriate instruction set reference (e.g. Chapter 9.1.10.2 in the "AMD Instinct
+ MI300" Instruction Set Architecture Reference Guide).
+
+ For gfx10* targets, see "GLC, DLC and SLC Bit Explained" in the appropriate
+ instruction set reference (e.g. Chapter 8.1.10 in "RDNA 2" Instruction Set Architecture
+ Reference Guide)
+
+ For gfx11* targets, see "Cache Controls: SLC, GLC and DLC" in the appropriate
+ instruction set reference (e.g. Chapter 4.1.1 in "RDNA3" Instruction Set Architecture
+ Reference Guide).
+
+ For gfx12* targets, see "Cache Controls: SCOPE and Temporal-Hint" in the
+ appropriate instruction set reference (e.g. Chapter 4.1.1 in the "RDNA4"
+ Instruction Set Architecture Reference Guide).
+
+
+ llvm.amdgcn.av.global.store.b128 This intrinsic is supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+ Signature:
+
+ .. code-block:: llvm
+
+ void @llvm.amdgcn.av.global.store.b128(
+ ptr addrspace(1), ; destination
+ <4 x i32>, ; value
+ metadata) ; scope - e.g. '!0' where '!0 = !{!"wavegroup"}'
+
+ Writes the value to the destination address with cache
+ behavior specified by the scope.
+
+ The following table shows the mapping between valid scope values and target
+ instruction flags or field values.
+
+ ============== ========================== ========================== ========================== ========================== ==========================
+ targets instruction ``"wavefront"`` ``"workgroup"`` ``"agent"`` ``""`` (empty string)
+ ============== ========================== ========================== ========================== ========================== ==========================
+ gfx90* ``global_store_dwordx4``
+
+ gfx942, gfx950 ``global_store_dwordx4`` (wave) ``sc0`` (group) ``sc1`` (device) ``sc0 sc1`` (system)
+
+ gfx10* ``global_store_dwordx4``
+
+ gfx11* ``global_store_dwordx4``
+
+ gfx120* ``av_global_store_b128`` (CU) ``scope:SCOPE_SE`` (SE) ``scope:SCOPE_DEV`` (DEV) ``scope:SCOPE_SYS`` (SYS)
+
+ gfx125* ``av_global_store_b128`` (CU) ``scope:SCOPE_DEV`` (DEV) ``scope:SCOPE_SYS`` (SYS)
+ ============== ========================== ========================== ========================== ========================== ==========================
+
+ For gfx90*, see "GLC Bit Explained" in the appropriate instruction set reference
+ (e.g. Chapter 9.1.10 in "AMD Instinct MI100" Instruction Set Architecture Reference
+ Guide).
+
+ For gfx942 and gfx950 targets, see "Memory Scope and Temporal Controls" in the
+ appropriate instruction set reference (e.g. Chapter 9.1.10.2 in the "AMD Instinct
+ MI300" Instruction Set Architecture Reference Guide).
+
+ For gfx10* targets, see "GLC, DLC and SLC Bit Explained" in the appropriate
+ instruction set reference (e.g. Chapter 8.1.10 in "RDNA 2" Instruction Set
+ Architecture Reference Guide)
+
+ For gfx11* targets, see "Cache Controls: SLC, GLC and DLC" in the appropriate
+ instruction set reference (e.g. Chapter 4.1.1 in "RDNA3" Instruction Set
+ Architecture Reference Guide).
+
+ For gfx12* targets, see "Cache Controls: SCOPE and Temporal-Hint" in the
+ appropriate instruction set reference (e.g. Chapter 4.1.1 in the "RDNA4"
+ Instruction Set Architecture Reference Guide).
+
============================================== ==========================================================
.. TODO::
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 9528fb2b446bc..9fd2fee7c82fe 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -902,6 +902,31 @@ def int_amdgcn_bitop3 :
[LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty],
[ImmArg<ArgIndex<3>>]>;
+class AMDGPUAVGlobalStore : Intrinsic <
+ [],
+ [global_ptr_ty, // Base global pointer to store to
+ llvm_v4i32_ty, // Data to store
+ llvm_metadata_ty], // Scope
+ [ IntrWriteMem, WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>,
+ NonNull<ArgIndex<0>>, IntrWillReturn, IntrNoCallback, IntrNoFree ],
+ "",
+ [SDNPMemOperand, SDNPMayStore]
+>;
+
+def int_amdgcn_av_global_store_b128 : AMDGPUAVGlobalStore;
+
+class AMDGPUAVGlobalLoad : Intrinsic <
+ [llvm_v4i32_ty],
+ [global_ptr_ty, // Base global pointer to load from
+ llvm_metadata_ty], // Scope
+ [ IntrReadMem, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>,
+ NonNull<ArgIndex<0>>, IntrWillReturn, IntrNoCallback, IntrNoFree ],
+ "",
+ [SDNPMemOperand, SDNPMayLoad]
+>;
+
+def int_amdgcn_av_global_load_b128 : AMDGPUAVGlobalLoad;
+
} // TargetPrefix = "amdgcn"
// New-style image intrinsics
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index a86e8fdb7d73a..13294a6bb4bbc 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -7239,6 +7239,17 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
&Call, Op);
break;
}
+ case Intrinsic::amdgcn_av_global_load_b128:
+ case Intrinsic::amdgcn_av_global_store_b128: {
+ // Last argument must be a MD string
+ auto *Op = cast<MetadataAsValue>(Call.getArgOperand(Call.arg_size() - 1));
+ auto *MD = dyn_cast<MDNode>(Op->getMetadata());
+ Check(MD && (MD->getNumOperands() == 1) && isa<MDString>(MD->getOperand(0)),
+ "global load/store intrinsics require that the last argument is a "
+ "metadata string",
+ &Call, Op);
+ break;
+ }
case Intrinsic::nvvm_setmaxnreg_inc_sync_aligned_u32:
case Intrinsic::nvvm_setmaxnreg_dec_sync_aligned_u32: {
Value *V = Call.getArgOperand(0);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index ebdd709c34f08..da19a8c14f197 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -5668,6 +5668,9 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
case Intrinsic::amdgcn_flat_prefetch:
case Intrinsic::amdgcn_global_prefetch:
return getDefaultMappingVOP(MI);
+ case Intrinsic::amdgcn_av_global_load_b128:
+ case Intrinsic::amdgcn_av_global_store_b128:
+ return getDefaultMappingAllVGPR(MI);
default:
return getInvalidInstructionMapping();
}
diff --git a/llvm/lib/Target/AMDGPU/FLATInstructions.td b/llvm/lib/Target/AMDGPU/FLATInstructions.td
index c0fb73df9c764..c156ec148de76 100644
--- a/llvm/lib/Target/AMDGPU/FLATInstructions.td
+++ b/llvm/lib/Target/AMDGPU/FLATInstructions.td
@@ -1827,6 +1827,19 @@ multiclass GlobalFLATStorePats<FLAT_Pseudo inst, SDPatternOperator node,
}
}
+def av_global_load_b128_intrin_pat : PatFrag<
+ (ops node:$ptr),
+ (int_amdgcn_av_global_load_b128 $ptr, srcvalue)>;
+
+def av_global_store_b128_intrin_pat : PatFrag<
+ (ops node:$data, node:$ptr),
+ (int_amdgcn_av_global_store_b128 $ptr, $data, srcvalue)>;
+
+let SubtargetPredicate = HasFlatGlobalInsts in {
+defm : GlobalFLATLoadPats <GLOBAL_LOAD_DWORDX4, av_global_load_b128_intrin_pat, v4i32>;
+defm : GlobalFLATStorePats <GLOBAL_STORE_DWORDX4, av_global_store_b128_intrin_pat, v4i32>;
+}
+
multiclass GlobalFLATStorePats_D16_t16<string inst, SDPatternOperator node, ValueType vt> {
def : FlatStoreSignedPat<!cast<FLAT_Pseudo>(inst#"_t16"), node, vt> {
let AddedComplexity = 10;
@@ -3326,13 +3339,13 @@ defm GLOBAL_LOAD_SSHORT : GLOBAL_Real_AllAddr_gfx11<0x013, "global_load_
defm GLOBAL_LOAD_DWORD : GLOBAL_Real_AllAddr_gfx11<0x014, "global_load_b32">;
defm GLOBAL_LOAD_DWORDX2 : GLOBAL_Real_AllAddr_gfx11<0x015, "global_load_b64">;
defm GLOBAL_LOAD_DWORDX3 : GLOBAL_Real_AllAddr_gfx11<0x016, "global_load_b96">;
-defm GLOBAL_LOAD_DWORDX4 : GLOBAL_Real_AllAddr_gfx11<0x017, "global_load_b128">;
+defm GLOBAL_LOAD_DWORDX4 : GLOBAL_Real_AllAddr_gfx11<0x017, "av_global_load_b128">;
defm GLOBAL_STORE_BYTE : GLOBAL_Real_AllAddr_gfx11<0x018, "global_store_b8">;
defm GLOBAL_STORE_SHORT : GLOBAL_Real_AllAddr_gfx11<0x019, "global_store_b16">;
defm GLOBAL_STORE_DWORD : GLOBAL_Real_AllAddr_gfx11<0x01a, "global_store_b32">;
defm GLOBAL_STORE_DWORDX2 : GLOBAL_Real_AllAddr_gfx11<0x01b, "global_store_b64">;
defm GLOBAL_STORE_DWORDX3 : GLOBAL_Real_AllAddr_gfx11<0x01c, "global_store_b96">;
-defm GLOBAL_STORE_DWORDX4 : GLOBAL_Real_AllAddr_gfx11<0x01d, "global_store_b128">;
+defm GLOBAL_STORE_DWORDX4 : GLOBAL_Real_AllAddr_gfx11<0x01d, "av_global_store_b128">;
defm GLOBAL_LOAD_UBYTE_D16 : GLOBAL_Real_AllAddr_gfx11<0x01e, "global_load_d16_u8">;
defm GLOBAL_LOAD_SBYTE_D16 : GLOBAL_Real_AllAddr_gfx11<0x01f, "global_load_d16_i8">;
defm GLOBAL_LOAD_SHORT_D16 : GLOBAL_Real_AllAddr_gfx11<0x020, "global_load_d16_b16">;
@@ -3535,13 +3548,13 @@ defm GLOBAL_LOAD_SSHORT : VFLAT_Real_AllAddr_gfx12<0x013, "global_loa
defm GLOBAL_LOAD_DWORD : VFLAT_Real_AllAddr_gfx12<0x014, "global_load_b32">;
defm GLOBAL_LOAD_DWORDX2 : VFLAT_Real_AllAddr_gfx12<0x015, "global_load_b64">;
defm GLOBAL_LOAD_DWORDX3 : VFLAT_Real_AllAddr_gfx12<0x016, "global_load_b96">;
-defm GLOBAL_LOAD_DWORDX4 : VFLAT_Real_AllAddr_gfx12<0x017, "global_load_b128">;
+defm GLOBAL_LOAD_DWORDX4 : VFLAT_Real_AllAddr_gfx12<0x017, "av_global_load_b128">;
defm GLOBAL_STORE_BYTE : VFLAT_Real_AllAddr_gfx12<0x018, "global_store_b8">;
defm GLOBAL_STORE_SHORT : VFLAT_Real_AllAddr_gfx12<0x019, "global_store_b16">;
defm GLOBAL_STORE_DWORD : VFLAT_Real_AllAddr_gfx12<0x01a, "global_store_b32">;
defm GLOBAL_STORE_DWORDX2 : VFLAT_Real_AllAddr_gfx12<0x01b, "global_store_b64">;
defm GLOBAL_STORE_DWORDX3 : VFLAT_Real_AllAddr_gfx12<0x01c, "global_store_b96">;
-defm GLOBAL_STORE_DWORDX4 : VFLAT_Real_AllAddr_gfx12<0x01d, "global_store_b128">;
+defm GLOBAL_STORE_DWORDX4 : VFLAT_Real_AllAddr_gfx12<0x01d, "av_global_store_b128">;
defm GLOBAL_LOAD_UBYTE_D16 : VFLAT_Real_AllAddr_gfx12<0x01e, "global_load_d16_u8">;
defm GLOBAL_LOAD_SBYTE_D16 : VFLAT_Real_AllAddr_gfx12<0x01f, "global_load_d16_i8">;
defm GLOBAL_LOAD_SHORT_D16 : VFLAT_Real_AllAddr_gfx12<0x020, "global_load_d16_b16">;
@@ -3709,7 +3722,7 @@ defm GLOBAL_LOAD_TR4_B64 : VFLAT_Real_AllAddr_gfx1250<0x073>;
defm GLOBAL_LOAD_TR6_B96 : VFLAT_Real_AllAddr_gfx1250<0x074>;
// Additional aliases for global load transpose instructions.
-def : MnemonicAlias<"global_load_b128_tr_b16", "global_load_tr16_b128">, Requires<[isGFX125xOnly]>;
+def : MnemonicAlias<"av_global_load_b128_tr_b16", "global_load_tr16_b128">, Requires<[isGFX125xOnly]>;
def : MnemonicAlias<"global_load_b64_tr_b8", "global_load_tr8_b64">, Requires<[isGFX125xOnly]>;
def : MnemonicAlias<"global_load_b64_tr_b4", "global_load_tr4_b64">, Requires<[isGFX125xOnly]>;
def : MnemonicAlias<"global_load_b96_tr_b6", "global_load_tr6_b96">, Requires<[isGFX125xOnly]>;
@@ -3852,7 +3865,7 @@ defm GLOBAL_LOAD_USHORT : VFLAT_Real_AllAddr_gfx13<0x0a, "global_l
defm GLOBAL_LOAD_SSHORT : VFLAT_Real_AllAddr_gfx13<0x0b, "global_load_i16">;
defm GLOBAL_LOAD_DWORD : VFLAT_Real_AllAddr_gfx13<0x0c, "global_load_b32">;
defm GLOBAL_LOAD_DWORDX2 : VFLAT_Real_AllAddr_gfx13<0x0d, "global_load_b64">;
-defm GLOBAL_LOAD_DWORDX4 : VFLAT_Real_AllAddr_gfx13<0x0e, "global_load_b128">;
+defm GLOBAL_LOAD_DWORDX4 : VFLAT_Real_AllAddr_gfx13<0x0e, "av_global_load_b128">;
defm GLOBAL_LOAD_DWORDX3 : VFLAT_Real_AllAddr_gfx13<0x0f, "global_load_b96">;
defm GLOBAL_LOAD_DWORD_ADDTID : VFLAT_Real_AllAddr_gfx13<0x16, "global_load_addtid_b32">;
defm GLOBAL_STORE_DWORD_ADDTID : VFLAT_Real_AllAddr_gfx13<0x17, "global_store_addtid_b32">;
@@ -3862,7 +3875,7 @@ defm GLOBAL_STORE_SHORT : VFLAT_Real_AllAddr_gfx13<0x1a, "global_s
defm GLOBAL_STORE_SHORT_D16_HI : VFLAT_Real_AllAddr_gfx13<0x1b, "global_store_d16_hi_b16">;
defm GLOBAL_STORE_DWORD : VFLAT_Real_AllAddr_gfx13<0x1c, "global_store_b32">;
defm GLOBAL_STORE_DWORDX2 : VFLAT_Real_AllAddr_gfx13<0x1d, "global_store_b64">;
-defm GLOBAL_STORE_DWORDX4 : VFLAT_Real_AllAddr_gfx13<0x1e, "global_store_b128">;
+defm GLOBAL_STORE_DWORDX4 : VFLAT_Real_AllAddr_gfx13<0x1e, "av_global_store_b128">;
defm GLOBAL_STORE_DWORDX3 : VFLAT_Real_AllAddr_gfx13<0x1f, "global_store_b96">;
defm GLOBAL_LOAD_UBYTE_D16 : VFLAT_Real_AllAddr_gfx13<0x20, "global_load_d16_u8">;
defm GLOBAL_LOAD_UBYTE_D16_HI : VFLAT_Real_AllAddr_gfx13<0x21, "global_load_d16_hi_u8">;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 6750dfcbaac62..2393f2aa47244 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1718,6 +1718,27 @@ void SITargetLowering::getTgtMemIntrinsic(SmallVectorImpl<IntrinsicInfo> &Infos,
Infos.push_back(Info);
return;
}
+ case Intrinsic::amdgcn_av_global_load_b128:
+ case Intrinsic::amdgcn_av_global_store_b128: {
+ bool IsStore = IntrID == Intrinsic::amdgcn_av_global_store_b128;
+ Info.opc = IsStore ? ISD::INTRINSIC_VOID : ISD::INTRINSIC_W_CHAIN;
+ Info.memVT = MVT::v4i32;
+ Info.ptrVal = CI.getArgOperand(0);
+ Info.flags |=
+ IsStore ? MachineMemOperand::MOStore : MachineMemOperand::MOLoad;
+ // Pretend to be atomic so that SIMemoryLegalizer::expandStore sets cache
+ // flags appropriately.
+ Info.order = AtomicOrdering::Monotonic;
+
+ LLVMContext &Ctx = CI.getContext();
+ unsigned ScopeIdx = CI.arg_size() - 1;
+ MDNode *ScopeMD = cast<MDNode>(
+ cast<MetadataAsValue>(CI.getArgOperand(ScopeIdx))->getMetadata());
+ StringRef Scope = cast<MDStri...
[truncated]
``````````
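
The Verifier.cpp hunk above accepts the scope argument only when it is a metadata node with exactly one operand and that operand is a string. The toy model below restates that shape check without any LLVM headers. `Operand`, `Node`, and `isValidScopeNode` are hypothetical stand-ins rather than LLVM types, and the sketch assumes nothing beyond what the hunk shows.

```cpp
#include <string>
#include <variant>
#include <vector>

// Toy stand-ins: an operand is either a string or "anything else".
struct OtherOperand {};
using Operand = std::variant<std::string, OtherOperand>;
using Node = std::vector<Operand>; // models a metadata node's operand list

// Same shape test as the verifier hunk: the node exists, has exactly one
// operand, and that operand is a string. The string's contents are not
// inspected here, just as the hunk does not inspect them.
bool isValidScopeNode(const Node *N) {
  return N != nullptr && N->size() == 1 &&
         std::holds_alternative<std::string>((*N)[0]);
}
```

Under this model a node like `!{!"agent"}` passes, while an empty node, a node with two strings, or a node whose single operand is not a string is rejected. Any single string passes this stage, so checking the string against the supported scope names would have to happen elsewhere.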
</details>
https://github.com/llvm/llvm-project/pull/191390