[llvm-branch-commits] [llvm] [AMDGPU] Add amdgcn.av.global.(load|store).b128 intrinsics (PR #191390)

Fri Apr 10 03:58:45 PDT 2026

llvmbot wrote:




@llvm/pr-subscribers-llvm-ir

Author: Sameer Sahasrabuddhe (ssahasra)

<details>
<summary>Changes</summary>

The two new intrinsica llvm.amdgcn.global.load.b128 and llvm.amdgcn.global.store.b128 have availability and visibility semantics as described in #191246. Each of them takes a scope operand that is then translated to target-specific cache policy bits. This allows the user to control how the side-effects of these loads and stores are made visible to other threads.

This is patch was extracted from #172090.

Co-authored-by: macurtis-amd <macurtis@amd.com>
Assisted-by: Claude Opus 4.6

---

Patch is 1.76 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/191390.diff


12 Files Affected:

- (modified) llvm/docs/AMDGPUUsage.rst (+105) 
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+25) 
- (modified) llvm/lib/IR/Verifier.cpp (+11) 
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3) 
- (modified) llvm/lib/Target/AMDGPU/FLATInstructions.td (+20-7) 
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+23) 
- (added) llvm/test/CodeGen/AMDGPU/amdgcn-av-scopes.ll (+741) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.av.global.load.b128.ll (+30869) 
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.av.global.store.b128.ll (+3888) 
- (added) llvm/test/CodeGen/AMDGPU/unsupported-av-global-load.ll (+22) 
- (added) llvm/test/CodeGen/AMDGPU/unsupported-av-global-store.ll (+22) 
- (added) llvm/test/Verifier/AMDGPU/intrinsics-av.ll (+15) 


``````````diff

diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 84db071445f53..bb7b43401f954 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1775,6 +1775,111 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                    * :ref:`Synchronization Scope<amdgpu-intrinsics-syncscope-metadata-operand>`.
                                                      Note that the scope used must ensure that the L2 cache will be hit.
 
+  llvm.amdgcn.av.global.load.b128                     This intrinsic is supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+                                                   Signature:
+
+                                                   .. code-block:: llvm
+
+                                                      <4 x i32> @llvm.amdgcn.av.global.load.b128(
+                                                          ptr addrspace(1), ; source
+                                                          metadata)         ; scope    - e.g. '!0' where '!0 = !{!"wavegroup"}'
+
+                                                   Reads the value from the source address with cache behavior specified by the scope.
+
+                                                   The following table shows the mapping between valid scope values and target
+                                                   instruction flags or field values.
+
+                                                   ============== ========================== ========================== ========================== ========================== ==========================
+                                                   targets        instruction                           ``"wavefront"``            ``"workgroup"``                ``"agent"``      ``""`` (empty string)
+                                                   ============== ========================== ========================== ========================== ========================== ==========================
+                                                   gfx90*         ``global_load_dwordx4``                                                                             ``glc``                    ``glc``
+
+                                                   gfx942, gfx950 ``global_load_dwordx4``                        (wave)            ``sc0`` (group)           ``sc1`` (device)       ``sc0 sc1`` (system)
+
+                                                   gfx10*         ``global_load_dwordx4``                                                  ``glc``                ``glc dlc``                ``glc dlc``
+
+                                                   gfx11*         ``global_load_dwordx4``                                                  ``glc``                    ``glc``                    ``glc``
+
+                                                   gfx120*        ``av_global_load_b128``                             (CU)    ``scope:SCOPE_SE`` (SE)  ``scope:SCOPE_DEV`` (DEV)  ``scope:SCOPE_SYS`` (SYS)
+
+                                                   gfx125*        ``av_global_load_b128``                             (CU)                             ``scope:SCOPE_DEV`` (DEV)  ``scope:SCOPE_SYS`` (SYS)
+                                                   ============== ========================== ========================== ========================== ========================== ==========================
+
+                                                   For gfx90*, see "GLC Bit Explained" in the appropriate instruction set reference
+                                                   (e.g. Chapter 9.1.10 in "AMD Instinct MI100" Instruction Set Architecture Reference
+                                                   Guide).
+
+                                                   For gfx942 and gfx950 targets, see "Memory Scope and Temporal Controls" in the
+                                                   appropriate instruction set reference (e.g. Chapter 9.1.10.2 in the "AMD Instinct
+                                                   MI300" Instruction Set Architecture Reference Guide).
+
+                                                   For gfx10* targets, see "GLC, DLC and SLC Bit Explained" in the appropriate
+                                                   instruction set reference (e.g. Chapter 8.1.10 in "RDNA 2" Instruction Set Architecture
+                                                   Reference Guide)
+
+                                                   For gfx11* targets, see "Cache Controls: SLC, GLC and DLC" in the appropriate
+                                                   instruction set reference (e.g. Chapter 4.1.1 in "RDNA3" Instruction Set Architecture
+                                                   Reference Guide).
+
+                                                   For gfx12* targets, see "Cache Controls: SCOPE and Temporal-Hint" in the
+                                                   appropriate instruction set reference (e.g. Chapter 4.1.1 in the "RDNA4"
+                                                   Instruction Set Architecture Reference Guide).
+
+
+  llvm.amdgcn.av.global.store.b128                    This intrinsic is supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+                                                   Signature:
+
+                                                   .. code-block:: llvm
+
+                                                      void @llvm.amdgcn.av.global.store.b128(
+                                                          ptr addrspace(1), ; destination
+                                                          <4 x i32>,        ; value
+                                                          metadata)         ; scope    - e.g. '!0' where '!0 = !{!"wavegroup"}'
+
+                                                   Writes the value to the destination address with cache
+                                                   behavior specified by the scope.
+
+                                                   The following table shows the mapping between valid scope values and target
+                                                   instruction flags or field values.
+
+                                                   ============== ========================== ========================== ========================== ========================== ==========================
+                                                   targets        instruction                           ``"wavefront"``            ``"workgroup"``                ``"agent"``      ``""`` (empty string)
+                                                   ============== ========================== ========================== ========================== ========================== ==========================
+                                                   gfx90*         ``global_store_dwordx4``
+
+                                                   gfx942, gfx950 ``global_store_dwordx4``                       (wave)            ``sc0`` (group)           ``sc1`` (device)       ``sc0 sc1`` (system)
+
+                                                   gfx10*         ``global_store_dwordx4``
+
+                                                   gfx11*         ``global_store_dwordx4``
+
+                                                   gfx120*        ``av_global_store_b128``                            (CU)    ``scope:SCOPE_SE`` (SE)  ``scope:SCOPE_DEV`` (DEV)  ``scope:SCOPE_SYS`` (SYS)
+
+                                                   gfx125*        ``av_global_store_b128``                            (CU)                             ``scope:SCOPE_DEV`` (DEV)  ``scope:SCOPE_SYS`` (SYS)
+                                                   ============== ========================== ========================== ========================== ========================== ==========================
+
+                                                   For gfx90*, see "GLC Bit Explained" in the appropriate instruction set reference
+                                                   (e.g. Chapter 9.1.10 in "AMD Instinct MI100" Instruction Set Architecture Reference
+                                                   Guide).
+
+                                                   For gfx942 and gfx950 targets, see "Memory Scope and Temporal Controls" in the
+                                                   appropriate instruction set reference (e.g. Chapter 9.1.10.2 in the "AMD Instinct
+                                                   MI300" Instruction Set Architecture Reference Guide).
+
+                                                   For gfx10* targets, see "GLC, DLC and SLC Bit Explained" in the appropriate
+                                                   instruction set reference (e.g. Chapter 8.1.10 in "RDNA 2" Instruction Set
+                                                   Architecture Reference Guide)
+
+                                                   For gfx11* targets, see "Cache Controls: SLC, GLC and DLC" in the appropriate
+                                                   instruction set reference (e.g. Chapter 4.1.1 in "RDNA3" Instruction Set
+                                                   Architecture Reference Guide).
+
+                                                   For gfx12* targets, see "Cache Controls: SCOPE and Temporal-Hint" in the
+                                                   appropriate instruction set reference (e.g. Chapter 4.1.1 in the "RDNA4"
+                                                   Instruction Set Architecture Reference Guide).
+
   ==============================================   ==========================================================
 
 .. TODO::
diff --git a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
index 9528fb2b446bc..9fd2fee7c82fe 100644
--- a/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
+++ b/llvm/include/llvm/IR/IntrinsicsAMDGPU.td
@@ -902,6 +902,31 @@ def int_amdgcn_bitop3 :
                         [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty],
                         [ImmArg<ArgIndex<3>>]>;
 
+class AMDGPUAVGlobalStore : Intrinsic <
+  [],
+  [global_ptr_ty,          // Base global pointer to store to
+   llvm_v4i32_ty,          // Data to store
+   llvm_metadata_ty],      // Scope
+  [ IntrWriteMem, WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>,
+    NonNull<ArgIndex<0>>, IntrWillReturn, IntrNoCallback, IntrNoFree ],
+  "",
+  [SDNPMemOperand, SDNPMayStore]
+>;
+
+def int_amdgcn_av_global_store_b128 : AMDGPUAVGlobalStore;
+
+class AMDGPUAVGlobalLoad : Intrinsic <
+  [llvm_v4i32_ty],
+  [global_ptr_ty,          // Base global pointer to load from
+   llvm_metadata_ty],      // Scope
+  [ IntrReadMem, ReadOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>,
+    NonNull<ArgIndex<0>>, IntrWillReturn, IntrNoCallback, IntrNoFree ],
+  "",
+  [SDNPMemOperand, SDNPMayLoad]
+>;
+
+def int_amdgcn_av_global_load_b128 : AMDGPUAVGlobalLoad;
+
 } // TargetPrefix = "amdgcn"
 
 // New-style image intrinsics
diff --git a/llvm/lib/IR/Verifier.cpp b/llvm/lib/IR/Verifier.cpp
index a86e8fdb7d73a..13294a6bb4bbc 100644
--- a/llvm/lib/IR/Verifier.cpp
+++ b/llvm/lib/IR/Verifier.cpp
@@ -7239,6 +7239,17 @@ void Verifier::visitIntrinsicCall(Intrinsic::ID ID, CallBase &Call) {
           &Call, Op);
     break;
   }
+  case Intrinsic::amdgcn_av_global_load_b128:
+  case Intrinsic::amdgcn_av_global_store_b128: {
+    // Last argument must be a MD string
+    auto *Op = cast<MetadataAsValue>(Call.getArgOperand(Call.arg_size() - 1));
+    auto *MD = dyn_cast<MDNode>(Op->getMetadata());
+    Check(MD && (MD->getNumOperands() == 1) && isa<MDString>(MD->getOperand(0)),
+          "global load/store intrinsics require that the last argument is a "
+          "metadata string",
+          &Call, Op);
+    break;
+  }
   case Intrinsic::nvvm_setmaxnreg_inc_sync_aligned_u32:
   case Intrinsic::nvvm_setmaxnreg_dec_sync_aligned_u32: {
     Value *V = Call.getArgOperand(0);
diff --git a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
index ebdd709c34f08..da19a8c14f197 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp
@@ -5668,6 +5668,9 @@ AMDGPURegisterBankInfo::getInstrMapping(const MachineInstr &MI) const {
     case Intrinsic::amdgcn_flat_prefetch:
     case Intrinsic::amdgcn_global_prefetch:
       return getDefaultMappingVOP(MI);
+    case Intrinsic::amdgcn_av_global_load_b128:
+    case Intrinsic::amdgcn_av_global_store_b128:
+      return getDefaultMappingAllVGPR(MI);
     default:
       return getInvalidInstructionMapping();
     }
diff --git a/llvm/lib/Target/AMDGPU/FLATInstructions.td b/llvm/lib/Target/AMDGPU/FLATInstructions.td
index c0fb73df9c764..c156ec148de76 100644
--- a/llvm/lib/Target/AMDGPU/FLATInstructions.td
+++ b/llvm/lib/Target/AMDGPU/FLATInstructions.td
@@ -1827,6 +1827,19 @@ multiclass GlobalFLATStorePats<FLAT_Pseudo inst, SDPatternOperator node,
   }
 }
 
+def av_global_load_b128_intrin_pat : PatFrag<
+  (ops node:$ptr),
+  (int_amdgcn_av_global_load_b128 $ptr, srcvalue)>;
+
+def av_global_store_b128_intrin_pat : PatFrag<
+  (ops node:$data, node:$ptr),
+  (int_amdgcn_av_global_store_b128 $ptr, $data, srcvalue)>;
+
+let SubtargetPredicate = HasFlatGlobalInsts in {
+defm : GlobalFLATLoadPats <GLOBAL_LOAD_DWORDX4, av_global_load_b128_intrin_pat, v4i32>;
+defm : GlobalFLATStorePats <GLOBAL_STORE_DWORDX4, av_global_store_b128_intrin_pat, v4i32>;
+}
+
 multiclass GlobalFLATStorePats_D16_t16<string inst, SDPatternOperator node, ValueType vt> {
   def : FlatStoreSignedPat<!cast<FLAT_Pseudo>(inst#"_t16"), node, vt> {
     let AddedComplexity = 10;
@@ -3326,13 +3339,13 @@ defm GLOBAL_LOAD_SSHORT         : GLOBAL_Real_AllAddr_gfx11<0x013, "global_load_
 defm GLOBAL_LOAD_DWORD          : GLOBAL_Real_AllAddr_gfx11<0x014, "global_load_b32">;
 defm GLOBAL_LOAD_DWORDX2        : GLOBAL_Real_AllAddr_gfx11<0x015, "global_load_b64">;
 defm GLOBAL_LOAD_DWORDX3        : GLOBAL_Real_AllAddr_gfx11<0x016, "global_load_b96">;
-defm GLOBAL_LOAD_DWORDX4        : GLOBAL_Real_AllAddr_gfx11<0x017, "global_load_b128">;
+defm GLOBAL_LOAD_DWORDX4        : GLOBAL_Real_AllAddr_gfx11<0x017, "av_global_load_b128">;
 defm GLOBAL_STORE_BYTE          : GLOBAL_Real_AllAddr_gfx11<0x018, "global_store_b8">;
 defm GLOBAL_STORE_SHORT         : GLOBAL_Real_AllAddr_gfx11<0x019, "global_store_b16">;
 defm GLOBAL_STORE_DWORD         : GLOBAL_Real_AllAddr_gfx11<0x01a, "global_store_b32">;
 defm GLOBAL_STORE_DWORDX2       : GLOBAL_Real_AllAddr_gfx11<0x01b, "global_store_b64">;
 defm GLOBAL_STORE_DWORDX3       : GLOBAL_Real_AllAddr_gfx11<0x01c, "global_store_b96">;
-defm GLOBAL_STORE_DWORDX4       : GLOBAL_Real_AllAddr_gfx11<0x01d, "global_store_b128">;
+defm GLOBAL_STORE_DWORDX4       : GLOBAL_Real_AllAddr_gfx11<0x01d, "av_global_store_b128">;
 defm GLOBAL_LOAD_UBYTE_D16      : GLOBAL_Real_AllAddr_gfx11<0x01e, "global_load_d16_u8">;
 defm GLOBAL_LOAD_SBYTE_D16      : GLOBAL_Real_AllAddr_gfx11<0x01f, "global_load_d16_i8">;
 defm GLOBAL_LOAD_SHORT_D16      : GLOBAL_Real_AllAddr_gfx11<0x020, "global_load_d16_b16">;
@@ -3535,13 +3548,13 @@ defm GLOBAL_LOAD_SSHORT            : VFLAT_Real_AllAddr_gfx12<0x013, "global_loa
 defm GLOBAL_LOAD_DWORD             : VFLAT_Real_AllAddr_gfx12<0x014, "global_load_b32">;
 defm GLOBAL_LOAD_DWORDX2           : VFLAT_Real_AllAddr_gfx12<0x015, "global_load_b64">;
 defm GLOBAL_LOAD_DWORDX3           : VFLAT_Real_AllAddr_gfx12<0x016, "global_load_b96">;
-defm GLOBAL_LOAD_DWORDX4           : VFLAT_Real_AllAddr_gfx12<0x017, "global_load_b128">;
+defm GLOBAL_LOAD_DWORDX4           : VFLAT_Real_AllAddr_gfx12<0x017, "av_global_load_b128">;
 defm GLOBAL_STORE_BYTE             : VFLAT_Real_AllAddr_gfx12<0x018, "global_store_b8">;
 defm GLOBAL_STORE_SHORT            : VFLAT_Real_AllAddr_gfx12<0x019, "global_store_b16">;
 defm GLOBAL_STORE_DWORD            : VFLAT_Real_AllAddr_gfx12<0x01a, "global_store_b32">;
 defm GLOBAL_STORE_DWORDX2          : VFLAT_Real_AllAddr_gfx12<0x01b, "global_store_b64">;
 defm GLOBAL_STORE_DWORDX3          : VFLAT_Real_AllAddr_gfx12<0x01c, "global_store_b96">;
-defm GLOBAL_STORE_DWORDX4          : VFLAT_Real_AllAddr_gfx12<0x01d, "global_store_b128">;
+defm GLOBAL_STORE_DWORDX4          : VFLAT_Real_AllAddr_gfx12<0x01d, "av_global_store_b128">;
 defm GLOBAL_LOAD_UBYTE_D16         : VFLAT_Real_AllAddr_gfx12<0x01e, "global_load_d16_u8">;
 defm GLOBAL_LOAD_SBYTE_D16         : VFLAT_Real_AllAddr_gfx12<0x01f, "global_load_d16_i8">;
 defm GLOBAL_LOAD_SHORT_D16         : VFLAT_Real_AllAddr_gfx12<0x020, "global_load_d16_b16">;
@@ -3709,7 +3722,7 @@ defm GLOBAL_LOAD_TR4_B64              : VFLAT_Real_AllAddr_gfx1250<0x073>;
 defm GLOBAL_LOAD_TR6_B96              : VFLAT_Real_AllAddr_gfx1250<0x074>;
 
 // Additional aliases for global load transpose instructions.
-def : MnemonicAlias<"global_load_b128_tr_b16", "global_load_tr16_b128">, Requires<[isGFX125xOnly]>;
+def : MnemonicAlias<"av_global_load_b128_tr_b16", "global_load_tr16_b128">, Requires<[isGFX125xOnly]>;
 def : MnemonicAlias<"global_load_b64_tr_b8", "global_load_tr8_b64">, Requires<[isGFX125xOnly]>;
 def : MnemonicAlias<"global_load_b64_tr_b4", "global_load_tr4_b64">, Requires<[isGFX125xOnly]>;
 def : MnemonicAlias<"global_load_b96_tr_b6", "global_load_tr6_b96">, Requires<[isGFX125xOnly]>;
@@ -3852,7 +3865,7 @@ defm GLOBAL_LOAD_USHORT               : VFLAT_Real_AllAddr_gfx13<0x0a, "global_l
 defm GLOBAL_LOAD_SSHORT               : VFLAT_Real_AllAddr_gfx13<0x0b, "global_load_i16">;
 defm GLOBAL_LOAD_DWORD                : VFLAT_Real_AllAddr_gfx13<0x0c, "global_load_b32">;
 defm GLOBAL_LOAD_DWORDX2              : VFLAT_Real_AllAddr_gfx13<0x0d, "global_load_b64">;
-defm GLOBAL_LOAD_DWORDX4              : VFLAT_Real_AllAddr_gfx13<0x0e, "global_load_b128">;
+defm GLOBAL_LOAD_DWORDX4              : VFLAT_Real_AllAddr_gfx13<0x0e, "av_global_load_b128">;
 defm GLOBAL_LOAD_DWORDX3              : VFLAT_Real_AllAddr_gfx13<0x0f, "global_load_b96">;
 defm GLOBAL_LOAD_DWORD_ADDTID         : VFLAT_Real_AllAddr_gfx13<0x16, "global_load_addtid_b32">;
 defm GLOBAL_STORE_DWORD_ADDTID        : VFLAT_Real_AllAddr_gfx13<0x17, "global_store_addtid_b32">;
@@ -3862,7 +3875,7 @@ defm GLOBAL_STORE_SHORT               : VFLAT_Real_AllAddr_gfx13<0x1a, "global_s
 defm GLOBAL_STORE_SHORT_D16_HI        : VFLAT_Real_AllAddr_gfx13<0x1b, "global_store_d16_hi_b16">;
 defm GLOBAL_STORE_DWORD               : VFLAT_Real_AllAddr_gfx13<0x1c, "global_store_b32">;
 defm GLOBAL_STORE_DWORDX2             : VFLAT_Real_AllAddr_gfx13<0x1d, "global_store_b64">;
-defm GLOBAL_STORE_DWORDX4             : VFLAT_Real_AllAddr_gfx13<0x1e, "global_store_b128">;
+defm GLOBAL_STORE_DWORDX4             : VFLAT_Real_AllAddr_gfx13<0x1e, "av_global_store_b128">;
 defm GLOBAL_STORE_DWORDX3             : VFLAT_Real_AllAddr_gfx13<0x1f, "global_store_b96">;
 defm GLOBAL_LOAD_UBYTE_D16            : VFLAT_Real_AllAddr_gfx13<0x20, "global_load_d16_u8">;
 defm GLOBAL_LOAD_UBYTE_D16_HI         : VFLAT_Real_AllAddr_gfx13<0x21, "global_load_d16_hi_u8">;
diff --git a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
index 6750dfcbaac62..2393f2aa47244 100644
--- a/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
+++ b/llvm/lib/Target/AMDGPU/SIISelLowering.cpp
@@ -1718,6 +1718,27 @@ void SITargetLowering::getTgtMemIntrinsic(SmallVectorImpl<IntrinsicInfo> &Infos,
     Infos.push_back(Info);
     return;
   }
+  case Intrinsic::amdgcn_av_global_load_b128:
+  case Intrinsic::amdgcn_av_global_store_b128: {
+    bool IsStore = IntrID == Intrinsic::amdgcn_av_global_store_b128;
+    Info.opc = IsStore ? ISD::INTRINSIC_VOID : ISD::INTRINSIC_W_CHAIN;
+    Info.memVT = MVT::v4i32;
+    Info.ptrVal = CI.getArgOperand(0);
+    Info.flags |=
+        IsStore ? MachineMemOperand::MOStore : MachineMemOperand::MOLoad;
+    // Pretend to be atomic so that SIMemoryLegalizer::expandStore sets cache
+    // flags appropriately.
+    Info.order = AtomicOrdering::Monotonic;
+
+    LLVMContext &Ctx = CI.getContext();
+    unsigned ScopeIdx = CI.arg_size() - 1;
+    MDNode *ScopeMD = cast<MDNode>(
+        cast<MetadataAsValue>(CI.getArgOperand(ScopeIdx))->getMetadata());
+    StringRef Scope = cast<MDStri...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/191390