[clang] [llvm] AMDGPU: Add builtin/intrinsic global_(load|store)_b128 (PR #172090)
via llvm-commits
llvm-commits at lists.llvm.org
Fri Dec 12 13:59:17 PST 2025
llvmbot wrote:
@llvm/pr-subscribers-llvm-ir
Author: None (macurtis-amd)
<details>
<summary>Changes</summary>
Add clang builtins and the associated LLVM intrinsics for scoped 128-bit loads and stores.
New builtins:
1. `__builtin_amdgcn_global_load_b128` ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/docs/LanguageExtensions.rst#__builtin_amdgcn_global_load_b128-and-__builtin_amdgcn_global_store_b128), [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl))
2. `__builtin_amdgcn_global_store_b128` ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/docs/LanguageExtensions.rst#__builtin_amdgcn_global_load_b128-and-__builtin_amdgcn_global_store_b128), [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl))
And corresponding intrinsics:
1. `llvm.amdgcn.global.load.b128` ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/docs/AMDGPUUsage.rst) - search for intrinsic name, [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.b128.ll) )
2. `llvm.amdgcn.global.store.b128` ([documentation](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/docs/AMDGPUUsage.rst) - search for intrinsic name, [test/examples](https://github.com/macurtis-amd/llvm-project/blob/global-load-store-b128/llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.store.b128.ll) )
These will initially be used by [RCCL](https://github.com/ROCm/rccl) to address some low-level performance issues.
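As a rough illustration of how the two builtins listed above might be called, here is a minimal sketch that copies one 128-bit value with `"agent"` scope. The helper name `copy16`, the `v4u_ptr` typedef, and the `HAVE_GLOBAL_B128_BUILTINS` macro are hypothetical and not part of the patch. The plain-copy fallback exists only so the sketch compiles on a host toolchain, and the device path is an assumption based on the documented signatures and has not been validated against the patch.

```cpp
#include <cassert>

// 4 x 32-bit unsigned vector, mirroring the v4u typedef in the builtin docs.
typedef unsigned int v4u
    __attribute__((__vector_size__(4 * sizeof(unsigned int))));

// Use the builtins only when compiling for AMDGPU with a compiler that
// actually provides them (assumption: the patch has been applied).
#if defined(__AMDGCN__) && defined(__has_builtin)
#if __has_builtin(__builtin_amdgcn_global_load_b128) && \
    __has_builtin(__builtin_amdgcn_global_store_b128)
#define HAVE_GLOBAL_B128_BUILTINS 1
#endif
#endif

#ifdef HAVE_GLOBAL_B128_BUILTINS
// Global (address space 1) pointer, as the builtin signatures require.
typedef v4u __attribute__((address_space(1))) *v4u_ptr;

// Hypothetical helper: 128-bit load followed by a 128-bit store, both with
// "agent" scope. The scope argument must be a string literal.
inline void copy16(v4u_ptr dst, v4u_ptr src) {
  v4u data = __builtin_amdgcn_global_load_b128(src, "agent");
  __builtin_amdgcn_global_store_b128(dst, data, "agent");
}
#else
// Host fallback: an ordinary vector copy with no scope control.
typedef v4u *v4u_ptr;

inline void copy16(v4u_ptr dst, v4u_ptr src) { *dst = *src; }
#endif
```

On a host compiler only the fallback branch is compiled, so the sketch shows the call shape rather than the cache behavior.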
---
Patch is 1.74 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/172090.diff
23 Files Affected:
- (modified) clang/docs/LanguageExtensions.rst (+37)
- (modified) clang/include/clang/Basic/BuiltinsAMDGPU.def (+3)
- (modified) clang/include/clang/Sema/SemaAMDGPU.h (+2)
- (modified) clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp (+20)
- (modified) clang/lib/Sema/SemaAMDGPU.cpp (+16)
- (added) clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl (+113)
- (added) clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl (+22)
- (added) clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl (+26)
- (modified) llvm/docs/AMDGPUUsage.rst (+106)
- (modified) llvm/include/llvm/CodeGen/GlobalISel/GIMatchTableExecutor.h (+6)
- (modified) llvm/include/llvm/CodeGen/GlobalISel/GIMatchTableExecutorImpl.h (+9)
- (modified) llvm/include/llvm/IR/IntrinsicsAMDGPU.td (+25)
- (modified) llvm/lib/IR/Verifier.cpp (+30-3)
- (modified) llvm/lib/Target/AMDGPU/AMDGPURegisterBankInfo.cpp (+3)
- (modified) llvm/lib/Target/AMDGPU/FLATInstructions.td (+15)
- (modified) llvm/lib/Target/AMDGPU/SIISelLowering.cpp (+22)
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.load.b128.ll (+30869)
- (added) llvm/test/CodeGen/AMDGPU/llvm.amdgcn.global.store.b128.ll (+3888)
- (added) llvm/test/CodeGen/AMDGPU/unsupported-global-load.ll (+36)
- (added) llvm/test/CodeGen/AMDGPU/unsupported-global-store.ll (+36)
- (added) llvm/test/Verifier/amdgpu-intrinsics.ll (+66)
- (modified) llvm/utils/TableGen/Common/GlobalISel/GlobalISelMatchTable.cpp (+17)
- (modified) llvm/utils/TableGen/Common/GlobalISel/GlobalISelMatchTable.h (+18)
``````````diff
diff --git a/clang/docs/LanguageExtensions.rst b/clang/docs/LanguageExtensions.rst
index c4b86b203d383..4d4d6ca3fe0bd 100644
--- a/clang/docs/LanguageExtensions.rst
+++ b/clang/docs/LanguageExtensions.rst
@@ -5243,6 +5243,43 @@ returns the bit at the position of the current lane. It is almost equivalent to
``(mask & (1 << lane_id)) != 0``, except that its behavior is only defined if
the given mask has the same value for all active lanes of the current wave.
+
+__builtin_amdgcn_global_load_b128 and __builtin_amdgcn_global_store_b128
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Signature:
+
+.. code-block:: c
+
+ typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int v4u;
+ typedef v4u __attribute__((address_space(1))) *global_ptr_to_v4u;
+
+ v4u __builtin_amdgcn_global_load_b128(
+ v4u __attribute__((address_space(1))) *src,
+ const char *scope);
+
+ void __builtin_amdgcn_global_store_b128(
+ v4u __attribute__((address_space(1))) *dst,
+ v4u data,
+ const char *scope);
+
+Load or store a vector of 4 unsigned integers from or to global memory with
+cache behavior specified by `scope` which must be a string literal.
+
+Valid values for `scope` are:
+
+* ``"wavefront"``
+* ``"workgroup"``
+* ``"agent"``
+* ``""`` (empty string)
+
+These builtins are supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+They map to the llvm intrinsics ``llvm.amdgcn.global.load.b128`` and
+``llvm.amdgcn.global.store.b128`` documented in `User Guide for AMDGPU Backend
+<https://llvm.org/docs/AMDGPUUsage.html>`_.
+
+
ARM/AArch64 Language Extensions
-------------------------------
diff --git a/clang/include/clang/Basic/BuiltinsAMDGPU.def b/clang/include/clang/Basic/BuiltinsAMDGPU.def
index a867144d83928..4bc5b1c16f2ad 100644
--- a/clang/include/clang/Basic/BuiltinsAMDGPU.def
+++ b/clang/include/clang/Basic/BuiltinsAMDGPU.def
@@ -190,6 +190,9 @@ TARGET_BUILTIN(__builtin_amdgcn_raw_ptr_buffer_atomic_fmax_f64, "ddQbiiIi", "",
TARGET_BUILTIN(__builtin_amdgcn_raw_ptr_buffer_load_lds, "vQbv*3IUiiiIiIi", "", "vmem-to-lds-load-insts")
TARGET_BUILTIN(__builtin_amdgcn_struct_ptr_buffer_load_lds, "vQbv*3IUiiiiIiIi", "", "vmem-to-lds-load-insts")
+TARGET_BUILTIN(__builtin_amdgcn_global_load_b128, "V4UiV4Ui*1cC*", "n", "gfx9-insts")
+TARGET_BUILTIN(__builtin_amdgcn_global_store_b128, "vV4Ui*1V4UicC*", "n", "gfx9-insts")
+
//===----------------------------------------------------------------------===//
// Ballot builtins.
//===----------------------------------------------------------------------===//
diff --git a/clang/include/clang/Sema/SemaAMDGPU.h b/clang/include/clang/Sema/SemaAMDGPU.h
index bac812a9d4fcf..556bfb705de67 100644
--- a/clang/include/clang/Sema/SemaAMDGPU.h
+++ b/clang/include/clang/Sema/SemaAMDGPU.h
@@ -28,6 +28,8 @@ class SemaAMDGPU : public SemaBase {
bool checkCoopAtomicFunctionCall(CallExpr *TheCall, bool IsStore);
+ bool checkScopedMemAccessFunctionCall(CallExpr *TheCall);
+
bool checkMovDPPFunctionCall(CallExpr *TheCall, unsigned NumArgs,
unsigned NumDataArgs);
diff --git a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
index eabdc370da6b4..384f76e092252 100644
--- a/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
+++ b/clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp
@@ -885,6 +885,26 @@ Value *CodeGenFunction::EmitAMDGPUBuiltinExpr(unsigned BuiltinID,
llvm::Function *F = CGM.getIntrinsic(IID, {Args[0]->getType()});
return Builder.CreateCall(F, {Args});
}
+ case AMDGPU::BI__builtin_amdgcn_global_load_b128:
+ case AMDGPU::BI__builtin_amdgcn_global_store_b128: {
+ const bool IsStore =
+ BuiltinID == AMDGPU::BI__builtin_amdgcn_global_store_b128;
+ LLVMContext &Ctx = CGM.getLLVMContext();
+ SmallVector<Value *, 5> Args = {EmitScalarExpr(E->getArg(0))}; // addr
+ if (IsStore)
+ Args.push_back(EmitScalarExpr(E->getArg(1))); // data
+ const unsigned ScopeIdx = E->getNumArgs() - 1;
+ StringRef ScopeLit =
+ cast<StringLiteral>(E->getArg(ScopeIdx)->IgnoreParenCasts())
+ ->getString();
+ llvm::MDNode *MD =
+ llvm::MDNode::get(Ctx, {llvm::MDString::get(Ctx, ScopeLit)});
+ Args.push_back(llvm::MetadataAsValue::get(Ctx, MD)); // scope
+ llvm::Function *F =
+ CGM.getIntrinsic(IsStore ? Intrinsic::amdgcn_global_store_b128
+ : Intrinsic::amdgcn_global_load_b128);
+ return Builder.CreateCall(F, Args);
+ }
case AMDGPU::BI__builtin_amdgcn_get_fpenv: {
Function *F = CGM.getIntrinsic(Intrinsic::get_fpenv,
{llvm::Type::getInt64Ty(getLLVMContext())});
diff --git a/clang/lib/Sema/SemaAMDGPU.cpp b/clang/lib/Sema/SemaAMDGPU.cpp
index cece22092bb14..72c7bf03f93ad 100644
--- a/clang/lib/Sema/SemaAMDGPU.cpp
+++ b/clang/lib/Sema/SemaAMDGPU.cpp
@@ -255,6 +255,9 @@ bool SemaAMDGPU::CheckAMDGCNBuiltinFunctionCall(unsigned BuiltinID,
(SemaRef.BuiltinConstantArg(TheCall, ArgCount, Result)) ||
(SemaRef.BuiltinConstantArg(TheCall, (ArgCount - 1), Result));
}
+ case AMDGPU::BI__builtin_amdgcn_global_load_b128:
+ case AMDGPU::BI__builtin_amdgcn_global_store_b128:
+ return checkScopedMemAccessFunctionCall(TheCall);
default:
return false;
}
@@ -344,6 +347,19 @@ bool SemaAMDGPU::checkCoopAtomicFunctionCall(CallExpr *TheCall, bool IsStore) {
return Fail;
}
+bool SemaAMDGPU::checkScopedMemAccessFunctionCall(CallExpr *TheCall) {
+ bool Fail = false;
+ // Last argument is a string literal
+ Expr *Arg = TheCall->getArg(TheCall->getNumArgs() - 1);
+ auto Scope = dyn_cast<StringLiteral>(Arg->IgnoreParenCasts());
+ if (!Scope) {
+ Fail = true;
+ Diag(TheCall->getBeginLoc(), diag::err_expr_not_string_literal)
+ << Arg->getSourceRange();
+ }
+ return Fail;
+}
+
bool SemaAMDGPU::checkMovDPPFunctionCall(CallExpr *TheCall, unsigned NumArgs,
unsigned NumDataArgs) {
assert(NumDataArgs <= 2);
diff --git a/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl
new file mode 100644
index 0000000000000..7ffceead747e8
--- /dev/null
+++ b/clang/test/CodeGenOpenCL/builtins-amdgcn-global-load-store.cl
@@ -0,0 +1,113 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --check-globals smart
+// REQUIRES: amdgpu-registered-target
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx950 -emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX950
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx9-4-generic -emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX9_4_GENERIC
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx1250 -emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX1250
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx12-generic -emit-llvm -o - %s | FileCheck %s -check-prefixes=GFX,GFX12_GENERIC
+
+typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int v4u32;
+typedef v4u32 __global *global_ptr_to_v4u32;
+
+//------------------------------------------------------------------------------
+// Store
+//------------------------------------------------------------------------------
+// GFX-LABEL: @test_amdgcn_global_store_b128_00(
+// GFX-NEXT: entry:
+// GFX-NEXT: tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) [[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META4:![0-9]+]])
+// GFX-NEXT: ret void
+//
+void test_amdgcn_global_store_b128_00(global_ptr_to_v4u32 ptr, v4u32 data) {
+ __builtin_amdgcn_global_store_b128(ptr, data, "wavefront");
+}
+
+// GFX-LABEL: @test_amdgcn_global_store_b128_01(
+// GFX-NEXT: entry:
+// GFX-NEXT: tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) [[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META5:![0-9]+]])
+// GFX-NEXT: ret void
+//
+void test_amdgcn_global_store_b128_01(global_ptr_to_v4u32 ptr, v4u32 data) {
+ __builtin_amdgcn_global_store_b128(ptr, data, "workgroup");
+}
+
+// GFX-LABEL: @test_amdgcn_global_store_b128_10(
+// GFX-NEXT: entry:
+// GFX-NEXT: tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) [[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META6:![0-9]+]])
+// GFX-NEXT: ret void
+//
+void test_amdgcn_global_store_b128_10(global_ptr_to_v4u32 ptr, v4u32 data) {
+ __builtin_amdgcn_global_store_b128(ptr, data, "agent");
+}
+
+// GFX-LABEL: @test_amdgcn_global_store_b128_11(
+// GFX-NEXT: entry:
+// GFX-NEXT: tail call void @llvm.amdgcn.global.store.b128(ptr addrspace(1) [[PTR:%.*]], <4 x i32> [[DATA:%.*]], metadata [[META7:![0-9]+]])
+// GFX-NEXT: ret void
+//
+void test_amdgcn_global_store_b128_11(global_ptr_to_v4u32 ptr, v4u32 data) {
+ __builtin_amdgcn_global_store_b128(ptr, data, "");
+}
+
+//------------------------------------------------------------------------------
+// Load
+//------------------------------------------------------------------------------
+// GFX-LABEL: @test_amdgcn_global_load_b128_00(
+// GFX-NEXT: entry:
+// GFX-NEXT: [[TMP0:%.*]] = tail call <4 x i32> @llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META4]])
+// GFX-NEXT: ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_00(global_ptr_to_v4u32 ptr) {
+ return __builtin_amdgcn_global_load_b128(ptr, "wavefront");
+}
+
+// GFX-LABEL: @test_amdgcn_global_load_b128_01(
+// GFX-NEXT: entry:
+// GFX-NEXT: [[TMP0:%.*]] = tail call <4 x i32> @llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META5]])
+// GFX-NEXT: ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_01(global_ptr_to_v4u32 ptr) {
+ return __builtin_amdgcn_global_load_b128(ptr, "workgroup");
+}
+
+// GFX-LABEL: @test_amdgcn_global_load_b128_10(
+// GFX-NEXT: entry:
+// GFX-NEXT: [[TMP0:%.*]] = tail call <4 x i32> @llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META6]])
+// GFX-NEXT: ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_10(global_ptr_to_v4u32 ptr) {
+ return __builtin_amdgcn_global_load_b128(ptr, "agent");
+}
+
+// GFX-LABEL: @test_amdgcn_global_load_b128_11(
+// GFX-NEXT: entry:
+// GFX-NEXT: [[TMP0:%.*]] = tail call <4 x i32> @llvm.amdgcn.global.load.b128(ptr addrspace(1) [[PTR:%.*]], metadata [[META7]])
+// GFX-NEXT: ret <4 x i32> [[TMP0]]
+//
+v4u32 test_amdgcn_global_load_b128_11(global_ptr_to_v4u32 ptr) {
+ return __builtin_amdgcn_global_load_b128(ptr, "");
+}
+//.
+// GFX950: [[META4]] = !{!"wavefront"}
+// GFX950: [[META5]] = !{!"workgroup"}
+// GFX950: [[META6]] = !{!"agent"}
+// GFX950: [[META7]] = !{!""}
+//.
+// GFX9_4_GENERIC: [[META4]] = !{!"wavefront"}
+// GFX9_4_GENERIC: [[META5]] = !{!"workgroup"}
+// GFX9_4_GENERIC: [[META6]] = !{!"agent"}
+// GFX9_4_GENERIC: [[META7]] = !{!""}
+//.
+// GFX1250: [[META4]] = !{!"wavefront"}
+// GFX1250: [[META5]] = !{!"workgroup"}
+// GFX1250: [[META6]] = !{!"agent"}
+// GFX1250: [[META7]] = !{!""}
+//.
+// GFX12_GENERIC: [[META4]] = !{!"wavefront"}
+// GFX12_GENERIC: [[META5]] = !{!"workgroup"}
+// GFX12_GENERIC: [[META6]] = !{!"agent"}
+// GFX12_GENERIC: [[META7]] = !{!""}
+//.
+//// NOTE: These prefixes are unused and the list is autogenerated. Do not add tests below this line:
+// GFX1250: {{.*}}
+// GFX12_GENERIC: {{.*}}
+// GFX950: {{.*}}
+// GFX9_4_GENERIC: {{.*}}
diff --git a/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl
new file mode 100644
index 0000000000000..b21b604baa944
--- /dev/null
+++ b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-error.cl
@@ -0,0 +1,22 @@
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx950 -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx9-4-generic -S -verify -o - %s
+// REQUIRES: amdgpu-registered-target
+
+typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int v4u32;
+typedef v4u32 __global *global_ptr_to_v4u32;
+
+void test_amdgcn_global_store_b128_00(v4u32 *ptr, v4u32 data, const char* scope) {
+ __builtin_amdgcn_global_store_b128(ptr, data, ""); //expected-error{{passing '__private v4u32 *__private' to parameter of type '__attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int __global *' changes address space of pointer}}
+}
+
+void test_amdgcn_global_store_b128_01(global_ptr_to_v4u32 ptr, v4u32 data, const char* scope) {
+ __builtin_amdgcn_global_store_b128(ptr, data, scope); //expected-error{{expression is not a string literal}}
+}
+
+v4u32 test_amdgcn_global_load_b128_00(v4u32 *ptr, const char* scope) {
+ return __builtin_amdgcn_global_load_b128(ptr, ""); //expected-error{{passing '__private v4u32 *__private' to parameter of type '__attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int __global *' changes address space of pointer}}
+}
+
+v4u32 test_amdgcn_global_load_b128_01(global_ptr_to_v4u32 ptr, const char* scope) {
+ return __builtin_amdgcn_global_load_b128(ptr, scope); //expected-error{{expression is not a string literal}}
+}
diff --git a/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl
new file mode 100644
index 0000000000000..ec357c58ef903
--- /dev/null
+++ b/clang/test/SemaOpenCL/builtins-amdgcn-global-load-store-target-error.cl
@@ -0,0 +1,26 @@
+// We test loads and stores separately because clang only seems to exit after
+// the first 'target feature' error.
+
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx602 -DTEST_LOAD -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx705 -DTEST_LOAD -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx810 -DTEST_LOAD -S -verify -o - %s
+
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx602 -DTEST_STORE -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx705 -DTEST_STORE -S -verify -o - %s
+// RUN: %clang_cc1 -triple amdgcn-unknown-unknown -target-cpu gfx810 -DTEST_STORE -S -verify -o - %s
+// REQUIRES: amdgpu-registered-target
+
+typedef __attribute__((__vector_size__(4 * sizeof(unsigned int)))) unsigned int v4u32;
+typedef v4u32 __global *global_ptr_to_v4u32;
+
+#ifdef TEST_LOAD
+v4u32 test_amdgcn_global_load_b128_01(global_ptr_to_v4u32 ptr, const char* scope) {
+ return __builtin_amdgcn_global_load_b128(ptr, ""); // expected-error{{'__builtin_amdgcn_global_load_b128' needs target feature gfx9-insts}}
+}
+#endif
+
+#ifdef TEST_STORE
+void test_amdgcn_global_store_b128_01(global_ptr_to_v4u32 ptr, v4u32 data, const char* scope) {
+ __builtin_amdgcn_global_store_b128(ptr, data, ""); // expected-error{{'__builtin_amdgcn_global_store_b128' needs target feature gfx9-insts}}
+}
+#endif
diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 7ecf1c1124894..39afd29737156 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1596,6 +1596,112 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
* 1 - Data cache.
Instruction cache prefetches are unsafe on invalid address.
+
+ llvm.amdgcn.global.load.b128 This intrinsic is supported on gfx9, gfx10, gfx11, and gfx12 targets.
+
+ Signature:
+
+ .. code-block:: llvm
+
+ <4 x i32> @llvm.amdgcn.global.load.b128(
+ ptr addrspace(1), ; source
+ metadata) ; scope - e.g. '!0' where '!0 = !{!"wavefront"}'
+
+ Reads the value from the source address with cache behavior specified by the scope.
+
+ The following table shows the mapping between valid scope values and target
+ instruction flags or field values.
+
+ ============== ========================== ========================== ========================== ========================== ==========================
+ targets instruction ``"wavefront"`` ``"workgroup"`` ``"agent"`` ``""`` (empty string)
+ ============== ========================== ========================== ========================== ========================== ==========================
+ gfx90* ``global_load_dwordx4`` ``glc`` ``glc``
+
+ gfx942, gfx950 ``global_load_dwordx4`` (wave) ``sc0`` (group) ``sc1`` (device) ``sc0 sc1`` (system)
+
+ gfx10* ``global_load_dwordx4`` ``glc`` ``glc dlc`` ``glc dlc``
+
+ gfx11* ``global_load_dwordx4`` ``glc`` ``glc`` ``glc``
+
+ gfx120* ``global_load_b128`` (CU) ``scope:SCOPE_SE`` (SE) ``scope:SCOPE_DEV`` (DEV) ``scope:SCOPE_SYS`` (SYS)
+
+ gfx125* ``global_load_b128`` (CU) ``scope:SCOPE_DEV`` (DEV) ``scope:SCOPE_SYS`` (SYS)
+ ============== ========================== ========================== ========================== ========================== ==========================
+
+ For gfx90*, see "GLC Bit Explained" in the appropriate instruction set reference
+ (e.g. Chapter 9.1.10 in "AMD Instinct MI100" Instruction Set Architecture Reference
+ Guide).
+
+ For gfx942 and gfx950 targets, see "Memory Scope and Temporal Controls" in the
+ appropriate instruction set reference (e.g. Chapter 9.1.10.2 in the "AMD Instinct
+ MI300" Instruction Set Architecture Reference Guide).
+
+ For gfx10* targets, see "GLC, DLC and SLC Bit Explained" in the appropriate
+ instruction set reference (e.g. Chapter 8.1.10 in "RDNA 2" Instruction Set Architecture
+ Reference Guide)
+
+ For gfx11* targets, see "Cache Controls: SLC, GLC and DLC" in the appropriate
+ instruction set reference (e.g. Chapter 4.1.1 in "RDNA3" Instruction Set Architecture
+ ...
[truncated]
``````````
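The scope table in the `AMDGPUUsage.rst` hunk can be read as a lookup from scope string to cache-control flags. The sketch below encodes only the `gfx942, gfx950` row as it appears in the truncated diff. The function name `gfx942ScopeFlags` is hypothetical and not part of the patch, and returning `nullptr` for unknown scopes is an illustrative choice rather than a description of how the backend reports errors.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of the patch: returns the instruction flags
// the documentation table lists for gfx942/gfx950, or nullptr when the scope
// is not one of the four documented values.
inline const char *gfx942ScopeFlags(const std::string &scope) {
  if (scope == "wavefront") return "";        // wave scope: no flags
  if (scope == "workgroup") return "sc0";     // group scope
  if (scope == "agent")     return "sc1";     // device scope
  if (scope.empty())        return "sc0 sc1"; // empty string: system scope
  return nullptr;                             // not a documented scope
}
```

For example, `gfx942ScopeFlags("agent")` yields `"sc1"`, while a misspelled scope such as `"wavegroup"` yields `nullptr`.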
</details>
https://github.com/llvm/llvm-project/pull/172090