[clang] [RFC][Docs][Clang][AMDGPU] Add AMDGPU builtins documentation (PR #181193)
via cfe-commits
cfe-commits at lists.llvm.org
Thu Feb 12 09:43:18 PST 2026
llvmbot wrote:
<!--LLVM PR SUMMARY COMMENT-->
@llvm/pr-subscribers-backend-amdgpu
Author: Shilei Tian (shiltian)
<details>
<summary>Changes</summary>
Add comprehensive documentation for AMDGPU target-specific builtins
(`AMDGPUBuiltins.rst`) covering argument semantics, restrictions, and
lowering notes for all builtin families.
This documentation was generated by AI (Claude) by cross-referencing:
- `clang/include/clang/Basic/BuiltinsAMDGPU.td` (builtin definitions)
- `llvm/include/llvm/IR/IntrinsicsAMDGPU.td` (intrinsic definitions)
- `clang/lib/Sema/SemaAMDGPU.cpp` (argument validation/constraints)
- `clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp` (lowering logic)
I did my best to proofread the parts I'm familiar with, but it would be greatly
appreciated if more people could help review it as well.
---
Patch is 69.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/181193.diff
2 Files Affected:
- (added) clang/docs/AMDGPUBuiltins.rst (+1807)
- (modified) clang/docs/index.rst (+1)
``````````diff
diff --git a/clang/docs/AMDGPUBuiltins.rst b/clang/docs/AMDGPUBuiltins.rst
new file mode 100644
index 0000000000000..9ca9a2bf3bd32
--- /dev/null
+++ b/clang/docs/AMDGPUBuiltins.rst
@@ -0,0 +1,1807 @@
+===============
+AMDGPU Builtins
+===============
+
+.. contents::
+ :local:
+ :depth: 2
+
+This document describes the AMDGPU target-specific builtins available in Clang.
+Most of these builtins provide direct access to AMDGPU hardware instructions
+and intrinsics. They are defined in ``clang/include/clang/Basic/BuiltinsAMDGPU.td``
+and typically lower to LLVM intrinsics defined in
+``llvm/include/llvm/IR/IntrinsicsAMDGPU.td``.
+
+All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or ``__builtin_r600_``
+for R600 targets). Arguments marked ``_Constant`` must be compile-time
+constant expressions.
+
+ABI / Special Register Builtins
+===============================
+
+These builtins provide access to kernel dispatch metadata, work-item and
+workgroup identification, and other ABI-level information. They are available
+on all SI+ targets.
+
+Pointer Builtins
+----------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``void __constant * __builtin_amdgcn_dispatch_ptr()``
+ - Returns a pointer (in constant address space 4) to the dispatch packet
+ (``hsa_kernel_dispatch_packet_t``). Used internally to derive workgroup
+ size, grid size, and other dispatch parameters.
+ * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()``
+ - Returns a pointer to the beginning of the kernel argument segment.
+ * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()``
+ - Returns a pointer to the implicit arguments appended after explicit
+ kernel arguments. Layout depends on the code object version.
+ * - ``void __constant * __builtin_amdgcn_queue_ptr()``
+ - Returns a pointer to the ``hsa_queue_t`` object for the queue executing
+ the current kernel.
+
+Work-Item and Workgroup Identification
+--------------------------------------
+
+All of these are ``Const`` (pure) builtins that take no arguments and return
+``unsigned int`` (or ``unsigned short`` for workgroup size).
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 30 30
+
+ * - Builtin
+ - Return Type
+ - Description
+ * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()``
+ - ``unsigned int``
+ - Workgroup ID in the specified dimension.
+ * - ``__builtin_amdgcn_workitem_id_{x,y,z}()``
+ - ``unsigned int``
+ - Work-item (thread) ID within the workgroup.
+ * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()``
+ - ``unsigned short``
+ - Workgroup size in the specified dimension. Lowered via a load from the
+ dispatch or implicit argument pointer, not a dedicated instruction.
+ * - ``__builtin_amdgcn_grid_size_{x,y,z}()``
+ - ``unsigned int``
+ - Total grid size in the specified dimension. Lowered via a load from the
+ dispatch pointer.
+
+**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``):
+
+.. list-table::
+ :header-rows: 1
+ :widths: 50 50
+
+ * - Builtin
+ - Description
+ * - ``__builtin_amdgcn_cluster_id_{x,y,z}()``
+ - Cluster ID in the specified dimension.
+ * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()``
+ - Workgroup ID within the cluster.
+ * - ``__builtin_amdgcn_cluster_workgroup_flat_id()``
+ - Flat (linearized) workgroup ID within the cluster.
+ * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()``
+ - Maximum workgroup ID within the cluster.
+ * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()``
+ - Maximum flat workgroup ID within the cluster.
+
+Other ABI Builtins
+------------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned int val)``
+ - Counts the number of set bits in ``mask`` for lanes lower than the
+ current lane within the lower 32 bits of the exec mask, adds ``val``.
+ * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned int val)``
+ - Same as ``mbcnt_lo`` but for the upper 32 bits of the exec mask.
+ * - ``uint64_t __builtin_amdgcn_s_memtime()``
+ - Returns a 64-bit timestamp counter. Requires ``s-memtime-inst``.
+
+Instruction Builtins
+====================
+
+Scalar Instruction Builtins
+---------------------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``unsigned int __builtin_amdgcn_s_getreg(_Constant int hwreg)``
+ - Reads a hardware register. ``hwreg`` is an encoded register specifier
+ (register ID, offset, and width packed into 16 bits).
+ * - ``void __builtin_amdgcn_s_setreg(_Constant int hwreg, unsigned int val)``
+ - Writes ``val`` to a hardware register. ``hwreg`` must be in
+ range [0, 65535].
+ * - ``uint64_t __builtin_amdgcn_s_getpc()``
+ - Returns the current program counter.
+ * - ``void __builtin_amdgcn_s_waitcnt(_Constant int cnt)``
+ - Inserts an ``s_waitcnt`` instruction with the encoded wait count.
+ * - ``void __builtin_amdgcn_s_sendmsg(_Constant int msg, unsigned int gsdata)``
+ - Sends message ``msg`` with GS data in ``gsdata``.
+ * - ``void __builtin_amdgcn_s_sendmsghalt(_Constant int msg, unsigned int gsdata)``
+ - Same as ``s_sendmsg`` but also halts the wavefront.
+ * - ``void __builtin_amdgcn_s_barrier()``
+ - Inserts a workgroup barrier.
+ * - ``void __builtin_amdgcn_s_ttracedata(int data)``
+ - Writes ``data`` to the thread trace buffer.
+ * - ``void __builtin_amdgcn_s_sleep(_Constant int duration)``
+ - Sleeps for approximately ``duration`` cycles.
+ * - ``void __builtin_amdgcn_s_incperflevel(_Constant int level)``
+ - Increments the performance counter level.
+ * - ``void __builtin_amdgcn_s_decperflevel(_Constant int level)``
+ - Decrements the performance counter level.
+ * - ``void __builtin_amdgcn_s_setprio(_Constant short prio)``
+ - Sets the wavefront priority.
+ * - ``void __builtin_amdgcn_s_dcache_inv()``
+ - Invalidates the scalar data cache.
+ * - ``void __builtin_amdgcn_buffer_wbinvl1()``
+ - Write-back and invalidate L1 buffer cache.
+ * - ``unsigned int __builtin_amdgcn_groupstaticsize()``
+ - Returns the size of static LDS allocation in the current workgroup.
+ * - ``unsigned int __builtin_amdgcn_wavefrontsize()``
+ - Returns the wavefront size (32 or 64).
+ * - ``void __builtin_amdgcn_wave_barrier()``
+ - Inserts a wave-level barrier hint.
+
+Division and Math Builtins
+--------------------------
+
+Division Support
+^^^^^^^^^^^^^^^^
+
+These builtins implement steps of the iterative double-precision division
+algorithm.
+
+``__builtin_amdgcn_div_scale`` / ``__builtin_amdgcn_div_scalef``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ double __builtin_amdgcn_div_scale(double numer, double denom, bool select_quotient, bool *flag_out);
+ float __builtin_amdgcn_div_scalef(float numer, float denom, bool select_quotient, bool *flag_out);
+
+Scales the numerator or denominator for a subsequent iterative division.
+
+- ``numer``: The numerator.
+- ``denom``: The denominator.
+- ``select_quotient``: If ``true``, selects the numerator for scaling; if
+ ``false``, selects the denominator.
+- ``flag_out``: Pointer to a ``bool`` where the overflow/underflow flag is
+ written.
+
+**Lowering note**: The underlying intrinsic returns ``{result, flag}`` as a
+struct. The builtin unpacks this, returning the result and storing the flag
+through the pointer.
+
+``__builtin_amdgcn_div_fmas`` / ``__builtin_amdgcn_div_fmasf``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ double __builtin_amdgcn_div_fmas(double a, double b, double c, bool vcc);
+ float __builtin_amdgcn_div_fmasf(float a, float b, float c, bool vcc);
+
+Fused multiply-add for division, with VCC flag input.
+
+- ``a``, ``b``, ``c``: FMA operands (computes ``a * b + c``).
+- ``vcc``: The flag from ``div_scale``.
+
+**Lowering note**: The integer ``vcc`` argument is converted to ``i1`` via
+``IsNotNull`` before passing to the intrinsic.
+
+``__builtin_amdgcn_div_fixup`` / ``__builtin_amdgcn_div_fixupf`` / ``__builtin_amdgcn_div_fixuph``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ double __builtin_amdgcn_div_fixup(double a, double b, double c);
+ float __builtin_amdgcn_div_fixupf(float a, float b, float c);
+ __fp16 __builtin_amdgcn_div_fixuph(__fp16 a, __fp16 b, __fp16 c); // requires 16-bit-insts
+
+Applies post-division fixup for special values (NaN, Inf, zero).
+
+Trigonometric Pre-operation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+ double __builtin_amdgcn_trig_preop(double src, int segment);
+ float __builtin_amdgcn_trig_preopf(float src, int segment);
+
+Looks up ``2.0 / pi`` with segment selector ``segment[4:0]`` for range
+reduction before trigonometric operations.
+
+Single-Argument Math Builtins
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These builtins compute hardware-precision math operations. The ``f32`` versions
+(e.g., ``sinf``, ``logf``) may not handle denormals correctly. The ``h``-suffixed
+variants require ``16-bit-insts``.
+
+.. list-table::
+ :header-rows: 1
+ :widths: 25 25 25 25
+
+ * - Operation
+ - f64
+ - f32
+ - f16
+ * - Reciprocal
+ - ``__builtin_amdgcn_rcp``
+ - ``__builtin_amdgcn_rcpf``
+ - ``__builtin_amdgcn_rcph``
+ * - Square root
+ - ``__builtin_amdgcn_sqrt``
+ - ``__builtin_amdgcn_sqrtf``
+ - ``__builtin_amdgcn_sqrth``
+ * - Reciprocal sqrt
+ - ``__builtin_amdgcn_rsq``
+ - ``__builtin_amdgcn_rsqf``
+ - ``__builtin_amdgcn_rsqh``
+ * - Reciprocal sqrt clamp
+ - ``__builtin_amdgcn_rsq_clamp``
+ - ``__builtin_amdgcn_rsq_clampf``
+ -
+ * - Sine (input: turns)
+ -
+ - ``__builtin_amdgcn_sinf``
+ - ``__builtin_amdgcn_sinh``
+ * - Cosine (input: turns)
+ -
+ - ``__builtin_amdgcn_cosf``
+ - ``__builtin_amdgcn_cosh``
+ * - Log2
+ -
+ - ``__builtin_amdgcn_logf``
+ -
+ * - Log clamp
+ -
+ - ``__builtin_amdgcn_log_clampf``
+ -
+ * - Exp2
+ -
+ - ``__builtin_amdgcn_exp2f``
+ -
+ * - Fraction
+ - ``__builtin_amdgcn_fract``
+ - ``__builtin_amdgcn_fractf``
+ - ``__builtin_amdgcn_fracth``
+ * - Mantissa
+ - ``__builtin_amdgcn_frexp_mant``
+ - ``__builtin_amdgcn_frexp_mantf``
+ - ``__builtin_amdgcn_frexp_manth``
+ * - Exponent
+ - ``__builtin_amdgcn_frexp_exp``
+ - ``__builtin_amdgcn_frexp_expf``
+ - ``__builtin_amdgcn_frexp_exph``
+
+Note: ``sinf``/``cosf`` take input in **turns** (1.0 = full circle), not
+radians. ``logf`` performs ``log2``. ``exp2f`` performs ``2^x``. The ``frexp_exp``
+variants return ``int`` (or ``short`` for f16).
+
+Ldexp
+^^^^^
+
+.. code-block:: c
+
+ double __builtin_amdgcn_ldexp(double x, int exp);
+ float __builtin_amdgcn_ldexpf(float x, int exp);
+ __fp16 __builtin_amdgcn_ldexph(__fp16 x, int exp); // requires 16-bit-insts
+
+Computes ``x * 2^exp``. Lowered to the standard ``llvm.ldexp`` intrinsic.
+For the ``h`` variant, the exponent is truncated to ``i16``.
+
+FP Classify
+^^^^^^^^^^^
+
+.. code-block:: c
+
+ bool __builtin_amdgcn_class(double x, int mask);
+ bool __builtin_amdgcn_classf(float x, int mask);
+ bool __builtin_amdgcn_classh(__fp16 x, int mask); // requires 16-bit-insts
+
+Tests ``x`` against a bitmask of FP classes. Returns ``true`` if ``x`` matches
+any of the selected classes. The ``mask`` bits are:
+
+- Bit 0: Signaling NaN
+- Bit 1: Quiet NaN
+- Bit 2: Negative infinity
+- Bit 3: Negative normal
+- Bit 4: Negative denormal
+- Bit 5: Negative zero
+- Bit 6: Positive zero
+- Bit 7: Positive denormal
+- Bit 8: Positive normal
+- Bit 9: Positive infinity
+
+Median
+^^^^^^
+
+.. code-block:: c
+
+ float __builtin_amdgcn_fmed3f(float a, float b, float c);
+ __fp16 __builtin_amdgcn_fmed3h(__fp16 a, __fp16 b, __fp16 c); // requires gfx9-insts
+
+Returns the median (middle value) of three floating-point numbers.
+
+Cube Map Builtins
+^^^^^^^^^^^^^^^^^
+
+Require ``cube-insts``. All take three floats (x, y, z direction vector
+components) and return a float.
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``__builtin_amdgcn_cubeid(x, y, z)``
+ - Returns the face ID (0-5) of the cube map.
+ * - ``__builtin_amdgcn_cubesc(x, y, z)``
+ - Returns the S coordinate for the cube face.
+ * - ``__builtin_amdgcn_cubetc(x, y, z)``
+ - Returns the T coordinate for the cube face.
+ * - ``__builtin_amdgcn_cubema(x, y, z)``
+ - Returns the major axis value.
+
+Data Sharing Builtins
+---------------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``int __builtin_amdgcn_ds_swizzle(int data, _Constant int pattern)``
+ - Performs a data-parallel swizzle within the wavefront according to the
+ encoded ``pattern``.
+ * - ``int __builtin_amdgcn_ds_permute(int addr, int data)``
+ - Forward cross-lane permutation. Lane ``i`` gets the ``data`` value from
+ the lane specified by ``addr / 4``.
+ * - ``int __builtin_amdgcn_ds_bpermute(int addr, int data)``
+ - Backward cross-lane permutation. Lane ``i`` reads from the lane
+ specified by ``addr / 4``.
+ * - ``int __builtin_amdgcn_ds_append(int __local *ptr)``
+ - Atomically increments the value at ``ptr`` and returns the old value.
+ The pointer must be in LDS (address space 3).
+ * - ``int __builtin_amdgcn_ds_consume(int __local *ptr)``
+ - Atomically decrements the value at ``ptr`` and returns the new value.
+
+DS Float Atomics
+^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+ float __builtin_amdgcn_ds_faddf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+ float __builtin_amdgcn_ds_fminf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+ float __builtin_amdgcn_ds_fmaxf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+
+Perform atomic float add/min/max on LDS memory. The ``ordering`` and ``scope``
+arguments are passed through but the operations are lowered to ``AtomicRMW``
+instructions.
+
+Lane Builtins
+-------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``int __builtin_amdgcn_readfirstlane(int val)``
+ - Returns the value of ``val`` from the first active lane.
+ * - ``int __builtin_amdgcn_readlane(int val, int lane)``
+ - Returns the value of ``val`` from the specified ``lane``.
+
+Bit Manipulation
+----------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``unsigned int __builtin_amdgcn_alignbit(unsigned int hi, unsigned int lo, unsigned int shift)``
+ - Concatenates ``hi:lo`` as a 64-bit value and extracts 32 bits starting
+ at bit ``shift``. Lowered to ``llvm.fshr``.
+ * - ``unsigned int __builtin_amdgcn_alignbyte(unsigned int hi, unsigned int lo, unsigned int shift)``
+ - Same as ``alignbit`` but ``shift`` is in bytes.
+ * - ``unsigned int __builtin_amdgcn_ubfe(unsigned int base, unsigned int offset, unsigned int width)``
+ - Unsigned bitfield extract from ``base`` starting at ``offset`` for
+ ``width`` bits.
+ * - ``unsigned int __builtin_amdgcn_sbfe(unsigned int base, unsigned int offset, unsigned int width)``
+ - Signed bitfield extract.
+ * - ``unsigned int __builtin_amdgcn_lerp(unsigned int a, unsigned int b, unsigned int c)``
+ - Per-byte unsigned linear interpolation. Requires ``lerp-inst``.
+ * - ``unsigned int __builtin_amdgcn_perm(unsigned int a, unsigned int b, unsigned int sel)``
+ - Byte permutation. ``sel`` encodes which byte of the ``a:b`` pair to
+ select for each byte of the result. Requires ``gfx8-insts``.
+
+Conversion Builtins
+-------------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``half2 __builtin_amdgcn_cvt_pkrtz(float a, float b)``
+ - Converts two f32 values to packed f16 with round-to-zero.
+ * - ``short2 __builtin_amdgcn_cvt_pknorm_i16(float a, float b)``
+ - Converts two f32 values to packed normalized i16. Requires
+ ``cvt-pknorm-vop2-insts``.
+ * - ``ushort2 __builtin_amdgcn_cvt_pknorm_u16(float a, float b)``
+ - Converts two f32 values to packed normalized u16.
+ * - ``short2 __builtin_amdgcn_cvt_pk_i16(int a, int b)``
+ - Packs two i32 values into i16x2.
+ * - ``ushort2 __builtin_amdgcn_cvt_pk_u16(unsigned int a, unsigned int b)``
+ - Packs two u32 values into u16x2.
+ * - ``unsigned int __builtin_amdgcn_cvt_pk_u8_f32(float val, unsigned int bytesel, unsigned int old)``
+ - Converts ``val`` to u8 and inserts at byte ``bytesel`` in ``old``.
+ * - ``float __builtin_amdgcn_cvt_off_f32_i4(int val)``
+ - Converts a 4-bit integer offset to f32.
+
+SAD (Sum of Absolute Differences)
+---------------------------------
+
+.. list-table::
+ :header-rows: 1
+ :widths: 40 60
+
+ * - Builtin
+ - Description
+ * - ``unsigned int __builtin_amdgcn_msad_u8(unsigned int a, unsigned int b, unsigned int c)``
+ - Masked sum of absolute differences of unsigned 8-bit values.
+ * - ``unsigned int __builtin_amdgcn_sad_u8(unsigned int a, unsigned int b, unsigned int c)``
+ - Sum of absolute differences of unsigned 8-bit values. Requires
+ ``sad-insts``.
+ * - ``unsigned int __builtin_amdgcn_sad_hi_u8(unsigned int a, unsigned int b, unsigned int c)``
+ - SAD with result in high 16 bits. Requires ``sad-insts``.
+ * - ``unsigned int __builtin_amdgcn_sad_u16(unsigned int a, unsigned int b, unsigned int c)``
+ - SAD of unsigned 16-bit values. Requires ``sad-insts``.
+ * - ``uint64_t __builtin_amdgcn_qsad_pk_u16_u8(uint64_t a, unsigned int b, uint64_t c)``
+ - Quad SAD packed. Requires ``qsad-insts``.
+ * - ``uint64_t __builtin_amdgcn_mqsad_pk_u16_u8(uint64_t a, unsigned int b, uint64_t c)``
+ - Masked quad SAD packed.
+ * - ``uint4 __builtin_amdgcn_mqsad_u32_u8(uint64_t a, unsigned int b, uint4 c)``
+ - Masked quad SAD returning 4x u32.
+
+Buffer Resource and Load/Store
+==============================
+
+make_buffer_rsrc
+----------------
+
+.. code-block:: c
+
+ __amdgpu_buffer_rsrc_t __builtin_amdgcn_make_buffer_rsrc(void *base, short stride, int64_t num_records, int flags);
+
+Constructs a buffer resource descriptor from the given fields:
+
+- ``base``: Base pointer.
+- ``stride``: Stride of structured buffer (0 for raw).
+- ``num_records``: Number of records (bytes for raw buffers).
+- ``flags``: SRD flags (DST_SEL, NUM_FORMAT, DATA_FORMAT, etc.).
+
+Raw Buffer Load/Store
+---------------------
+
+These builtins load/store data through a buffer resource descriptor.
+
+.. code-block:: c
+
+ // Stores
+ void __builtin_amdgcn_raw_buffer_store_b{8,16,32,64,96,128}(data, __amdgpu_buffer_rsrc_t rsrc, int offset, int soffset, _Constant int cachepolicy);
+ // Loads
+ T __builtin_amdgcn_raw_buffer_load_b{8,16,32,64,96,128}(__amdgpu_buffer_rsrc_t rsrc, int offset, int soffset, _Constant int cachepolicy);
+
+Arguments:
+
+- ``rsrc``: Buffer resource descriptor (128-bit SRD, typically SGPR).
+- ``offset``: Byte offset (VGPR or immediate). Included in bounds checking and
+ swizzling.
+- ``soffset``: Scalar byte offset (SGPR or immediate). Excluded from bounds
+ checking and swizzling.
+- ``cachepolicy``: Immediate bitfield controlling cache behavior:
+
+ - Pre-GFX12: bit 0 = GLC, bit 1 = SLC, bit 2 = DLC (gfx10/gfx11),
+ bit 3 = SWZ, bit 4 = SCC (gfx90a).
+ - GFX942: bit 0 = SC0, bit 1 = NT, bit 3 = SWZ, bit 4 = SC1.
+ - GFX12+: bits [0:2] = TH, bits [3:4] = scope, bit 6 = SWZ.
+ - All: bit 31 = volatile (stripped at lowering).
+
+The data types for each width are: ``b8`` = ``unsigned char``,
+``b16`` = ``unsigned short``, ``b32`` = ``unsigned int``,
+``b64`` = ``uint2``, ``b96`` = ``uint3``, ``b128`` = ``uint4``.
+
+Raw Ptr Buffer Atomics
+---...
[truncated]
``````````
</details>
https://github.com/llvm/llvm-project/pull/181193
More information about the cfe-commits
mailing list