[clang] [RFC][Docs][Clang][AMDGPU] Add AMDGPU builtins documentation (PR #181193)

Thu Feb 12 09:43:18 PST 2026

llvmbot wrote:




@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

<details>
<summary>Changes</summary>

Add comprehensive documentation for AMDGPU target-specific builtins
(`AMDGPUBuiltins.rst`) covering argument semantics, restrictions, and
lowering notes for all builtin families.

This documentation was generated by AI (Claude) by cross-referencing:
- `clang/include/clang/Basic/BuiltinsAMDGPU.td` (builtin definitions)
- `llvm/include/llvm/IR/IntrinsicsAMDGPU.td` (intrinsic definitions)
- `clang/lib/Sema/SemaAMDGPU.cpp` (argument validation/constraints)
- `clang/lib/CodeGen/TargetBuiltins/AMDGPU.cpp` (lowering logic)

I did my best to proofread the parts I'm familiar with, but it would be greatly
appreciated if more people could help review it as well.

---

Patch is 69.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/181193.diff


2 Files Affected:

- (added) clang/docs/AMDGPUBuiltins.rst (+1807) 
- (modified) clang/docs/index.rst (+1) 


``````````diff

diff --git a/clang/docs/AMDGPUBuiltins.rst b/clang/docs/AMDGPUBuiltins.rst
new file mode 100644
index 0000000000000..9ca9a2bf3bd32
--- /dev/null
+++ b/clang/docs/AMDGPUBuiltins.rst
@@ -0,0 +1,1807 @@
+===============
+AMDGPU Builtins
+===============
+
+.. contents::
+   :local:
+   :depth: 2
+
+This document describes the AMDGPU target-specific builtins available in Clang.
+Most of these builtins provide direct access to AMDGPU hardware instructions
+and intrinsics. They are defined in ``clang/include/clang/Basic/BuiltinsAMDGPU.td``
+and typically lower to LLVM intrinsics defined in
+``llvm/include/llvm/IR/IntrinsicsAMDGPU.td``.
+
+All AMDGPU builtins use the ``__builtin_amdgcn_`` prefix (or ``__builtin_r600_``
+for R600 targets). Arguments marked ``_Constant`` must be compile-time
+constant expressions.
+
+ABI / Special Register Builtins
+===============================
+
+These builtins provide access to kernel dispatch metadata, work-item and
+workgroup identification, and other ABI-level information. They are available
+on all SI+ targets.
+
+Pointer Builtins
+----------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``void __constant * __builtin_amdgcn_dispatch_ptr()``
+     - Returns a pointer (in constant address space 4) to the dispatch packet
+       (``hsa_kernel_dispatch_packet_t``). Used internally to derive workgroup
+       size, grid size, and other dispatch parameters.
+   * - ``void __constant * __builtin_amdgcn_kernarg_segment_ptr()``
+     - Returns a pointer to the beginning of the kernel argument segment.
+   * - ``void __constant * __builtin_amdgcn_implicitarg_ptr()``
+     - Returns a pointer to the implicit arguments appended after explicit
+       kernel arguments. Layout depends on the code object version.
+   * - ``void __constant * __builtin_amdgcn_queue_ptr()``
+     - Returns a pointer to the ``hsa_queue_t`` object for the queue executing
+       the current kernel.
+
+Work-Item and Workgroup Identification
+--------------------------------------
+
+All of these are ``Const`` (pure) builtins that take no arguments and return
+``unsigned int`` (or ``unsigned short`` for workgroup size).
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 30 30
+
+   * - Builtin
+     - Return Type
+     - Description
+   * - ``__builtin_amdgcn_workgroup_id_{x,y,z}()``
+     - ``unsigned int``
+     - Workgroup ID in the specified dimension.
+   * - ``__builtin_amdgcn_workitem_id_{x,y,z}()``
+     - ``unsigned int``
+     - Work-item (thread) ID within the workgroup.
+   * - ``__builtin_amdgcn_workgroup_size_{x,y,z}()``
+     - ``unsigned short``
+     - Workgroup size in the specified dimension. Lowered via a load from the
+       dispatch or implicit argument pointer, not a dedicated instruction.
+   * - ``__builtin_amdgcn_grid_size_{x,y,z}()``
+     - ``unsigned int``
+     - Total grid size in the specified dimension. Lowered via a load from the
+       dispatch pointer.
+
+**GFX1250+ Cluster Identification** (requires ``gfx1250-insts``):
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 50
+
+   * - Builtin
+     - Description
+   * - ``__builtin_amdgcn_cluster_id_{x,y,z}()``
+     - Cluster ID in the specified dimension.
+   * - ``__builtin_amdgcn_cluster_workgroup_id_{x,y,z}()``
+     - Workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_flat_id()``
+     - Flat (linearized) workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_id_{x,y,z}()``
+     - Maximum workgroup ID within the cluster.
+   * - ``__builtin_amdgcn_cluster_workgroup_max_flat_id()``
+     - Maximum flat workgroup ID within the cluster.
+
+Other ABI Builtins
+------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_mbcnt_lo(unsigned int mask, unsigned int val)``
+     - Counts the number of set bits in ``mask`` for lanes lower than the
+       current lane within the lower 32 bits of the exec mask, adds ``val``.
+   * - ``unsigned int __builtin_amdgcn_mbcnt_hi(unsigned int mask, unsigned int val)``
+     - Same as ``mbcnt_lo`` but for the upper 32 bits of the exec mask.
+   * - ``uint64_t __builtin_amdgcn_s_memtime()``
+     - Returns a 64-bit timestamp counter. Requires ``s-memtime-inst``.
+
+Instruction Builtins
+====================
+
+Scalar Instruction Builtins
+---------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_s_getreg(_Constant int hwreg)``
+     - Reads a hardware register. ``hwreg`` is an encoded register specifier
+       (register ID, offset, and width packed into 16 bits).
+   * - ``void __builtin_amdgcn_s_setreg(_Constant int hwreg, unsigned int val)``
+     - Writes ``val`` to a hardware register. ``hwreg`` must be in
+       range [0, 65535].
+   * - ``uint64_t __builtin_amdgcn_s_getpc()``
+     - Returns the current program counter.
+   * - ``void __builtin_amdgcn_s_waitcnt(_Constant int cnt)``
+     - Inserts an ``s_waitcnt`` instruction with the encoded wait count.
+   * - ``void __builtin_amdgcn_s_sendmsg(_Constant int msg, unsigned int gsdata)``
+     - Sends message ``msg`` with GS data in ``gsdata``.
+   * - ``void __builtin_amdgcn_s_sendmsghalt(_Constant int msg, unsigned int gsdata)``
+     - Same as ``s_sendmsg`` but also halts the wavefront.
+   * - ``void __builtin_amdgcn_s_barrier()``
+     - Inserts a workgroup barrier.
+   * - ``void __builtin_amdgcn_s_ttracedata(int data)``
+     - Writes ``data`` to the thread trace buffer.
+   * - ``void __builtin_amdgcn_s_sleep(_Constant int duration)``
+     - Sleeps for approximately ``duration`` cycles.
+   * - ``void __builtin_amdgcn_s_incperflevel(_Constant int level)``
+     - Increments the performance counter level.
+   * - ``void __builtin_amdgcn_s_decperflevel(_Constant int level)``
+     - Decrements the performance counter level.
+   * - ``void __builtin_amdgcn_s_setprio(_Constant short prio)``
+     - Sets the wavefront priority.
+   * - ``void __builtin_amdgcn_s_dcache_inv()``
+     - Invalidates the scalar data cache.
+   * - ``void __builtin_amdgcn_buffer_wbinvl1()``
+     - Write-back and invalidate L1 buffer cache.
+   * - ``unsigned int __builtin_amdgcn_groupstaticsize()``
+     - Returns the size of static LDS allocation in the current workgroup.
+   * - ``unsigned int __builtin_amdgcn_wavefrontsize()``
+     - Returns the wavefront size (32 or 64).
+   * - ``void __builtin_amdgcn_wave_barrier()``
+     - Inserts a wave-level barrier hint.
+
+Division and Math Builtins
+--------------------------
+
+Division Support
+^^^^^^^^^^^^^^^^
+
+These builtins implement steps of the iterative double-precision division
+algorithm.
+
+``__builtin_amdgcn_div_scale`` / ``__builtin_amdgcn_div_scalef``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_scale(double numer, double denom, bool select_quotient, bool *flag_out);
+   float  __builtin_amdgcn_div_scalef(float numer, float denom, bool select_quotient, bool *flag_out);
+
+Scales the numerator or denominator for a subsequent iterative division.
+
+- ``numer``: The numerator.
+- ``denom``: The denominator.
+- ``select_quotient``: If ``true``, selects the numerator for scaling; if
+  ``false``, selects the denominator.
+- ``flag_out``: Pointer to a ``bool`` where the overflow/underflow flag is
+  written.
+
+**Lowering note**: The underlying intrinsic returns ``{result, flag}`` as a
+struct. The builtin unpacks this, returning the result and storing the flag
+through the pointer.
+
+``__builtin_amdgcn_div_fmas`` / ``__builtin_amdgcn_div_fmasf``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_fmas(double a, double b, double c, bool vcc);
+   float  __builtin_amdgcn_div_fmasf(float a, float b, float c, bool vcc);
+
+Fused multiply-add for division, with VCC flag input.
+
+- ``a``, ``b``, ``c``: FMA operands (computes ``a * b + c``).
+- ``vcc``: The flag from ``div_scale``.
+
+**Lowering note**: The integer ``vcc`` argument is converted to ``i1`` via
+``IsNotNull`` before passing to the intrinsic.
+
+``__builtin_amdgcn_div_fixup`` / ``__builtin_amdgcn_div_fixupf`` / ``__builtin_amdgcn_div_fixuph``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+   double __builtin_amdgcn_div_fixup(double a, double b, double c);
+   float  __builtin_amdgcn_div_fixupf(float a, float b, float c);
+   __fp16 __builtin_amdgcn_div_fixuph(__fp16 a, __fp16 b, __fp16 c);  // requires 16-bit-insts
+
+Applies post-division fixup for special values (NaN, Inf, zero).
+
+Trigonometric Pre-operation
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+   double __builtin_amdgcn_trig_preop(double src, int segment);
+   float  __builtin_amdgcn_trig_preopf(float src, int segment);
+
+Looks up ``2.0 / pi`` with segment selector ``segment[4:0]`` for range
+reduction before trigonometric operations.
+
+Single-Argument Math Builtins
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These builtins compute hardware-precision math operations. The ``f32`` versions
+(e.g., ``sinf``, ``logf``) may not handle denormals correctly. The ``h``-suffixed
+variants require ``16-bit-insts``.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 25 25 25 25
+
+   * - Operation
+     - f64
+     - f32
+     - f16
+   * - Reciprocal
+     - ``__builtin_amdgcn_rcp``
+     - ``__builtin_amdgcn_rcpf``
+     - ``__builtin_amdgcn_rcph``
+   * - Square root
+     - ``__builtin_amdgcn_sqrt``
+     - ``__builtin_amdgcn_sqrtf``
+     - ``__builtin_amdgcn_sqrth``
+   * - Reciprocal sqrt
+     - ``__builtin_amdgcn_rsq``
+     - ``__builtin_amdgcn_rsqf``
+     - ``__builtin_amdgcn_rsqh``
+   * - Reciprocal sqrt clamp
+     - ``__builtin_amdgcn_rsq_clamp``
+     - ``__builtin_amdgcn_rsq_clampf``
+     -
+   * - Sine (input: turns)
+     -
+     - ``__builtin_amdgcn_sinf``
+     - ``__builtin_amdgcn_sinh``
+   * - Cosine (input: turns)
+     -
+     - ``__builtin_amdgcn_cosf``
+     - ``__builtin_amdgcn_cosh``
+   * - Log2
+     -
+     - ``__builtin_amdgcn_logf``
+     -
+   * - Log clamp
+     -
+     - ``__builtin_amdgcn_log_clampf``
+     -
+   * - Exp2
+     -
+     - ``__builtin_amdgcn_exp2f``
+     -
+   * - Fraction
+     - ``__builtin_amdgcn_fract``
+     - ``__builtin_amdgcn_fractf``
+     - ``__builtin_amdgcn_fracth``
+   * - Mantissa
+     - ``__builtin_amdgcn_frexp_mant``
+     - ``__builtin_amdgcn_frexp_mantf``
+     - ``__builtin_amdgcn_frexp_manth``
+   * - Exponent
+     - ``__builtin_amdgcn_frexp_exp``
+     - ``__builtin_amdgcn_frexp_expf``
+     - ``__builtin_amdgcn_frexp_exph``
+
+Note: ``sinf``/``cosf`` take input in **turns** (1.0 = full circle), not
+radians. ``logf`` performs ``log2``. ``exp2f`` performs ``2^x``. The ``frexp_exp``
+variants return ``int`` (or ``short`` for f16).
+
+Ldexp
+^^^^^
+
+.. code-block:: c
+
+   double __builtin_amdgcn_ldexp(double x, int exp);
+   float  __builtin_amdgcn_ldexpf(float x, int exp);
+   __fp16 __builtin_amdgcn_ldexph(__fp16 x, int exp);  // requires 16-bit-insts
+
+Computes ``x * 2^exp``. Lowered to the standard ``llvm.ldexp`` intrinsic.
+For the ``h`` variant, the exponent is truncated to ``i16``.
+
+FP Classify
+^^^^^^^^^^^
+
+.. code-block:: c
+
+   bool __builtin_amdgcn_class(double x, int mask);
+   bool __builtin_amdgcn_classf(float x, int mask);
+   bool __builtin_amdgcn_classh(__fp16 x, int mask);  // requires 16-bit-insts
+
+Tests ``x`` against a bitmask of FP classes. Returns ``true`` if ``x`` matches
+any of the selected classes. The ``mask`` bits are:
+
+- Bit 0: Signaling NaN
+- Bit 1: Quiet NaN
+- Bit 2: Negative infinity
+- Bit 3: Negative normal
+- Bit 4: Negative denormal
+- Bit 5: Negative zero
+- Bit 6: Positive zero
+- Bit 7: Positive denormal
+- Bit 8: Positive normal
+- Bit 9: Positive infinity
+
+Median
+^^^^^^
+
+.. code-block:: c
+
+   float  __builtin_amdgcn_fmed3f(float a, float b, float c);
+   __fp16 __builtin_amdgcn_fmed3h(__fp16 a, __fp16 b, __fp16 c);  // requires gfx9-insts
+
+Returns the median (middle value) of three floating-point numbers.
+
+Cube Map Builtins
+^^^^^^^^^^^^^^^^^
+
+Require ``cube-insts``. All take three floats (x, y, z direction vector
+components) and return a float.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``__builtin_amdgcn_cubeid(x, y, z)``
+     - Returns the face ID (0-5) of the cube map.
+   * - ``__builtin_amdgcn_cubesc(x, y, z)``
+     - Returns the S coordinate for the cube face.
+   * - ``__builtin_amdgcn_cubetc(x, y, z)``
+     - Returns the T coordinate for the cube face.
+   * - ``__builtin_amdgcn_cubema(x, y, z)``
+     - Returns the major axis value.
+
+Data Sharing Builtins
+---------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``int __builtin_amdgcn_ds_swizzle(int data, _Constant int pattern)``
+     - Performs a data-parallel swizzle within the wavefront according to the
+       encoded ``pattern``.
+   * - ``int __builtin_amdgcn_ds_permute(int addr, int data)``
+     - Forward cross-lane permutation. Lane ``i`` gets the ``data`` value from
+       the lane specified by ``addr / 4``.
+   * - ``int __builtin_amdgcn_ds_bpermute(int addr, int data)``
+     - Backward cross-lane permutation. Lane ``i`` reads from the lane
+       specified by ``addr / 4``.
+   * - ``int __builtin_amdgcn_ds_append(int __local *ptr)``
+     - Atomically increments the value at ``ptr`` and returns the old value.
+       The pointer must be in LDS (address space 3).
+   * - ``int __builtin_amdgcn_ds_consume(int __local *ptr)``
+     - Atomically decrements the value at ``ptr`` and returns the new value.
+
+DS Float Atomics
+^^^^^^^^^^^^^^^^
+
+.. code-block:: c
+
+   float __builtin_amdgcn_ds_faddf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+   float __builtin_amdgcn_ds_fminf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+   float __builtin_amdgcn_ds_fmaxf(float __local *ptr, float val, _Constant int ordering, _Constant int scope, _Constant bool isVolatile);
+
+Perform atomic float add/min/max on LDS memory. The ``ordering`` and ``scope``
+arguments are passed through but the operations are lowered to ``AtomicRMW``
+instructions.
+
+Lane Builtins
+-------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``int __builtin_amdgcn_readfirstlane(int val)``
+     - Returns the value of ``val`` from the first active lane.
+   * - ``int __builtin_amdgcn_readlane(int val, int lane)``
+     - Returns the value of ``val`` from the specified ``lane``.
+
+Bit Manipulation
+----------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_alignbit(unsigned int hi, unsigned int lo, unsigned int shift)``
+     - Concatenates ``hi:lo`` as a 64-bit value and extracts 32 bits starting
+       at bit ``shift``. Lowered to ``llvm.fshr``.
+   * - ``unsigned int __builtin_amdgcn_alignbyte(unsigned int hi, unsigned int lo, unsigned int shift)``
+     - Same as ``alignbit`` but ``shift`` is in bytes.
+   * - ``unsigned int __builtin_amdgcn_ubfe(unsigned int base, unsigned int offset, unsigned int width)``
+     - Unsigned bitfield extract from ``base`` starting at ``offset`` for
+       ``width`` bits.
+   * - ``unsigned int __builtin_amdgcn_sbfe(unsigned int base, unsigned int offset, unsigned int width)``
+     - Signed bitfield extract.
+   * - ``unsigned int __builtin_amdgcn_lerp(unsigned int a, unsigned int b, unsigned int c)``
+     - Per-byte unsigned linear interpolation. Requires ``lerp-inst``.
+   * - ``unsigned int __builtin_amdgcn_perm(unsigned int a, unsigned int b, unsigned int sel)``
+     - Byte permutation. ``sel`` encodes which byte of the ``a:b`` pair to
+       select for each byte of the result. Requires ``gfx8-insts``.
+
+Conversion Builtins
+-------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``half2 __builtin_amdgcn_cvt_pkrtz(float a, float b)``
+     - Converts two f32 values to packed f16 with round-to-zero.
+   * - ``short2 __builtin_amdgcn_cvt_pknorm_i16(float a, float b)``
+     - Converts two f32 values to packed normalized i16. Requires
+       ``cvt-pknorm-vop2-insts``.
+   * - ``ushort2 __builtin_amdgcn_cvt_pknorm_u16(float a, float b)``
+     - Converts two f32 values to packed normalized u16.
+   * - ``short2 __builtin_amdgcn_cvt_pk_i16(int a, int b)``
+     - Packs two i32 values into i16x2.
+   * - ``ushort2 __builtin_amdgcn_cvt_pk_u16(unsigned int a, unsigned int b)``
+     - Packs two u32 values into u16x2.
+   * - ``unsigned int __builtin_amdgcn_cvt_pk_u8_f32(float val, unsigned int bytesel, unsigned int old)``
+     - Converts ``val`` to u8 and inserts at byte ``bytesel`` in ``old``.
+   * - ``float __builtin_amdgcn_cvt_off_f32_i4(int val)``
+     - Converts a 4-bit integer offset to f32.
+
+SAD (Sum of Absolute Differences)
+---------------------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Builtin
+     - Description
+   * - ``unsigned int __builtin_amdgcn_msad_u8(unsigned int a, unsigned int b, unsigned int c)``
+     - Masked sum of absolute differences of unsigned 8-bit values.
+   * - ``unsigned int __builtin_amdgcn_sad_u8(unsigned int a, unsigned int b, unsigned int c)``
+     - Sum of absolute differences of unsigned 8-bit values. Requires
+       ``sad-insts``.
+   * - ``unsigned int __builtin_amdgcn_sad_hi_u8(unsigned int a, unsigned int b, unsigned int c)``
+     - SAD with result in high 16 bits. Requires ``sad-insts``.
+   * - ``unsigned int __builtin_amdgcn_sad_u16(unsigned int a, unsigned int b, unsigned int c)``
+     - SAD of unsigned 16-bit values. Requires ``sad-insts``.
+   * - ``uint64_t __builtin_amdgcn_qsad_pk_u16_u8(uint64_t a, unsigned int b, uint64_t c)``
+     - Quad SAD packed. Requires ``qsad-insts``.
+   * - ``uint64_t __builtin_amdgcn_mqsad_pk_u16_u8(uint64_t a, unsigned int b, uint64_t c)``
+     - Masked quad SAD packed.
+   * - ``uint4 __builtin_amdgcn_mqsad_u32_u8(uint64_t a, unsigned int b, uint4 c)``
+     - Masked quad SAD returning 4x u32.
+
+Buffer Resource and Load/Store
+==============================
+
+make_buffer_rsrc
+----------------
+
+.. code-block:: c
+
+   __amdgpu_buffer_rsrc_t __builtin_amdgcn_make_buffer_rsrc(void *base, short stride, int64_t num_records, int flags);
+
+Constructs a buffer resource descriptor from the given fields:
+
+- ``base``: Base pointer.
+- ``stride``: Stride of structured buffer (0 for raw).
+- ``num_records``: Number of records (bytes for raw buffers).
+- ``flags``: SRD flags (DST_SEL, NUM_FORMAT, DATA_FORMAT, etc.).
+
+Raw Buffer Load/Store
+---------------------
+
+These builtins load/store data through a buffer resource descriptor.
+
+.. code-block:: c
+
+   // Stores
+   void __builtin_amdgcn_raw_buffer_store_b{8,16,32,64,96,128}(data, __amdgpu_buffer_rsrc_t rsrc, int offset, int soffset, _Constant int cachepolicy);
+   // Loads
+   T __builtin_amdgcn_raw_buffer_load_b{8,16,32,64,96,128}(__amdgpu_buffer_rsrc_t rsrc, int offset, int soffset, _Constant int cachepolicy);
+
+Arguments:
+
+- ``rsrc``: Buffer resource descriptor (128-bit SRD, typically SGPR).
+- ``offset``: Byte offset (VGPR or immediate). Included in bounds checking and
+  swizzling.
+- ``soffset``: Scalar byte offset (SGPR or immediate). Excluded from bounds
+  checking and swizzling.
+- ``cachepolicy``: Immediate bitfield controlling cache behavior:
+
+  - Pre-GFX12: bit 0 = GLC, bit 1 = SLC, bit 2 = DLC (gfx10/gfx11),
+    bit 3 = SWZ, bit 4 = SCC (gfx90a).
+  - GFX942: bit 0 = SC0, bit 1 = NT, bit 3 = SWZ, bit 4 = SC1.
+  - GFX12+: bits [0:2] = TH, bits [3:4] = scope, bit 6 = SWZ.
+  - All: bit 31 = volatile (stripped at lowering).
+
+The data types for each width are: ``b8`` = ``unsigned char``,
+``b16`` = ``unsigned short``, ``b32`` = ``unsigned int``,
+``b64`` = ``uint2``, ``b96`` = ``uint3``, ``b128`` = ``uint4``.
+
+Raw Ptr Buffer Atomics
+---...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/181193