[llvm-branch-commits] [llvm] [AMDGPU] Update documentation for wave reduction intrinsics (PR #175132)

Wed Jan 21 01:54:47 PST 2026

https://github.com/easyonaadit updated https://github.com/llvm/llvm-project/pull/175132

>From de56a2be7ba1aa266c8c57f61c915f6e877dc8ab Mon Sep 17 00:00:00 2001
From: Aaditya <Aaditya.AlokDeshpande at amd.com>
Date: Fri, 9 Jan 2026 12:05:04 +0530
Subject: [PATCH] [AMDGPU] Update documentation for wave reduction intrinsics

---
 llvm/docs/AMDGPUUsage.rst | 74 ++++++++++++++++++++++++++++++++++++---
 1 file changed, 70 insertions(+), 4 deletions(-)
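
Note (illustration only, not part of the patch): a minimal Python sketch of
the integer reductions documented below, assuming the result is a plain fold
over the per-lane values supplied by the caller. The names wave_reduce and
to_signed are hypothetical helpers, not LLVM APIs. Wrap-around at the operand
width for `add` is an assumption, and the strategy operand is omitted because
the documentation describes it as a hint. The floating-point variants and
`sub` are deliberately left out.

```python
from functools import reduce
import operator

# Bit masks for the two integer widths the documentation lists (i32, i64).
MASK = {32: 0xFFFFFFFF, 64: 0xFFFFFFFFFFFFFFFF}


def to_signed(value, bits):
    """Reinterpret an unsigned bit pattern as a two's-complement integer."""
    value &= MASK[bits]
    return value - (1 << bits) if value >> (bits - 1) else value


def wave_reduce(op, lane_values, bits=32):
    """Hypothetical reference model: fold one value per lane into one result.

    Values are treated as raw bit patterns of the given width, and the
    result is returned as an unsigned bit pattern of the same width.
    """
    mask = MASK[bits]
    vals = [v & mask for v in lane_values]
    if not vals:
        raise ValueError("at least one lane value is required")
    if op == "umin":
        return min(vals)
    if op == "umax":
        return max(vals)
    if op == "min":  # signed comparison on the same bit patterns
        return min(vals, key=lambda v: to_signed(v, bits))
    if op == "max":
        return max(vals, key=lambda v: to_signed(v, bits))
    if op == "add":  # assumption: wraps modulo 2**bits
        return sum(vals) & mask
    if op == "and":
        return reduce(operator.and_, vals)
    if op == "or":
        return reduce(operator.or_, vals)
    if op == "xor":
        return reduce(operator.xor, vals)
    raise ValueError(f"unsupported op: {op}")


lanes = [5, 0xFFFFFFFF, 7]  # 0xFFFFFFFF is -1 when read as a signed i32
print(hex(wave_reduce("umin", lanes)))  # unsigned view picks 5
print(hex(wave_reduce("min", lanes)))   # signed view picks the -1 pattern
```

The same bit pattern gives different answers for `umin` and `min`, which is
the distinction the new `min`/`max` entries draw against `umin`/`umax`.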

diff --git a/llvm/docs/AMDGPUUsage.rst b/llvm/docs/AMDGPUUsage.rst
index 994affe0cfa06..d4ae6987633d2 100644
--- a/llvm/docs/AMDGPUUsage.rst
+++ b/llvm/docs/AMDGPUUsage.rst
@@ -1378,9 +1378,19 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                    0: Target default preference,
                                                    1: `Iterative strategy`, and
                                                    2: `DPP`.
-                                                   If target does not support the DPP operations (e.g. gfx6/7),
+                                                   If the target does not support the DPP operations (e.g. gfx6/7),
                                                    reduction will be performed using default iterative strategy.
-                                                   Intrinsic is currently only implemented for i32.
+                                                   Intrinsic is implemented for i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.min                      Similar to `llvm.amdgcn.wave.reduce.umin`, but performs a signed min
+                                                   reduction on signed integers.
+                                                   Intrinsic is implemented for i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.fmin                     Similar to `llvm.amdgcn.wave.reduce.umin`, but performs a floating point min
+                                                   reduction on floating point values.
+                                                   Intrinsic is implemented for float and double types.
+                                                   NaN values are not canonicalized.
+                                                   The ordering behaviour of signaling NaNs (sNaNs) is non-deterministic.
 
   llvm.amdgcn.wave.reduce.umax                     Performs an arithmetic unsigned max reduction on the unsigned values
                                                    provided by each lane in the wavefront.
@@ -1388,9 +1398,65 @@ The AMDGPU backend implements the following LLVM IR intrinsics.
                                                    0: Target default preference,
                                                    1: `Iterative strategy`, and
                                                    2: `DPP`.
-                                                   If target does not support the DPP operations (e.g. gfx6/7),
+                                                   If the target does not support the DPP operations (e.g. gfx6/7),
                                                    reduction will be performed using default iterative strategy.
-                                                   Intrinsic is currently only implemented for i32.
+                                                   Intrinsic is implemented for i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.max                      Similar to `llvm.amdgcn.wave.reduce.umax`, but performs a signed max
+                                                   reduction on signed integers.
+                                                   Intrinsic is implemented for i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.fmax                     Similar to `llvm.amdgcn.wave.reduce.umax`, but performs a floating point max
+                                                   reduction on floating point values.
+                                                   Intrinsic is implemented for float and double types.
+                                                   NaN values are not canonicalized.
+                                                   The ordering behaviour of signaling NaNs (sNaNs) is non-deterministic.
+
+  llvm.amdgcn.wave.reduce.add                      Performs an arithmetic add reduction on the signed/unsigned values
+                                                   provided by each lane in the wavefront.
+                                                   Intrinsic takes a hint for reduction strategy using second operand
+                                                   0: Target default preference,
+                                                   1: `Iterative strategy`, and
+                                                   2: `DPP`.
+                                                   If the target does not support the DPP operations (e.g. gfx6/7),
+                                                   reduction will be performed using default iterative strategy.
+                                                   Intrinsic is implemented for signed/unsigned i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.fadd                     Similar to `llvm.amdgcn.wave.reduce.add`, but performs a floating point add
+                                                   reduction on floating point values.
+                                                   Intrinsic is implemented for float and double types.
+
+  llvm.amdgcn.wave.reduce.sub                      Performs an arithmetic sub reduction on the signed/unsigned values
+                                                   provided by each lane in the wavefront.
+                                                   Intrinsic takes a hint for reduction strategy using second operand
+                                                   0: Target default preference,
+                                                   1: `Iterative strategy`, and
+                                                   2: `DPP`.
+                                                   If the target does not support the DPP operations (e.g. gfx6/7),
+                                                   reduction will be performed using default iterative strategy.
+                                                   Intrinsic is implemented for signed/unsigned i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.fsub                     Similar to `llvm.amdgcn.wave.reduce.sub`, but performs a floating point sub
+                                                   reduction on floating point values.
+                                                   Intrinsic is implemented for float and double types.
+
+  llvm.amdgcn.wave.reduce.and                      Performs a bitwise-and reduction on the values
+                                                   provided by each lane in the wavefront.
+                                                   Intrinsic takes a hint for reduction strategy using second operand
+                                                   0: Target default preference,
+                                                   1: `Iterative strategy`, and
+                                                   2: `DPP`.
+                                                   If the target does not support the DPP operations (e.g. gfx6/7),
+                                                   reduction will be performed using default iterative strategy.
+                                                   Intrinsic is implemented for i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.or                       Similar to `llvm.amdgcn.wave.reduce.and`, but performs a bitwise-or
+                                                   reduction on the values provided by each lane in the wavefront.
+                                                   Intrinsic is implemented for i32 and i64 types.
+
+  llvm.amdgcn.wave.reduce.xor                      Similar to `llvm.amdgcn.wave.reduce.and`, but performs a bitwise-xor
+                                                   reduction on the values provided by each lane in the wavefront.
+                                                   Intrinsic is implemented for i32 and i64 types.
 
   llvm.amdgcn.permlane16                           Provides direct access to v_permlane16_b32. Performs arbitrary gather-style
                                                    operation within a row (16 contiguous lanes) of the second input operand.