[Mlir-commits] [mlir] [mlir][amdgpu] Revise AMDGPU dialect DPP documentation (PR #182639)

Fri Feb 20 18:04:34 PST 2026

https://github.com/efric updated https://github.com/llvm/llvm-project/pull/182639

>From 470a4322624f1d7d2671709c34c700d60505bf35 Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 16:18:42 -0800
Subject: [PATCH 1/7] augment dpp docs

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 .../mlir/Dialect/AMDGPU/IR/AMDGPUOps.td       | 101 +++++++++++++++---
 1 file changed, 87 insertions(+), 14 deletions(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index 589a4a798f3a8..a01a8b8b7cf03 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -663,20 +663,93 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
                  DefaultValuedAttr<BoolAttr, "false">:$bound_ctrl)> {
   let summary = "AMDGPU DPP operation";
   let description = [{
-    This operation represents DPP functionality in a GPU program.
-     DPP provides the following operations:
-    - Full crossbar in a group of four (`quad_perm`)
-    - Wavefront shift left by one lane (`wave_shl`)
-    - Wavefront shift right by one lane (`wave_shr`)
-    - Wavefront rotate right by one lane (`wave_ror`)
-    - Wavefront rotate left by one lane (`wave_rol`)
-    - Row shift left by 1–15 lanes (`row_shl`)
-    - Row shift right by 1–15 lanes (`row_shr`)
-    - Row rotate right by 1–15 lanes (`row_ror`)
-    - Reverse within a row (`row_mirror`)
-    - Reverse within a half-row (`row_half_mirror`)
-    - Broadcast the 15th lane of each row to the next row (`row_bcast`)
-    - Broadcast lane 31 to rows 2 and 3 (`row_bcast`)
+    The `amdgpu.dpp` op performs a Data Parallel Primitives (DPP) lane
+    permutation on a source value within a wavefront. Each lane reads its
+    source data from another lane according to the permutation mode specified
+    by `kind`. DPP operates at dword (32-bit) granularity: sub-32-bit types
+    (e.g., f16, i16) are packed into an i32 during lowering, permuted, and
+    extracted back.
+
+    A Wave64 wavefront has 64 lanes (0--63) organized hierarchically:
+    - 4 rows of 16 lanes each: row 0 = lanes 0--15, row 1 = lanes 16--31,
+      row 2 = lanes 32--47, row 3 = lanes 48--63.
+    - Each row is divided into 4 banks of 4 consecutive lanes: bank 0 =
+      lanes 0-3, bank 1 = lanes 4-7, bank 2 = lanes 8-11, bank 3 =
+      lanes 12-15 (lane numbers shown for row 0; add 16/32/48 for other rows).
+
+    The `kind` attribute selects the permutation. Some modes require a
+    `permArgument`; others take no argument.
+
+    Quad permutation:
+    - `quad_perm([a, b, c, d])`: Full crossbar within each group of 4
+      consecutive lanes (a quad). Each element is in [0, 3] and selects which
+      lane within the quad to read from. Lane 4k+i reads from lane 4k+perm[i].
+      For example, `quad_perm([1, 0, 3, 2])` swaps adjacent pairs within
+      every quad.
+
+    Row shifts and rotates (operate within each 16-lane row independently):
+    - `row_shl(N)`: Shift left by N (1--15) within the row. Lane n reads from
+      lane (n % 16) + N in the same row. Lanes where the source index exceeds
+      15 are out of bounds (see `bound_ctrl`).
+    - `row_shr(N)`: Shift right by N (1--15) within the row. Lane n reads from
+      lane (n % 16) - N in the same row. Lanes where the source index is
+      negative are out of bounds.
+    - `row_ror(N)`: Rotate right by N (1--15) within the row. Lane n reads from
+      lane ((n % 16) - N) mod 16 in the same row. Always in bounds.
+
+    Wavefront shifts and rotates (operate across all 64 lanes):
+    - `wave_shl`: Shift left by 1. Lane n reads from lane n + 1. Lane 63 is
+      out of bounds.
+    - `wave_shr`: Shift right by 1. Lane n reads from lane n - 1. Lane 0 is
+      out of bounds.
+    - `wave_rol`: Rotate left by 1. Lane n reads from lane (n + 1) mod 64.
+    - `wave_ror`: Rotate right by 1. Lane n reads from lane (n - 1) mod 64.
+
+    Row mirrors:
+    - `row_mirror`: Reverse lanes within each 16-lane row. Lane n reads from
+      lane 15 - (n % 16) within its row.
+    - `row_half_mirror`: Reverse within each 8-lane half-row. Lane n reads
+      from lane 7 - (n % 8) within its half-row.
+
+    Row broadcasts:
+    - `row_bcast_15`: Lane 15 of each row broadcasts to all lanes of the next
+      row. Lanes in row 0 are not affected (retain `old`).
+    - `row_bcast_31`: Lane 31 broadcasts to all lanes in rows 2 and 3.
+      Lanes in rows 0 and 1 are not affected (retain `old`).
+
+    Example:
+    ```mlir
+    // Swap adjacent pairs within each quad (lanes 0<->1, 2<->3, etc.)
+    %0 = amdgpu.dpp %old %src quad_perm([1, 0, 3, 2]) : i32
+
+    // Shift right by 1 lane within each 16-lane row.
+    // bound_ctrl=true -> lanes that would read past the row return 0.
+    // row_mask=0x5 (0b0101) -> only rows 0 and 2 apply the shift;
+    // rows 1 and 3 pass through %old unchanged.
+    %1 = amdgpu.dpp %old %src row_shr(0x1 : i32)
+      { row_mask = 0x5 : i32, bound_ctrl = true } : f32
+
+    // Rotate left across the full wavefront by 1 lane
+    %2 = amdgpu.dpp %old %src wave_rol : i32
+    ```
+
+    Operands:
+    * `$old`: Fallback value. Lanes that are masked off by `row_mask` /
+      `bank_mask` retain `old`. For lanes with an out-of-bounds source, behavior
+      depends on `bound_ctrl`.
+    * `$src`: Source value to be permuted across lanes.
+    * `$kind`: A `#amdgpu.dpp_perm` enum selecting the permutation mode.
+    * `$permArgument`: Mode-specific argument. Required for `quad_perm`
+      (array of 4 integers in [0, 3]) and `row_shl`/`row_shr`/`row_ror`
+      (integer in [1, 15]). Absent for all other modes.
+    * `$row_mask` (default 0xf): 4-bit mask controlling which rows write
+      results. Bit i enables row i (bit 0 = lanes 0-15, bit 1 = lanes
+      16-31, etc.). Disabled lanes retain `old`.
+    * `$bank_mask` (default 0xf): 4-bit mask controlling which banks write
+      results. Bit i enables bank i (bit 0 = lanes 0-3, 16-19, 32-35, 48-51).
+      Disabled lanes retain `old`.
+    * `$bound_ctrl` (default false): When false, out of bounds lanes retain
+      `old`. When true, out-of-bounds lanes receive zero.
   }];
   let results = (outs AnyType:$result);
   let assemblyFormat = [{
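
Reviewer-style illustration of the lane mappings documented in PATCH 1/7. This is a minimal Python sketch, not part of the patch and not the lowering the dialect uses. The function name `dpp_source_lane`, the string spelling of the modes, and the `wave_size` parameter are assumptions made for illustration. It returns `None` where the description calls a lane out of bounds, and it leaves out the row broadcasts because their unaffected lanes retain `old` rather than being out of bounds.

```python
ROW = 16  # lanes per row, as stated in the description


def dpp_source_lane(kind, lane, arg=None, wave_size=64):
    """Return the lane that `lane` reads from, or None if out of bounds.

    Hypothetical helper: `kind` is the mode name as a string and `arg`
    stands in for permArgument (a list of 4 ints for quad_perm, an int
    in [1, 15] for the row shifts and rotate).
    """
    row_base = lane - lane % ROW  # first lane of this lane's row
    r = lane % ROW                # position of the lane inside its row
    if kind == "quad_perm":
        # Lane 4k+i reads from lane 4k+perm[i].
        return (lane - lane % 4) + arg[lane % 4]
    if kind == "row_shl":
        return row_base + r + arg if r + arg < ROW else None
    if kind == "row_shr":
        return row_base + r - arg if r - arg >= 0 else None
    if kind == "row_ror":
        return row_base + (r - arg) % ROW
    if kind == "wave_shl":
        return lane + 1 if lane + 1 < wave_size else None
    if kind == "wave_shr":
        return lane - 1 if lane >= 1 else None
    if kind == "wave_rol":
        return (lane + 1) % wave_size
    if kind == "wave_ror":
        return (lane - 1) % wave_size
    if kind == "row_mirror":
        return row_base + (ROW - 1 - r)
    if kind == "row_half_mirror":
        return (lane - lane % 8) + (7 - lane % 8)
    raise ValueError(f"mode not modeled in this sketch: {kind}")
```

For example, `dpp_source_lane("quad_perm", 5, [1, 0, 3, 2])` gives 4, and `dpp_source_lane("row_shl", 15, 1)` gives `None` because the source index would exceed 15.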

>From 296bb8e9961575cf5905cfff02b8e6c62ba8d16e Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 16:21:46 -0800
Subject: [PATCH 2/7] nits

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index a01a8b8b7cf03..d324dbd4b80f4 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -671,8 +671,8 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     extracted back.
 
     A Wave64 wavefront has 64 lanes (0--63) organized hierarchically:
-    - 4 rows of 16 lanes each: row 0 = lanes 0--15, row 1 = lanes 16--31,
-      row 2 = lanes 32--47, row 3 = lanes 48--63.
+    - 4 rows of 16 lanes each: row 0 = lanes 0-15, row 1 = lanes 16-31,
+      row 2 = lanes 32-47, row 3 = lanes 48-63.
     - Each row is divided into 4 banks of 4 consecutive lanes: bank 0 =
       lanes 0-3, bank 1 = lanes 4-7, bank 2 = lanes 8-11, bank 3 =
       lanes 12-15 (lane numbers shown for row 0; add 16/32/48 for other rows).

>From a4b59cb9f577498b436bd825f00530e625f75cfd Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 16:23:39 -0800
Subject: [PATCH 3/7] nits

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index d324dbd4b80f4..156293ff62db1 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -688,13 +688,13 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
       every quad.
 
     Row shifts and rotates (operate within each 16-lane row independently):
-    - `row_shl(N)`: Shift left by N (1--15) within the row. Lane n reads from
+    - `row_shl(N)`: Shift left by N (1-15) within the row. Lane n reads from
       lane (n % 16) + N in the same row. Lanes where the source index exceeds
       15 are out of bounds (see `bound_ctrl`).
-    - `row_shr(N)`: Shift right by N (1--15) within the row. Lane n reads from
+    - `row_shr(N)`: Shift right by N (1-15) within the row. Lane n reads from
       lane (n % 16) - N in the same row. Lanes where the source index is
       negative are out of bounds.
-    - `row_ror(N)`: Rotate right by N (1--15) within the row. Lane n reads from
+    - `row_ror(N)`: Rotate right by N (1-15) within the row. Lane n reads from
       lane ((n % 16) - N) mod 16 in the same row. Always in bounds.
 
     Wavefront shifts and rotates (operate across all 64 lanes):

>From 2b2140898be3851831827434222cabe24cac7cb6 Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 16:25:32 -0800
Subject: [PATCH 4/7] nits

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index 156293ff62db1..4a7b290152a76 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -670,7 +670,7 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     (e.g., f16, i16) are packed into an i32 during lowering, permuted, and
     extracted back.
 
-    A Wave64 wavefront has 64 lanes (0--63) organized hierarchically:
+    A Wave64 wavefront has 64 lanes (0-63) organized hierarchically:
     - 4 rows of 16 lanes each: row 0 = lanes 0-15, row 1 = lanes 16-31,
       row 2 = lanes 32-47, row 3 = lanes 48-63.
     - Each row is divided into 4 banks of 4 consecutive lanes: bank 0 =

>From 94786dc525e7a33c1eaea96dfd5c617bf61238c3 Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 17:07:03 -0800
Subject: [PATCH 5/7] nits

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index 4a7b290152a76..a5fbd6b583127 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -720,13 +720,13 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     Example:
     ```mlir
     // Swap adjacent pairs within each quad (lanes 0<->1, 2<->3, etc.)
-    %0 = amdgpu.dpp %old %src quad_perm([1, 0, 3, 2]) : i32
+    %0 = amdgpu.dpp %old %src quad_perm( [1, 0, 3, 2] ) : i32
 
     // Shift right by 1 lane within each 16-lane row.
     // bound_ctrl=true -> lanes that would read past the row return 0.
     // row_mask=0x5 (0b0101) -> only rows 0 and 2 apply the shift;
     // rows 1 and 3 pass through %old unchanged.
-    %1 = amdgpu.dpp %old %src row_shr(0x1 : i32)
+    %1 = amdgpu.dpp %old %src row_shr( 0x1 : i32 )
       { row_mask = 0x5 : i32, bound_ctrl = true } : f32
 
     // Rotate left across the full wavefront by 1 lane

>From 3bb3a969e2eef9570081ed6edd869891f25f73e3 Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 17:43:34 -0800
Subject: [PATCH 6/7] nit

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index a5fbd6b583127..f5658bd859fcf 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -681,7 +681,7 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     `permArgument`; others take no argument.
 
     Quad permutation:
-    - `quad_perm([a, b, c, d])`: Full crossbar within each group of 4
+    - `quad_perm([a, b, c, d])`: Full permute within each group of 4
       consecutive lanes (a quad). Each element is in [0, 3] and selects which
       lane within the quad to read from. Lane 4k+i reads from lane 4k+perm[i].
       For example, `quad_perm([1, 0, 3, 2])` swaps adjacent pairs within

>From c21b89f417f7a54312b2cbbe24ec45596bf6e9e9 Mon Sep 17 00:00:00 2001
From: Eric Feng <Eric.Feng at amd.com>
Date: Fri, 20 Feb 2026 18:02:02 -0800
Subject: [PATCH 7/7] address rdna

Signed-off-by: Eric Feng <Eric.Feng at amd.com>
---
 .../mlir/Dialect/AMDGPU/IR/AMDGPUOps.td       | 24 +++++++++++--------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
index f5658bd859fcf..bc88877247546 100644
--- a/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
+++ b/mlir/include/mlir/Dialect/AMDGPU/IR/AMDGPUOps.td
@@ -670,9 +670,11 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     (e.g., f16, i16) are packed into an i32 during lowering, permuted, and
     extracted back.
 
-    A Wave64 wavefront has 64 lanes (0-63) organized hierarchically:
-    - 4 rows of 16 lanes each: row 0 = lanes 0-15, row 1 = lanes 16-31,
-      row 2 = lanes 32-47, row 3 = lanes 48-63.
+    - Lanes are organized into rows of 16.
+    - A Wave64 wavefront has 4 rows of 16 lanes each: row 0 = lanes 0-15,
+      row 1 = lanes 16-31, row 2 = lanes 32-47, row 3 = lanes 48-63.
+    - Similarly, a Wave32 wavefront has two rows of 16 lanes each, organized
+      in the same fashion.
     - Each row is divided into 4 banks of 4 consecutive lanes: bank 0 =
       lanes 0-3, bank 1 = lanes 4-7, bank 2 = lanes 8-11, bank 3 =
       lanes 12-15 (lane numbers shown for row 0; add 16/32/48 for other rows).
@@ -697,13 +699,15 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     - `row_ror(N)`: Rotate right by N (1-15) within the row. Lane n reads from
       lane ((n % 16) - N) mod 16 in the same row. Always in bounds.
 
-    Wavefront shifts and rotates (operate across all 64 lanes):
-    - `wave_shl`: Shift left by 1. Lane n reads from lane n + 1. Lane 63 is
-      out of bounds.
+    Wavefront shifts and rotates (not available on RDNA):
+    - `wave_shl`: Shift left by 1. Lane n reads from lane n + 1. The last lane
+      in the wavefront is out of bounds.
     - `wave_shr`: Shift right by 1. Lane n reads from lane n - 1. Lane 0 is
       out of bounds.
-    - `wave_rol`: Rotate left by 1. Lane n reads from lane (n + 1) mod 64.
-    - `wave_ror`: Rotate right by 1. Lane n reads from lane (n - 1) mod 64.
+    - `wave_rol`: Rotate left by 1. Lane n reads from lane (n + 1) mod W, where
+      W is the wavefront size.
+    - `wave_ror`: Rotate right by 1. Lane n reads from lane (n - 1) mod W, where
+      W is the wavefront size.
 
     Row mirrors:
     - `row_mirror`: Reverse lanes within each 16-lane row. Lane n reads from
@@ -711,7 +715,7 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
     - `row_half_mirror`: Reverse within each 8-lane half-row. Lane n reads
       from lane 7 - (n % 8) within its half-row.
 
-    Row broadcasts:
+    Row broadcasts (not available on RDNA):
     - `row_bcast_15`: Lane 15 of each row broadcasts to all lanes of the next
       row. Lanes in row 0 are not affected (retain `old`).
     - `row_bcast_31`: Lane 31 broadcasts to all lanes in rows 2 and 3.
@@ -746,7 +750,7 @@ def AMDGPU_DPPOp : AMDGPU_Op<"dpp",
       results. Bit i enables row i (bit 0 = lanes 0-15, bit 1 = lanes
       16-31, etc.). Disabled lanes retain `old`.
     * `$bank_mask` (default 0xf): 4-bit mask controlling which banks write
-      results. Bit i enables bank i (bit 0 = lanes 0-3, 16-19, 32-35, 48-51).
+      results. Bit i enables bank i in every row (bit 0 = lanes 0-3, 16-19, etc.).
       Disabled lanes retain `old`.
     * `$bound_ctrl` (default false): When false, out of bounds lanes retain
       `old`. When true, out-of-bounds lanes receive zero.
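
Illustration of how `row_mask`, `bank_mask`, and `bound_ctrl` combine, following the operand descriptions in the patch. This is a minimal Python sketch for `row_shr` only. The helper names `lane_enabled` and `dpp_row_shr` are hypothetical, and treating a lane as enabled only when both its row bit and its bank bit are set is an assumption drawn from the statement that lanes disabled by either mask retain `old`.

```python
ROW, BANK = 16, 4  # lanes per row and lanes per bank, per the description


def lane_enabled(lane, row_mask=0xF, bank_mask=0xF):
    """True if both the row bit and the bank bit for `lane` are set."""
    row = lane // ROW
    bank = (lane % ROW) // BANK
    return bool((row_mask >> row) & 1) and bool((bank_mask >> bank) & 1)


def dpp_row_shr(old, src, n, row_mask=0xF, bank_mask=0xF, bound_ctrl=False):
    """Model row_shr(n) over per-lane Python lists `old` and `src`."""
    result = []
    for lane in range(len(src)):
        if not lane_enabled(lane, row_mask, bank_mask):
            result.append(old[lane])      # masked-off lanes retain old
        elif lane % ROW - n >= 0:
            result.append(src[lane - n])  # in-bounds read within the row
        else:
            # Out-of-bounds source: zero if bound_ctrl, otherwise old.
            result.append(0 if bound_ctrl else old[lane])
    return result
```

With `row_mask=0x5` and `bound_ctrl=True`, as in the `row_shr` example from the description, lane 0 of rows 0 and 2 receives 0 and every lane of rows 1 and 3 keeps its `old` value.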