[Mlir-commits] [mlir] [mlir][AMDGPU] Implement gpu.subgroup_reduce with DPP intrinsics on AMD GPUs (PR #133204)

Wed Apr 16 13:53:12 PDT 2025

================
@@ -362,6 +366,164 @@ struct VectorSubgroupReduceToShuffles final
   unsigned shuffleBitwidth = 0;
   bool matchClustered = false;
 };
+
+std::optional<Value> createSubgroupDPPReduction(OpBuilder &b, Location loc,
+                                                Value input,
+                                                gpu::AllReduceOperation mode,
+                                                const ClusterInfo &ci,
+                                                amdgpu::Chipset chipset) {
+  Value result = input;
+  constexpr int allRows = 0xf;
+  constexpr int allBanks = 0xf;
+  const bool boundCtrl = true;
+  Value lane0 =
+      b.create<arith::ConstantOp>(loc, b.getI32Type(), b.getI32IntegerAttr(0));
+  Value lane32 =
+      b.create<arith::ConstantOp>(loc, b.getI32Type(), b.getI32IntegerAttr(32));
+
+  auto dppReduceAcrossLanes = [&](int numLanes,
+                                  Value res) -> std::optional<Value> {
+    Value dppResult, laneVal;
+
+    switch (numLanes) {
+    case 2:
----------------
krzysz00 wrote:

I think the `if >= 2`, `if >= 4`, ... scheme we had before makes the fallthrough more obvious than a `switch` on `numLanes` does.
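
For illustration only, here is a minimal sketch of the cascading-`if` shape referred to above. It is not the code from the PR or from the earlier revision: `lanesCombinedPerStage` and its parameter are hypothetical names, and each stage only records a lane count instead of emitting a DPP op. The point is that every stage up to the cluster size runs in order, with no `switch` fallthrough to track.

```cpp
#include <vector>

// Hypothetical stand-in for the reduction lambda: records which stages
// would run for a given cluster size, rather than building any MLIR ops.
std::vector<unsigned> lanesCombinedPerStage(unsigned clusterSize) {
  std::vector<unsigned> stages;
  // Each check is independent, so a cluster of 8 runs the 2-, 4-, and
  // 8-lane stages in order. The "fallthrough" is simply the next `if`.
  if (clusterSize >= 2)
    stages.push_back(2);
  if (clusterSize >= 4)
    stages.push_back(4);
  if (clusterSize >= 8)
    stages.push_back(8);
  if (clusterSize >= 16)
    stages.push_back(16);
  if (clusterSize >= 32)
    stages.push_back(32);
  if (clusterSize >= 64)
    stages.push_back(64);
  return stages;
}
```

With a `switch`, the same behavior depends on the case ordering and on every case deliberately omitting `break`, which is easier to misread in review.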

https://github.com/llvm/llvm-project/pull/133204