[llvm] Handle VECREDUCE intrinsics in NVPTX backend (PR #136253)

Mon Apr 21 20:20:18 PDT 2025

================
@@ -2128,6 +2152,194 @@ NVPTXTargetLowering::LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) const {
   return DAG.getBuildVector(Node->getValueType(0), dl, Ops);
 }
 
+/// A generic routine for constructing a tree reduction on a vector operand.
+/// This method differs from iterative splitting in DAGTypeLegalizer by
+/// progressively grouping elements bottom-up.
----------------
Prince781 wrote:

> When vectors are loaded in chunks, it may be beneficial to reduce the first loaded chunk, while the subsequent ones may still be in flight. Attempting to use elements from different fragments of the vector will stall on data dependency until all of them arrive.

In this respect, iterative splitting comes out even worse. DAGTypeLegalizer splits the vector operand into two halves and then performs an element-wise partial reduction on them, so every operation pairs elements that sit half a vector apart. That is the maximum possible stride.

Example:
```
res: f32 = vecreduce_fadd reassoc <4 x f32> <f32 a, f32 b, f32 c, f32 d>
```

1. `<4 x f32>` is illegal for `vecreduce_fadd`, so split it:
```
split1: <2 x f32> = fadd <2 x f32> <f32 a, f32 b>, <2 x f32> <f32 c, f32 d>
res: f32 = vecreduce_fadd reassoc <2 x f32> split1
```

2. `<2 x f32>` is illegal for `vecreduce_fadd`, so split it:
```
split1: <2 x f32> = fadd <2 x f32> <f32 a, f32 b>, <2 x f32> <f32 c, f32 d>
split2: f32 = fadd f32 split1:0, f32 split1:1
res: f32 = vecreduce_fadd reassoc f32 split2
```

3. `<2 x f32>` is illegal for `fadd`, so split it:
```
split1_ac: f32 = fadd f32 a, f32 c
split1_bd: f32 = fadd f32 b, f32 d
split2: f32 = fadd f32 split1_ac, f32 split1_bd
res: f32 = vecreduce_fadd reassoc f32 split2
```

4. The scalar `vecreduce_fadd` can now be folded away:
```
split1_ac: f32 = fadd f32 a, f32 c
split1_bd: f32 = fadd f32 b, f32 d
res: f32 = fadd f32 split1_ac, f32 split1_bd
```
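
To make the difference in pairing concrete, here is a small standalone sketch (not code from this PR, and it does not use any LLVM API). Both helper names are hypothetical. `reduceBySplitting` mimics the halving steps above, and `reduceBottomUp` assumes that "grouping elements bottom-up" means combining neighboring elements first. Each returns the parenthesized expression it would build, and both assume a power-of-two element count:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of iterative splitting: each round adds the low half to
// the high half element-wise, so element i is paired with element i + half.
std::string reduceBySplitting(std::vector<std::string> v) {
  assert(!v.empty() && (v.size() & (v.size() - 1)) == 0 &&
         "sketch assumes a power-of-two element count");
  while (v.size() > 1) {
    std::size_t half = v.size() / 2;
    std::vector<std::string> next;
    for (std::size_t i = 0; i < half; ++i)
      next.push_back("(" + v[i] + "+" + v[i + half] + ")");
    v = std::move(next);
  }
  return v[0];
}

// Hypothetical model of a bottom-up tree: each round combines adjacent
// elements, so the first pair only depends on the first two elements.
std::string reduceBottomUp(std::vector<std::string> v) {
  assert(!v.empty() && (v.size() & (v.size() - 1)) == 0 &&
         "sketch assumes a power-of-two element count");
  while (v.size() > 1) {
    std::vector<std::string> next;
    for (std::size_t i = 0; i + 1 < v.size(); i += 2)
      next.push_back("(" + v[i] + "+" + v[i + 1] + ")");
    v = std::move(next);
  }
  return v[0];
}
```

For `{"a", "b", "c", "d"}`, `reduceBySplitting` yields `((a+c)+(b+d))`, matching step 4 above, while `reduceBottomUp` yields `((a+b)+(c+d))`. In the second form the first add only needs the leading chunk of the vector.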

We see this pattern in the test cases:

https://github.com/llvm/llvm-project/blob/b144258b0c0cc63dffba00a911d6539f00ed07bb/llvm/test/CodeGen/NVPTX/reduction-intrinsics.ll#L806-L814

https://github.com/llvm/llvm-project/pull/136253