[llvm] [AArch64][SVE] Improve code quality of vector unsigned/signed add reductions. (PR #97339)

Mon Jul 15 06:20:02 PDT 2024

================
@@ -17455,6 +17456,77 @@ static SDValue performVecReduceAddCombineWithUADDLP(SDNode *N,
   return DAG.getNode(ISD::VECREDUCE_ADD, DL, MVT::i32, UADDLP);
 }
 
+// Turn [sign|zero]_extend(vecreduce_add()) into SVE's  SADDV|UADDV
+// instructions.
+static SDValue
+performVecReduceAddExtCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI,
+                              const AArch64TargetLowering &TLI) {
+  if (N->getOperand(0).getOpcode() != ISD::ZERO_EXTEND &&
+      N->getOperand(0).getOpcode() != ISD::SIGN_EXTEND)
+    return SDValue();
+  bool IsSigned = N->getOperand(0).getOpcode() == ISD::SIGN_EXTEND;
+
+  SelectionDAG &DAG = DCI.DAG;
+  auto &Subtarget = DAG.getSubtarget<AArch64Subtarget>();
+  SDValue VecOp = N->getOperand(0).getOperand(0);
+  SDLoc DL(N);
+
+  bool IsScalableType = VecOp.getValueType().isScalableVector();
+  std::deque<SDValue> ResultValues;
+  ResultValues.push_back(VecOp);
+
+  // Split the input vectors if not legal.
----------------
sdesmalen-arm wrote:

My suggestion would be to use the free recursion that SelectionDAG gives, rather than writing an explicit loop. This aligns with how most legalization code in SelectionDAG works and makes the code a little simpler.

That would require distinguishing two cases:

CASE 1: If the input operand to the extend requires splitting, then split the operation into two VECREDUCE_ADD operations with a scalar ADD to combine their results, and return that value.

Example:
```
i32 (vecreduce_add (zext nxv32i8 %op to nxv32i32))
->
i32 (add
  (i32 vecreduce_add (zext nxv16i8 %op.lo to nxv16i32)),
  (i32 vecreduce_add (zext nxv16i8 %op.hi to nxv16i32)))
```

The DAGCombiner would revisit all nodes created above. So when it revisits the vecreduce_add nodes and the input operand to the extend is legal, the combine can map this directly to a UADDV/SADDV operation.

CASE 2: If the input operand to the extend is legal, then map directly to a UADDV/SADDV operation.

Example:
```
i32 (vecreduce_add (zext nxv16i8 %op to nxv16i32))
->
i32 (UADDV nxv16i8:%op)
```
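
As a scalar illustration of why the combine has to track signedness (the `IsSigned` flag in the patch), here is a minimal sketch. `extendAndReduce` is a hypothetical helper, not an LLVM or SVE API; it only models "extend each i8 element, then add them all up":

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of vecreduce_add over extended i8 elements.
// IsSigned == true models sign_extend (the SADDV case); false models
// zero_extend (the UADDV case).
static std::int64_t extendAndReduce(const std::uint8_t *Data, std::size_t N,
                                    bool IsSigned) {
  assert((Data != nullptr || N == 0) && "non-empty input needs a valid pointer");
  std::int64_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I) {
    int V = Data[I];          // zero-extended value, 0..255
    if (IsSigned && V >= 128) // reinterpret the same bits as a signed i8
      V -= 256;
    Sum += V;
  }
  return Sum;
}
```

For the bytes `{0xFF, 0x01}` the zero-extended sum is 256 while the sign-extended sum is 0, so the two extends cannot share one reduction.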

Locally I made some hacky changes to your patch to try this, and found that it seems to improve codegen for fixed-length vectors a little bit. It also simplifies the logic in this function, because there is no longer a need for a deque or for iterating over parts of the input vector.
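
The shape of the two cases can be sketched in plain C++, outside of SelectionDAG. This is only a toy model under stated assumptions: `kLegalElts`, `reduceLegal` and `reduceAddZext` are hypothetical names, and the explicit recursion stands in for the DAGCombiner revisiting the newly created vecreduce_add nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the widest legal input (nxv16i8 holds 16 bytes
// at vscale = 1). This is not an LLVM constant.
constexpr std::size_t kLegalElts = 16;

// Models CASE 2: the input is legal, so one widening unsigned add reduction
// handles it directly.
static std::uint32_t reduceLegal(const std::uint8_t *Data, std::size_t N) {
  assert(N <= kLegalElts && "caller must split illegal widths first");
  std::uint32_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I)
    Sum += Data[I]; // each i8 is zero-extended before the add
  return Sum;
}

// Models CASE 1: split the input in half, reduce each half, and combine the
// two partial results with a scalar add. No queue of pending parts is needed.
static std::uint32_t reduceAddZext(const std::uint8_t *Data, std::size_t N) {
  if (N <= kLegalElts)
    return reduceLegal(Data, N);
  std::size_t Half = N / 2;
  return reduceAddZext(Data, Half) + reduceAddZext(Data + Half, N - Half);
}
```

Each call either handles a legal input or splits once and defers the rest, which is the property that removes the need for the deque in the patch.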

https://github.com/llvm/llvm-project/pull/97339