[llvm] [LoopVectorizer] Add support for partial reductions (PR #92418)

Wed Sep 18 08:08:41 PDT 2024

================
@@ -6871,6 +6974,18 @@ void LoopVectorizationCostModel::collectValuesToIgnore() {
     const SmallVectorImpl<Instruction *> &Casts = IndDes.getCastInsts();
     VecValuesToIgnore.insert(Casts.begin(), Casts.end());
   }
+
+  // Ignore any values that we know will be flattened
+  for (auto It : getPartialReductionChains()) {
+    PartialReductionChain Chain = It.second;
+    SmallVector<Value *> PartialReductionValues{Chain.Reduction, Chain.BinOp,
+                                                Chain.ExtendA, Chain.ExtendB,
+                                                Chain.Accumulator};
+    ValuesToIgnore.insert(PartialReductionValues.begin(),
+                          PartialReductionValues.end());
+    VecValuesToIgnore.insert(PartialReductionValues.begin(),
----------------
huntergr-arm wrote:

NEON uses a VF of 16 because `shouldMaximizeVectorBandwidth()` returns true when the register kind is fixed-length vector. It doesn't lower to a dot product because we're looking for the partial reduction intrinsic, which isn't being emitted yet.
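
For reference, a minimal sketch of the kind of loop under discussion and of the VF arithmetic, assuming 128-bit NEON registers and i8 inputs accumulated into i32. `dotProduct` and `pickVF` are hypothetical names for illustration, not LLVM functions, and the mapping of the loop onto the chain fields in the diff is my reading of them; the real cost model considers much more than this division.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar loop with the shape the chain in the diff appears to describe:
// two extends (ExtendA, ExtendB), a multiply (BinOp), and an add
// (Reduction) into an accumulator.
int32_t dotProduct(const int8_t *A, const int8_t *B, size_t N) {
  int32_t Acc = 0;
  for (size_t I = 0; I < N; ++I)
    Acc += static_cast<int32_t>(A[I]) * static_cast<int32_t>(B[I]);
  return Acc;
}

// Hypothetical helper: when maximizing bandwidth the VF is derived from the
// smallest element type in the loop, otherwise from the widest one.
unsigned pickVF(unsigned RegisterBits, unsigned SmallestTypeBits,
                unsigned WidestTypeBits, bool MaximizeBandwidth) {
  assert(SmallestTypeBits > 0 && WidestTypeBits >= SmallestTypeBits);
  return RegisterBits / (MaximizeBandwidth ? SmallestTypeBits : WidestTypeBits);
}
```

With these assumptions, `pickVF(128, 8, 32, true)` gives 16 and `pickVF(128, 8, 32, false)` gives 4, which is the difference between the NEON and SVE behaviour described here.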

I wonder if (for now) we can try always maximizing the bandwidth if there's an add reduction in the loop and the target supports partial reductions, then deal with detecting the specifics during VPlan construction/transformation.
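
Roughly, the interim heuristic could look like the sketch below. It is a self-contained model, not a patch: `LoopSummary`, `TargetSummary`, and `shouldMaximizeBandwidth` are hypothetical stand-ins for whatever the cost model and TTI would actually expose.

```cpp
#include <cassert>

// Hypothetical summary of the reductions found in a loop.
struct LoopSummary {
  unsigned NumReductions;    // all reductions in the loop
  unsigned NumAddReductions; // the subset that are add reductions
};

// Hypothetical summary of the relevant target properties.
struct TargetSummary {
  bool MaximizesBandwidthByDefault; // e.g. fixed-length vectors on NEON today
  bool SupportsPartialReductions;
};

// Maximize bandwidth if the target already asks for it, or if the loop has
// an add reduction and the target supports partial reductions. Whether a
// partial reduction is actually formed is left to a later stage.
bool shouldMaximizeBandwidth(const LoopSummary &L, const TargetSummary &T) {
  assert(L.NumAddReductions <= L.NumReductions);
  if (T.MaximizesBandwidthByDefault)
    return true;
  return T.SupportsPartialReductions && L.NumAddReductions > 0;
}
```

The check is deliberately coarse: it only decides whether wider VFs are considered, and it may enable them for add reductions that never become partial reductions.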

I didn't see any major swings in SPEC CPU 2017 performance when enabling max vector bandwidth for SVE, but one benchmark didn't run with that change in, so there might be a build failure. I'll take a closer look.

https://github.com/llvm/llvm-project/pull/92418