[llvm] [AArch64][LV] Reduce cost of scaled reduction extends (PR #134074)

Thu Apr 3 05:42:51 PDT 2025

================
@@ -1757,6 +1766,43 @@ void VPWidenCastRecipe::execute(VPTransformState &State) {
     setFlags(CastOp);
 }
 
+// Detects whether the extension should be folded away into a combined
+// target instruction, and therefore given a cost of 0.
+// Handles patterns similar to the following:
+//   * partial_reduce(ext, phi)
+//   * partial_reduce(mul(ext, ext), phi)
+//   * partial_reduce(sub(0, mul(ext, ext)), phi)
+static bool isScaledReductionExtension(const VPWidenCastRecipe *Extend) {
+  unsigned Opcode = Extend->getOpcode();
+  if (Opcode != Instruction::SExt && Opcode != Instruction::ZExt)
+    return false;
+
+  // Check that all users are either a partial reduction, or a multiply
+  // (and possibly subtract) used by a partial reduction.
+  return all_of(Extend->users(), [](const VPUser *U) {
+    // Look through a (possible) multiply.
+    if (const VPWidenRecipe *I = dyn_cast_if_present<VPWidenRecipe>(U)) {
----------------
david-arm wrote:

Hmm, whilst this may be true for AArch64, I wonder if it's correct in general to assume that a partial reduction by definition folds a mul into a udot. It's my understanding that at the IR level partial reductions are far more abstract than just a udot or sdot: we're simply partially reducing a set of values into a smaller set. It's quite conceivable that a target supports this in a way that doesn't involve muls, e.g. an instruction that sums each group of 4 bytes of an input and accumulates into a 32-bit result. In that case the mul is not free. At the moment this does look like we're taking an AArch64 cost model and using it in a general way for everyone.
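
To make the distinction concrete, here is a minimal scalar sketch of the two kinds of target instruction being contrasted. It is only an illustration: `partialReduceAdd`, `partialReduceDot`, and `Acc` are hypothetical names, not LLVM APIs, and the grouping of four adjacent bytes per accumulator lane is an assumption chosen to mirror udot (the IR-level partial reduction does not mandate a particular lane grouping).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Four 32-bit accumulator lanes (hypothetical type for this sketch).
using Acc = std::array<uint32_t, 4>;

// Hypothetical "sum of bytes" target instruction: each accumulator lane
// receives the sum of four zero-extended input bytes. No multiply is
// involved, so only the extend folds into the instruction.
Acc partialReduceAdd(Acc acc, const std::vector<uint8_t> &a) {
  assert(a.size() == 16 && "sketch assumes 16 input bytes");
  for (std::size_t i = 0; i < a.size(); ++i)
    acc[i / 4] += static_cast<uint32_t>(a[i]);
  return acc;
}

// udot-style target instruction: both the extends and the multiply fold
// into the one instruction.
Acc partialReduceDot(Acc acc, const std::vector<uint8_t> &a,
                     const std::vector<uint8_t> &b) {
  assert(a.size() == 16 && b.size() == 16 && "sketch assumes 16 input bytes");
  for (std::size_t i = 0; i < a.size(); ++i)
    acc[i / 4] += static_cast<uint32_t>(a[i]) * static_cast<uint32_t>(b[i]);
  return acc;
}
```

On a target that only offers something like `partialReduceAdd`, a `partial_reduce(mul(ext, ext), phi)` pattern would still need a separate widening multiply, so treating the extends feeding that multiply as free would presumably under-cost the loop.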

https://github.com/llvm/llvm-project/pull/134074