[llvm] [AArch64] Improve codegen for some fixed-width partial reductions (PR #126529)

Mon Feb 10 07:29:13 PST 2025

================
@@ -16866,9 +16866,14 @@ bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(
     // mul(zext(i8), sext) can be transformed into smull(zext, sext) which
     // performs one extend implicitly. If DstWidth is at most 4 * SrcWidth, at
     // most one extra extend step is needed and using tbl is not profitable.
+    // Similarly, bail out if partial_reduce(acc, zext(i8)) can be lowered to a
+    // udot instruction.
     if (SrcWidth * 4 <= DstWidth && I->hasOneUser()) {
       auto *SingleUser = cast<Instruction>(*I->user_begin());
-      if (match(SingleUser, m_c_Mul(m_Specific(I), m_SExt(m_Value()))))
+      if (match(SingleUser, m_c_Mul(m_Specific(I), m_SExt(m_Value()))) ||
+          (isa<IntrinsicInst>(SingleUser) &&
+           !shouldExpandPartialReductionIntrinsic(
----------------
david-arm wrote:

Currently `shouldExpandPartialReductionIntrinsic` does not check whether the target actually supports udot/sdot, but the loop vectoriser should not be generating partial reduction intrinsic calls when that support is missing.
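
If we ever wanted to make that assumption explicit rather than rely on the vectoriser, the condition could be guarded by a feature check as well. Below is a rough, self-contained sketch of that logic only. `ToyTarget`, `ToyUser`, `HasDotProd`, `TypeWouldBeExpanded` and `skipTblForPartialReduce` are hypothetical stand-ins invented for illustration; they are not LLVM types or APIs, and this is not what the patch does.

```cpp
// Hypothetical model of the target; HasDotProd is an assumed flag standing in
// for "udot/sdot are available".
struct ToyTarget {
  bool HasDotProd;
};

// Hypothetical model of the single user of the zext.
struct ToyUser {
  bool IsPartialReduceAdd;  // the user is a partial-reduction intrinsic call
  bool TypeWouldBeExpanded; // models a type-only "should expand" answer
};

// Returns true when the tbl-based extend lowering should be skipped because
// the single user is expected to become a dot-product instruction.
// The last term is the extra guard: a type-only check says nothing about
// whether the target can actually select udot/sdot.
constexpr bool skipTblForPartialReduce(const ToyTarget &T, const ToyUser &U) {
  return U.IsPartialReduceAdd && !U.TypeWouldBeExpanded && T.HasDotProd;
}

// With dot-product support and a type that is not expanded, skip tbl.
static_assert(skipTblForPartialReduce(ToyTarget{true}, ToyUser{true, false}),
              "expected to bail out of the tbl lowering");
// Same user on a target without dot-product support: keep the tbl lowering.
static_assert(!skipTblForPartialReduce(ToyTarget{false}, ToyUser{true, false}),
              "expected to keep the tbl lowering");
```

Without the last term, the second case would also bail out, which is only harmless as long as nothing creates the intrinsic on such targets.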

https://github.com/llvm/llvm-project/pull/126529