[llvm] [LV] Enable considering higher VFs when data extend ops are present i… (PR #137593)

Tue May 13 06:09:03 PDT 2025

================
@@ -362,10 +362,15 @@ AArch64TTIImpl::getInlineCallPenalty(const Function *F, const CallBase &Call,
 }
 
 bool AArch64TTIImpl::shouldMaximizeVectorBandwidth(
-    TargetTransformInfo::RegisterKind K) const {
+    TargetTransformInfo::RegisterKind K, const unsigned WidestType,
+    const unsigned SmallestType) const {
   assert(K != TargetTransformInfo::RGK_Scalar);
-  return (K == TargetTransformInfo::RGK_FixedWidthVector &&
-          ST->isNeonAvailable());
+  // For loops with extend operations e.g. zext, sext etc., limiting the max VF
+  // based on widest type inhibits considering higher VFs even though
+  // vectorizing with higher VF might be profitable. In such cases, we should
+  // limit the max VF based on smallest type and the decision whether a
+  // particular VF is beneficial or not be left to cost model.
+  return WidestType != SmallestType;
----------------
huntergr-arm wrote:

This PR conflicts somewhat with current (undocumented) plans. The intent is to enable max bandwidth by default for the scalable vector register kind as soon as we've fixed some regressions we're aware of (at least for cores implementing SVE2 or higher; specifics TBD). I tried enabling this before, but reverted it once we found the regressions. We don't need to know the smallest and widest types to do so, because we're improving the cost model to reject suboptimal VFs.
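
For reference, the difference between bounding the max VF by the widest type and by the smallest type comes down to simple arithmetic. The sketch below is only an illustration: `maxElementsPerRegister` and `computeVFBounds` are hypothetical helpers, not LoopVectorize or TTI APIs, and the power-of-two rounding is an assumption about how the bound is clamped.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): the largest power-of-two number of
// elements of the given width that fits in one vector register.
static unsigned maxElementsPerRegister(unsigned RegisterBits,
                                       unsigned ElementBits) {
  assert(ElementBits != 0 && "element width must be non-zero");
  unsigned Count = RegisterBits / ElementBits;
  unsigned PowerOfTwo = 1;
  while (PowerOfTwo * 2 <= Count)
    PowerOfTwo *= 2;
  return PowerOfTwo;
}

// Hypothetical result type: the two candidate upper bounds on the VF.
struct VFBounds {
  unsigned ByWidestType;   // bound used when bandwidth is not maximized
  unsigned BySmallestType; // bound used when bandwidth is maximized
};

static VFBounds computeVFBounds(unsigned RegisterBits, unsigned WidestTypeBits,
                                unsigned SmallestTypeBits) {
  return {maxElementsPerRegister(RegisterBits, WidestTypeBits),
          maxElementsPerRegister(RegisterBits, SmallestTypeBits)};
}
```

With a 128-bit register and a loop that extends i8 to i32, `computeVFBounds(128, 32, 8)` gives a bound of 4 by widest type and 16 by smallest type. Whether any VF above 4 is actually chosen is still up to the cost model.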

Some PRs that should hopefully let us enable maxbw by default once they all land:
* #137746 -- Allows VPlan to bundle sequences of operations into a meta-recipe (VPMulAccumulateReductionRecipe) before modeling the cost.
* #113903 -- Implements cost modeling for the above PR (or will once rebased, since the work was split up).
* #136997 -- Extends VPMulAccumulateReductionRecipe to handle differing extension types, which is needed to support usdot instructions.
* #132190 -- Prunes VPlans with wider VFs if the estimated register pressure would be too high. Doing it at this point, after we know about partial reductions, lets us model things better instead of assuming, as the legacy cost model does now, that phi nodes with too-wide types take up multiple registers.
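
To make the register-pressure point concrete, here is a rough sketch of the arithmetic involved. It is not the logic in #132190: `registersForValue` and `exceedsRegisterBudget` are hypothetical names, and the model (every live value has the same element type and occupies whole registers) is a deliberate simplification.

```cpp
#include <cassert>

// Hypothetical helper: number of vector registers needed to hold one value
// of VF elements, each ElementBits wide, rounded up to whole registers.
static unsigned registersForValue(unsigned VF, unsigned ElementBits,
                                  unsigned RegisterBits) {
  assert(RegisterBits != 0 && "register width must be non-zero");
  return (VF * ElementBits + RegisterBits - 1) / RegisterBits;
}

// Hypothetical pruning check: true if LiveValues simultaneously live values
// at this VF would need more registers than the target provides.
static bool exceedsRegisterBudget(unsigned VF, unsigned ElementBits,
                                  unsigned LiveValues, unsigned RegisterBits,
                                  unsigned NumRegisters) {
  return LiveValues * registersForValue(VF, ElementBits, RegisterBits) >
         NumRegisters;
}
```

For example, with 128-bit registers an i32 phi at VF 16 takes four registers under this model, so ten such live values would not fit in 32 registers, while the same loop at VF 4 needs only ten. If a partial reduction keeps the accumulator at one register, the wider VF stops looking so expensive, which is why the ordering matters.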

We'll probably need a few more improvements later as we run more benchmarks, but those PRs cover the basic mechanisms needed for now.

https://github.com/llvm/llvm-project/pull/137593