[llvm] [AArch64][CostModel] Reduce cost of wider than legal get.active.lane.mask (PR #163786)

Tue Oct 21 03:34:24 PDT 2025

================
@@ -957,10 +957,24 @@ AArch64TTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
     return TyL.first + ExtraCost;
   }
   case Intrinsic::get_active_lane_mask: {
-    auto *RetTy = dyn_cast<FixedVectorType>(ICA.getReturnType());
-    if (RetTy) {
-      EVT RetVT = getTLI()->getValueType(DL, RetTy);
-      EVT OpVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+    auto RetTy = cast<VectorType>(ICA.getReturnType());
+    EVT RetVT = getTLI()->getValueType(DL, RetTy);
+    EVT OpVT = getTLI()->getValueType(DL, ICA.getArgTypes()[0]);
+    if (RetTy->isScalableTy()) {
+      // When SVE2p1 or SME2 is available, get_active_lane_mask will lower
+      // to the sve_whilelo_x2 intrinsic which returns a predicate pair.
+      // This means we can halve getTypeLegalizationCost, since the
+      // predicate pair intrinsic will split the result, e.g.
+      //   nxv32i1 = get_active_lane_mask(base, idx) ->
+      //    {nxv16i1, nxv16i1} = sve_whilelo_x2(base, idx)
+      if (getTLI()->shouldExpandGetActiveLaneMask(RetVT, OpVT) ||
+          (!ST->hasSVE2p1() && !ST->hasSME2()) ||
+          TLI->getTypeAction(RetTy->getContext(), RetVT) !=
+              TargetLowering::TypeSplitVector)
+        break;
+      auto LT = getTypeLegalizationCost(RetTy);
+      return LT.first / 2;
----------------
david-arm wrote:

Do we also need to add the fixed cost of generating the saturating adds for each part? For example, we may need N-1 saturating adds (each an `adds` plus a `csinv` instruction), where N is the number of while instructions actually required.
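
To make the question concrete, here is a rough standalone sketch of how that extra cost might be counted. The names `activeLaneMaskCost` and `SatAddCost`, the unit cost per while instruction, and the round-up when pairing parts are all assumptions for illustration; none of this is code from the patch or an existing LLVM API.

```cpp
#include <cassert>

// Assumption: one saturating add costs two instructions (adds + csinv).
constexpr unsigned SatAddCost = 2;

// Hypothetical helper, not part of LLVM.
// NumLegalParts: number of legal predicate parts after type legalization
//                (what LT.first would report for the result type).
// HasPairWhile:  true if a single while instruction can produce two parts
//                (the SVE2p1/SME2 predicate-pair case).
unsigned activeLaneMaskCost(unsigned NumLegalParts, bool HasPairWhile) {
  assert(NumLegalParts > 0 && "expected at least one legal part");
  // Assumption: with the pair form, every two parts share one while
  // instruction, rounding up for an odd number of parts.
  unsigned NumWhiles = HasPairWhile ? (NumLegalParts + 1) / 2 : NumLegalParts;
  // Every while instruction after the first needs its own offset base,
  // i.e. N-1 saturating adds for N while instructions.
  unsigned NumSatAdds = NumWhiles - 1;
  return NumWhiles + NumSatAdds * SatAddCost;
}
```

Under these assumptions an nxv32i1 result with the pair form needs one while instruction and no saturating add, whereas an nxv64i1 result would need two while instructions plus one saturating add, so `LT.first / 2` alone would undercount the latter.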

https://github.com/llvm/llvm-project/pull/163786