[PATCH] D90781: [ARM] remove cost-kind predicate for cmp/sel costs

Thu Nov 5 08:44:43 PST 2020

dmgreen accepted this revision.
dmgreen added a comment.
This revision is now accepted and ready to land.

This looks fine to me, if Sam doesn't disagree. The costs look better to me, and the MVE costs should be our problem not yours :)

================
Comment at: llvm/test/Analysis/CostModel/ARM/intrinsic-cost-kinds.ll:220
 ; SIZE_LATE-LABEL: 'reduce_fmax'
-; SIZE_LATE-NEXT:  Cost Model: Found an estimated cost of 620 for instruction: %v = call float @llvm.vector.reduce.fmax.v16f32(<16 x float> %va)
+; SIZE_LATE-NEXT:  Cost Model: Found an estimated cost of 628 for instruction: %v = call float @llvm.vector.reduce.fmax.v16f32(<16 x float> %va)
 ; SIZE_LATE-NEXT:  Cost Model: Found an estimated cost of 1 for instruction: ret void
----------------
spatel wrote:
> samparker wrote:
> > I know this is a tiny change, but it's drawn my attention because it's so high the number is so high. Does this look right to you @dmgreen ? I would have thought we were able to break this up more efficiently with our native support, or is this because we'd have to copy the GPR registers into an FPRs to perform some final scalar maxs..?
> Stepping through this, the cost is derived from the BasicTTIImpl calling back to the target for shuffle+cmp+sel:
> 
> ```
>       // Assume the pairwise shuffles add a cost.
>       ShuffleCost +=
>           (IsPairwise + 1) * thisT()->getShuffleCost(TTI::SK_ExtractSubvector,
>                                                      Ty, NumVecElts, SubTy);
>       MinMaxCost +=
>           thisT()->getCmpSelInstrCost(CmpOpcode, SubTy, CondTy,
>                                       CmpInst::BAD_ICMP_PREDICATE, CostKind) +
>           thisT()->getCmpSelInstrCost(Instruction::Select, SubTy, CondTy,
>                                       CmpInst::BAD_ICMP_PREDICATE, CostKind);
> 
> ```
> 
> And this progresses for v16f32 as: 384 -> 480 -> 608 for the shuffles and 8 -> 12 -> 20 for the cmp/sel, so 608 + 20 = 628.
> The shuffle cost seems to be expanded as a series of insert/extract based on the number of elements in the vector, so it's exploding.
Our scalarization overhead is set very high, super high to protect us against regressions. I think it's actually O(n^2). It ends up meaning that wide vectors get very high costs, and was useful early on when we were getting some pretty ropy codegen. We have plans to set it to something more sensible but every time I try it still gives regressions somewhere. We only have 8 registers so sometimes these high costs are useful, especially in the vectorizer.

It sounds like we could do with some better SK_ExtractSubvector costs for extracting simple legal vector, although it may be better to just have custom reduction costs in this specific case, as I don't think that code will model how we actually lower reductions. I will try and look into that at some point, but for the moment they are very deliberately very wrong :)

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D90781/new/

https://reviews.llvm.org/D90781