[PATCH] D155459: [AArch64] Change the cost of vector insert/extract to 2

Wed Jul 19 01:24:23 PDT 2023

efriedma added inline comments.

================
Comment at: llvm/test/Analysis/CostModel/AArch64/getIntrinsicInstrCost-vector-reverse.ll:25
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %15 = call <8 x bfloat> @llvm.experimental.vector.reverse.v8bf16(<8 x bfloat> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %16 = call <16 x bfloat> @llvm.experimental.vector.reverse.v16bf16(<16 x bfloat> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
----------------
dmgreen wrote:
> efriedma wrote:
> > Weird cost modeling.
> I agree, but the codegen looks odd without +bf16 too: https://godbolt.org/z/oTG5ae6nP
Still way overestimating the cost... but yes.

================
Comment at: llvm/test/Analysis/CostModel/AArch64/shuffle-select.ll:35
 ; COST-LABEL: sel.v8i16
-; COST:       Found an estimated cost of 42 for instruction: %tmp0 = shufflevector <8 x i16> %v0, <8 x i16> %v1, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
+; COST:       Found an estimated cost of 28 for instruction: %tmp0 = shufflevector <8 x i16> %v0, <8 x i16> %v1, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
 ; CODE-LABEL: sel.v8i16
----------------
dmgreen wrote:
> efriedma wrote:
> > Weird cost modeling.
> Do you mean because of the tbl? We have never costed tbls as cheap. I'm not sure if that would be profitable or not, and feels very much like a different issue.
We should be modeling the fact that tbl exists, at least.  (I mean, it doesn't need to be super-cheap, but basically all ARM chips have a reasonably fast tbl.)

================
Comment at: llvm/test/Analysis/CostModel/AArch64/vector-select.ll:692
 ; COST-LABEL: v8f16_select_une
-; COST-NOFP16-NEXT:  Cost Model: Found an estimated cost of 29 for instruction:   %cmp.1 = fcmp une <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found an estimated cost of 22 for instruction:   %cmp.1 = fcmp une <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:   %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
----------------
dmgreen wrote:
> efriedma wrote:
> > Cost modeling is weird.
> Because it is too low? It is scalarized without +fullfp16. That codegen could be better, and it looks like the cost is a bit low, not accounting for the scalarization cost of the extracts. I don't think we have focussed much in the past on the combination of fp16 code without fullfp16.
Wait, we scalarize this?  I thought I checked this, but must not have.  We really shouldn't scalarize, though.

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll:53
 ; CHECK-VF4: Found an estimated cost of 1 for VF 4 For instruction:   %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
-; CHECK-VF8: Found an estimated cost of 38 for VF 8 For instruction:   %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+; CHECK-VF8: Found an estimated cost of 32 for VF 8 For instruction:   %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)

----------------
dmgreen wrote:
> efriedma wrote:
> > This cost modeling is weird.
> This is an in-order reduction.
Then why does it cost 1 at VF 4?

================
Comment at: llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll:521
 ; Scalarizing the load for multiple extracts is profitable in this case,
 ; because the vector large vector requires 2 vector registers.
 define i32 @load_multiple_extracts_with_constant_idx_profitable(ptr %x) {
----------------
dmgreen wrote:
> efriedma wrote:
> > Regression?
> The cost of an extract of lane zero is still 0 (which is known to be wrong but doesn't look like something we can change without causing too many regressions. I was really hoping to remove it for integer type at the same time as this, but it looks like it causes too many problems to remove. I'm hoping that can be improved in the future, and that will hopefully be easier if the base scalar cost is lower).
> 
> This seems to already handled by instruction selection https://godbolt.org/z/7GEcxo8WT, so shouldnt be a problem on its own. I can change the test to use lane 1 to show it still applies for other lanes.
Okay.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155459/new/

https://reviews.llvm.org/D155459