[PATCH] D155459: [AArch64] Change the cost of vector insert/extract to 2

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jul 19 00:27:22 PDT 2023


dmgreen added a comment.

> A lot of the numbers for intrinsics are pretty clearly off by a very large amount.  If nobody is going to look at them, maybe we should just kill off the tests in question so reviewers don't have to read meaningless updates to them?

A lot of them are at least roughly correct even if they look odd, and it is good to have the test coverage.



================
Comment at: llvm/test/Analysis/CostModel/AArch64/bswap.ll:45
 }
 
 define <2 x i32> @bswap_v2i32(<2 x i32> %a) {
----------------
SjoerdMeijer wrote:
> Just a bit of a drive by question first. It's not really caused by this change, I think, but it looks like the cost modelling was already a bit off for these bswaps? 
> 
> https://godbolt.org/z/d1s4ToP1G
> 
> Or am I missing something?
I don't know if anyone has looked into getting vector bswap costs correct in the past, only the scalar ones. These will just be costed as if they were scalarized, so the changes here are really just saying that the scalarization overhead is changing a little.

I can look into improving some of them, but like you say it's a bit unrelated to this change.
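For context, "costed as if scalarized" means the model charges for extracting each lane, byte-swapping it as a scalar, and re-inserting it. A rough sketch of that per-lane lowering in C (the helper names here are mine for illustration, not LLVM's):

```c
#include <stdint.h>

/* Scalar 32-bit byte swap, as a single REV instruction would do it. */
static uint32_t bswap32(uint32_t x) {
  return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
         ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical scalarized lowering of llvm.bswap.v2i32: each lane is
 * extracted, swapped as a scalar, and inserted back, so the cost is
 * roughly 2 * (extract + scalar bswap + insert). */
static void bswap_v2i32(uint32_t v[2]) {
  for (int i = 0; i < 2; i++)
    v[i] = bswap32(v[i]);
}
```

So lowering the insert/extract cost from 3 to 2 directly shrinks these scalarization-based costs, even though the bswap cost itself is unchanged.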


================
Comment at: llvm/test/Analysis/CostModel/AArch64/fshr.ll:183
 ; CHECK-LABEL: 'fshr_v4i32_3rd_arg_var'
-; CHECK-NEXT:  Cost Model: Found an estimated cost of 36 for instruction: %fshr = tail call <4 x i32> @llvm.fshr.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 34 for instruction: %fshr = tail call <4 x i32> @llvm.fshr.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret <4 x i32> %fshr
----------------
efriedma wrote:
> Weird cost modeling.
Yep, I think we have only looked at cost modelling for constant funnel shifts, and those were added fairly recently. I believe the codegen for the variable case should also be improved.
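For reference, the per-element semantics being costed here: llvm.fshr concatenates the two operands, shifts the pair right by the third operand modulo the bit width, and returns the low half. A scalar C sketch (a vector fshr does this independently per lane):

```c
#include <stdint.h>

/* Sketch of llvm.fshr.i32 with a variable shift amount: the 64-bit
 * concatenation a:b is shifted right by c mod 32 and the low 32 bits
 * are returned. */
static uint32_t fshr32(uint32_t a, uint32_t b, uint32_t c) {
  c &= 31;
  if (c == 0)
    return b; /* avoid the undefined behaviour of shifting by 32 */
  return (b >> c) | (a << (32 - c));
}
```

With a variable `c` per lane, each element needs its own modular shift, which is why the scalarized cost model charges for extracts and inserts on all three operands.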


================
Comment at: llvm/test/Analysis/CostModel/AArch64/getIntrinsicInstrCost-vector-reverse.ll:25
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 28 for instruction: %15 = call <8 x bfloat> @llvm.experimental.vector.reverse.v8bf16(<8 x bfloat> undef)
+; CHECK-NEXT:  Cost Model: Found an estimated cost of 56 for instruction: %16 = call <16 x bfloat> @llvm.experimental.vector.reverse.v16bf16(<16 x bfloat> undef)
 ; CHECK-NEXT:  Cost Model: Found an estimated cost of 0 for instruction: ret void
----------------
efriedma wrote:
> Weird cost modeling.
I agree, but the codegen looks odd without +bf16 too: https://godbolt.org/z/oTG5ae6nP


================
Comment at: llvm/test/Analysis/CostModel/AArch64/shuffle-select.ll:35
 ; COST-LABEL: sel.v8i16
-; COST:       Found an estimated cost of 42 for instruction: %tmp0 = shufflevector <8 x i16> %v0, <8 x i16> %v1, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
+; COST:       Found an estimated cost of 28 for instruction: %tmp0 = shufflevector <8 x i16> %v0, <8 x i16> %v1, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
 ; CODE-LABEL: sel.v8i16
----------------
efriedma wrote:
> Weird cost modeling.
Do you mean because of the tbl? We have never costed tbls as cheap. I'm not sure if that would be profitable or not, and feels very much like a different issue.


================
Comment at: llvm/test/Analysis/CostModel/AArch64/vector-select.ll:692
 ; COST-LABEL: v8f16_select_une
-; COST-NOFP16-NEXT:  Cost Model: Found an estimated cost of 29 for instruction:   %cmp.1 = fcmp une <8 x half> %a, %b
+; COST-NOFP16-NEXT:  Cost Model: Found an estimated cost of 22 for instruction:   %cmp.1 = fcmp une <8 x half> %a, %b
 ; COST-NOFP16-NEXT:  Cost Model: Found an estimated cost of 2 for instruction:   %s.1 = select <8 x i1> %cmp.1, <8 x half> %a, <8 x half> %c
----------------
efriedma wrote:
> Cost modeling is weird.
Because it is too low? It is scalarized without +fullfp16. That codegen could be better, and it looks like the cost is a bit low, not accounting for the scalarization cost of the extracts. I don't think we have focussed much in the past on fp16 code compiled without +fullfp16.


================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/strict-fadd-cost.ll:53
 ; CHECK-VF4: Found an estimated cost of 1 for VF 4 For instruction:   %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
-; CHECK-VF8: Found an estimated cost of 38 for VF 8 For instruction:   %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
+; CHECK-VF8: Found an estimated cost of 32 for VF 8 For instruction:   %muladd = tail call float @llvm.fmuladd.f32(float %0, float %1, float %sum.07)
 
----------------
efriedma wrote:
> This cost modeling is weird.
This is an in-order reduction, so the lanes have to be accumulated sequentially rather than as a tree.
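To illustrate why the VF 8 cost is so much higher than VF 4: a strict (in-order) fmuladd reduction must preserve the serial floating-point association, so it lowers to a chain of dependent operations rather than a parallel tree. A C sketch (the function name is mine, just to show the dependency chain):

```c
/* In-order fmuladd reduction: each step depends on the previous sum,
 * so the cost scales with the vectorization factor instead of log2(VF). */
static float strict_fmuladd_reduce(const float *a, const float *b,
                                   int n, float init) {
  float sum = init;
  for (int i = 0; i < n; i++)
    sum = a[i] * b[i] + sum; /* serial chain; no reassociation allowed */
  return sum;
}
```

For VF 8 the model effectively charges for eight such dependent lane operations, which is why the cost looks large but is not actually weird.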


================
Comment at: llvm/test/Transforms/SLPVectorizer/AArch64/slp-fma-loss.ll:204
 
 ; Test case where not vectorizing is more profitable because multiple
 ; fmul/{fadd,fsub} pairs can be lowered to fma instructions.
----------------
efriedma wrote:
> Regression?
I had looked into these. This test case was added without the underlying issue being fixed (fusing fmul and fadd). The tests were changed by the increased cost of ld1r instructions.

In this case it is just profitable again now. You can see it picks 2x vectorization though, not 4x, which seems to come from the `insertelement <2 x float> <float poison, float 3.000000e+00>, float [[X:%.*]], i32 0`, which is counted as the cost of a constant vector plus an insert of x into the bottom lane, both of which are incorrectly costed as zero.


================
Comment at: llvm/test/Transforms/VectorCombine/AArch64/load-extractelement-scalarization.ll:521
 ; Scalarizing the load for multiple extracts is profitable in this case,
 ; because the vector large vector requires 2 vector registers.
 define i32 @load_multiple_extracts_with_constant_idx_profitable(ptr %x) {
----------------
efriedma wrote:
> Regression?
The cost of an extract of lane zero is still 0 (which is known to be wrong, but doesn't look like something we can change without causing too many regressions). I was really hoping to remove it for integer types at the same time as this patch, but it looks like that causes too many problems. I'm hoping it can be improved in the future, which will hopefully be easier once the base scalar cost is lower.

This already seems to be handled by instruction selection (https://godbolt.org/z/7GEcxo8WT), so it shouldn't be a problem on its own. I can change the test to use lane 1 to show it still applies to other lanes.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155459/new/

https://reviews.llvm.org/D155459
