[PATCH] D42981: [COST] Fix cost model of load instructions on X86

Alexey Bataev via Phabricator via llvm-commits <llvm-commits at lists.llvm.org>
Thu Feb 25 06:44:56 PST 2021


ABataev added a comment.

Roman, yes, this patch lowers the cost of scalar load instructions and thereby reduces the difference between the costs of the vector and scalar variants of the code. On x86, in some cases scalar instructions with memory accesses are more profitable than the vector variant of the same instructions (a load + binop pair, actually).
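
For reference, one way to inspect the numbers involved is to run the cost-model analysis over a scalar load + binop pair and its vector counterpart. This is a hand-written snippet, not taken from the patch; the RUN line uses the legacy pass-manager cost-model printer, and the reported costs depend on the subtarget:

  ; Illustrative only, not part of the patch; costs depend on -mattr/-mcpu.
  ; RUN: opt < %s -cost-model -analyze -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2
  define i64 @scalar_load_add(i64* %p, i64 %x) {
    %v = load i64, i64* %p, align 8                ; scalar load whose cost is lowered
    %r = add i64 %v, %x                            ; binop paired with the load
    ret i64 %r
  }
  define <2 x i64> @vector_load_add(<2 x i64>* %p, <2 x i64> %x) {
    %v = load <2 x i64>, <2 x i64>* %p, align 16   ; vector variant of the same pattern
    %r = add <2 x i64> %v, %x
    ret <2 x i64> %r
  }

With the lowered scalar load cost, the scalar pair and the vector pair come out much closer, which is why some of the vectorizer tests below flip to scalar code.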



================
Comment at: llvm/test/Transforms/LoopVectorize/X86/interleaving.ll:35-47
+; AVX-NEXT:    [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; AVX-NEXT:    [[TMP0:%.*]] = shl nuw nsw i64 [[INDVARS_IV]], 1
+; AVX-NEXT:    [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
+; AVX-NEXT:    [[TMP1:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
+; AVX-NEXT:    [[TMP2:%.*]] = or i64 [[TMP0]], 1
+; AVX-NEXT:    [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP2]]
+; AVX-NEXT:    [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX3]], align 4
----------------
lebedev.ri wrote:
> No longer vectorized for AVX? To me that reads as if we've increased the cost of vector variant.
Yes, it looks as if the cost of the vector code increased, but actually the patch lowers the difference between the cost of the scalar code and the vector code. Most probably this is a corner case where the cost of the vectorized code is now the same as the cost of the scalar code, so vectorization is no longer considered profitable.


================
Comment at: llvm/test/Transforms/SLPVectorizer/X86/load-merge-inseltpoison.ll:156
 ; CHECK-NEXT:    [[Q3:%.*]] = getelementptr inbounds i64, i64* [[Q]], i64 3
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i64* [[P0]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 2
-; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i64* [[P2]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP4:%.*]] = load <2 x i64>, <2 x i64>* [[TMP3]], align 2
-; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i64* [[Q0]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x i64>, <2 x i64>* [[TMP5]], align 2
-; CHECK-NEXT:    [[TMP7:%.*]] = bitcast i64* [[Q2]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP8:%.*]] = load <2 x i64>, <2 x i64>* [[TMP7]], align 2
-; CHECK-NEXT:    [[TMP9:%.*]] = sub nsw <2 x i64> [[TMP2]], [[TMP6]]
-; CHECK-NEXT:    [[TMP10:%.*]] = sub nsw <2 x i64> [[TMP4]], [[TMP8]]
-; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
-; CHECK-NEXT:    [[G0:%.*]] = getelementptr inbounds i32, i32* [[R:%.*]], i64 [[TMP11]]
-; CHECK-NEXT:    [[TMP12:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
-; CHECK-NEXT:    [[G1:%.*]] = getelementptr inbounds i32, i32* [[R]], i64 [[TMP12]]
-; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <2 x i64> [[TMP10]], i32 0
-; CHECK-NEXT:    [[G2:%.*]] = getelementptr inbounds i32, i32* [[R]], i64 [[TMP13]]
-; CHECK-NEXT:    [[TMP14:%.*]] = extractelement <2 x i64> [[TMP10]], i32 1
-; CHECK-NEXT:    [[G3:%.*]] = getelementptr inbounds i32, i32* [[R]], i64 [[TMP14]]
+; CHECK-NEXT:    [[X0:%.*]] = load i64, i64* [[P0]], align 2
+; CHECK-NEXT:    [[X1:%.*]] = load i64, i64* [[P1]], align 2
----------------
lebedev.ri wrote:
> Same, no longer vectorized?
> 
> 
Yes, because here it is more profitable to execute the scalar code than the vector variant.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D42981/new/

https://reviews.llvm.org/D42981
