[llvm] [SLP] Vectorize non-power-of-2 ops. (PR #77790)

Fri Jan 19 11:52:34 PST 2024

================
@@ -668,21 +702,32 @@ entry:
 }
 
 define void @add1fn(ptr noalias %dst, ptr noalias %src) {
-; CHECK-LABEL: @add1fn(
-; CHECK-NEXT:  entry:
-; CHECK-NEXT:    [[INCDEC_PTR:%.*]] = getelementptr inbounds float, ptr [[SRC:%.*]], i64 1
-; CHECK-NEXT:    [[TMP0:%.*]] = load float, ptr [[SRC]], align 4
-; CHECK-NEXT:    [[INCDEC_PTR1:%.*]] = getelementptr inbounds float, ptr [[DST:%.*]], i64 1
-; CHECK-NEXT:    store float [[TMP0]], ptr [[DST]], align 4
-; CHECK-NEXT:    [[INCDEC_PTR5:%.*]] = getelementptr inbounds float, ptr [[SRC]], i64 3
-; CHECK-NEXT:    [[INCDEC_PTR7:%.*]] = getelementptr inbounds float, ptr [[DST]], i64 3
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x float>, ptr [[INCDEC_PTR]], align 4
-; CHECK-NEXT:    [[TMP3:%.*]] = fadd <2 x float> [[TMP2]], <float 1.000000e+00, float 2.000000e+00>
-; CHECK-NEXT:    store <2 x float> [[TMP3]], ptr [[INCDEC_PTR1]], align 4
-; CHECK-NEXT:    [[TMP5:%.*]] = load float, ptr [[INCDEC_PTR5]], align 4
-; CHECK-NEXT:    [[ADD9:%.*]] = fadd float [[TMP5]], 3.000000e+00
-; CHECK-NEXT:    store float [[ADD9]], ptr [[INCDEC_PTR7]], align 4
-; CHECK-NEXT:    ret void
+; NON-POW2-LABEL: @add1fn(
+; NON-POW2-NEXT:  entry:
+; NON-POW2-NEXT:    [[INCDEC_PTR:%.*]] = getelementptr inbounds float, ptr [[SRC:%.*]], i64 1
+; NON-POW2-NEXT:    [[TMP0:%.*]] = load float, ptr [[SRC]], align 4
+; NON-POW2-NEXT:    [[INCDEC_PTR1:%.*]] = getelementptr inbounds float, ptr [[DST:%.*]], i64 1
+; NON-POW2-NEXT:    store float [[TMP0]], ptr [[DST]], align 4
+; NON-POW2-NEXT:    [[TMP1:%.*]] = load <3 x float>, ptr [[INCDEC_PTR]], align 4
+; NON-POW2-NEXT:    [[TMP2:%.*]] = fadd <3 x float> [[TMP1]], <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00>
+; NON-POW2-NEXT:    store <3 x float> [[TMP2]], ptr [[INCDEC_PTR1]], align 4
----------------
fhahn wrote:

(Moved comment  here https://github.com/llvm/llvm-project/commit/0ddbdd3052e694eedef6bccce3b17f92dff7add7#r137302312)
> Is this supported on all platforms? Or you rely on the check that this will end up with the scalarized version in the end?
 
I checked on AArch64 and X86 and both support OK lowering of such operations, although in a number of cases improvements can be made, e.g. #78632 and #78637 for loads and stores of <3 x i8> on AArch64.

This effectively relies on the cost-models to return accurate costs for non-power-of-2 vector operations. If a vector op of a non-power-of-2 type will get scalarizied in the backend, the cost model should return a high cost.


https://github.com/llvm/llvm-project/pull/77790