[llvm] [LV][VPlan] Add initial support for CSA vectorization (PR #106560)

Fri Oct 11 07:34:54 PDT 2024

================
@@ -0,0 +1,2266 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt < %s -S -mtriple=riscv64 -mattr=+v -riscv-v-vector-bits-min=256 \
+; RUN:   -passes=loop-vectorize -force-tail-folding-style=data-with-evl \
+; RUN:   -prefer-predicate-over-epilogue=predicate-dont-vectorize  \
+; RUN:   -enable-csa-vectorization | FileCheck %s -check-prefixes=CHECK,EVL
+; RUN: opt < %s -S -passes=loop-vectorize -mtriple=riscv64 -mattr=+v \
+; RUN:   -force-tail-folding-style=none -enable-csa-vectorization \
+; RUN:   | FileCheck %s -check-prefixes=CHECK,NO-EVL
+
+; This function is generated from the following C/C++ program:
+; int simple_csa_int_select(int N, int *data, int a) {
+;   int t = -1;
+;   for (int i = 0; i < N; i++) {
+;     if (a < data[i])
+;       t = data[i];
+;   }
+;   return t; // use t
+; }
+define i32 @simple_csa_int_select(i32 %N, ptr %data, i64 %a) {
+; EVL-LABEL: define i32 @simple_csa_int_select(
+; EVL-SAME: i32 [[N:%.*]], ptr [[DATA:%.*]], i64 [[A:%.*]]) #[[ATTR0:[0-9]+]] {
+; EVL-NEXT:  [[ENTRY:.*]]:
+; EVL-NEXT:    [[CMP9:%.*]] = icmp sgt i32 [[N]], 0
+; EVL-NEXT:    br i1 [[CMP9]], label %[[LOOP_PREHEADER:.*]], label %[[EXIT:.*]]
+; EVL:       [[LOOP_PREHEADER]]:
+; EVL-NEXT:    [[WIDE_TRIP_COUNT:%.*]] = zext i32 [[N]] to i64
+; EVL-NEXT:    br i1 false, label %[[SCALAR_PH:.*]], label %[[VECTOR_PH:.*]]
+; EVL:       [[VECTOR_PH]]:
+; EVL-NEXT:    [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT:    [[TMP17:%.*]] = mul i64 [[TMP16]], 4
+; EVL-NEXT:    [[TMP2:%.*]] = sub i64 [[TMP17]], 1
+; EVL-NEXT:    [[N_RND_UP:%.*]] = add i64 [[WIDE_TRIP_COUNT]], [[TMP2]]
+; EVL-NEXT:    [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP17]]
+; EVL-NEXT:    [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
+; EVL-NEXT:    [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
+; EVL-NEXT:    [[TMP4:%.*]] = mul i64 [[TMP3]], 4
+; EVL-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[A]], i64 0
+; EVL-NEXT:    [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
+; EVL-NEXT:    br label %[[VECTOR_BODY:.*]]
+; EVL:       [[VECTOR_BODY]]:
+; EVL-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[EVL_BASED_IV:%.*]] = phi i64 [ 0, %[[VECTOR_PH]] ], [ [[INDEX_EVL_NEXT:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[CSA_MASK_PHI:%.*]] = phi <vscale x 4 x i1> [ zeroinitializer, %[[VECTOR_PH]] ], [ [[CSA_MASK_SEL:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[CSA_DATA_PHI:%.*]] = phi <vscale x 4 x i32> [ poison, %[[VECTOR_PH]] ], [ [[CSA_DATA_SEL:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[CSA_VL_PHI:%.*]] = phi i32 [ 0, %[[VECTOR_PH]] ], [ [[CSA_VL_SEL:%.*]], %[[VECTOR_BODY]] ]
+; EVL-NEXT:    [[AVL:%.*]] = sub i64 [[WIDE_TRIP_COUNT]], [[EVL_BASED_IV]]
+; EVL-NEXT:    [[TMP5:%.*]] = call i32 @llvm.experimental.get.vector.length.i64(i64 [[AVL]], i32 4, i1 true)
+; EVL-NEXT:    [[TMP6:%.*]] = add i64 [[EVL_BASED_IV]], 0
+; EVL-NEXT:    [[TMP7:%.*]] = getelementptr inbounds i32, ptr [[DATA]], i64 [[TMP6]]
+; EVL-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i32, ptr [[TMP7]], i32 0
+; EVL-NEXT:    [[VP_OP_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr align 4 [[TMP8]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP5]])
+; EVL-NEXT:    [[TMP9:%.*]] = sext <vscale x 4 x i32> [[VP_OP_LOAD]] to <vscale x 4 x i64>
+; EVL-NEXT:    [[TMP10:%.*]] = icmp slt <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[TMP9]]
+; EVL-NEXT:    [[CSA_COND_ANYACTIVE:%.*]] = call i1 @llvm.vp.reduce.or.nxv4i1(i1 false, <vscale x 4 x i1> [[TMP10]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), i32 [[TMP5]])
+; EVL-NEXT:    [[CSA_VL_SEL]] = select i1 [[CSA_COND_ANYACTIVE]], i32 [[TMP5]], i32 [[CSA_VL_PHI]]
----------------
huntergr-arm wrote:

I'm curious what the eventual asm looks like for this. I had a look through the RVV spec and couldn't find an instruction which selected between whole registers based on a single condition, just merging moves which do a per-lane select. I may have missed something though. Does this require branches inside the loop?

I ask because I'm trying to evaluate this work from an SVE perspective, and I think I would prefer to do the extract inside the loop using `clastb` with a scalar phi instead of outside the loop with a branch on the inside to conditionally update the whole register.

I guess we may end up with diverging recipes for vectorizing the same pattern.

https://github.com/llvm/llvm-project/pull/106560