[PATCH] D153696: [LV] Only generate 1st part outside of vector region for VPInstruction.

Sun Jul 2 02:14:08 PDT 2023

Ayal added a comment.

Looks good to me, thanks!
It may be good to reason about the VPlan recipes, and/or show their dump, for tests where changes in the IR are harder to track because the diff is misaligned.

================
Comment at: llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp:376
+    if (!hasResult()) {
+      assert(
+          !GeneratedValue &&
----------------
The other assert is equally, if not more, important: if hasResult() holds, then the GeneratedValue being set in State must not be null?
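
To make the two-way check concrete, here is a minimal standalone sketch. It does not use the actual VPInstruction or VPTransformState code; FakeValue and isConsistent are hypothetical names introduced only for illustration. It assumes the intended invariant is that a value is generated exactly when the recipe defines a result.

```cpp
#include <cassert>

// Hypothetical stand-in for an IR value; this is NOT llvm::Value.
struct FakeValue {
  int Id;
};

// Assumed invariant: a recipe that defines a result must have generated a
// non-null value, and a recipe without a result must have generated none.
bool isConsistent(bool HasResult, const FakeValue *Generated) {
  return HasResult == (Generated != nullptr);
}
```

In the patch this would correspond to asserting both directions around the point where the value is recorded in State, rather than only the no-result direction.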

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll:394
 ; CHECK-ORDERED-TF-NEXT:    [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
 ; CHECK-ORDERED-TF-NEXT:    [[TMP3:%.*]] = mul i64 [[TMP2]], 32
 ; CHECK-ORDERED-TF-NEXT:    [[TMP4:%.*]] = sub i64 [[TMP3]], 1
----------------
While we're here, the vscale calls above seem to provide additional opportunities for deduplication in the preheader.

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll:403
+; CHECK-ORDERED-TF-NEXT:    [[TMP8:%.*]] = icmp ugt i64 [[N]], [[TMP6]]
+; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
 ; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[N]])
----------------
It suffices to have one instance of this {call-vscale;mul-by-**32**;sub;icmp;select} sequence defined in the preheader, rather than four uniform replicas. But what's the reason for eliminating the {call-vscale;mul-by-**8/16/24**;add} **non-identical** sequences defined in the preheader, which are not uniform? It would be good to clarify whether they all stem from the same recipe, and what the UF is.

Note that, in general, a recipe placed in the preheader could prepare a distinct value for each part, e.g., to initialize an add reduction with a distinct starting value in the first lane/part, rather than adding it to the sum at the end. But for a value that is uniform across UF, it suffices to generate it for a single part and reuse it across all parts.
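
The distinction can be sketched outside of LLVM as follows. This is a simplified model, not the VPlan implementation: Materialized, materialize, UniformAcrossUF and Gen are hypothetical names, and whether a value is uniform across UF is assumed to be known up front.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical result type: one value per unroll part, plus a count of how
// many times the generator was actually invoked.
struct Materialized {
  std::vector<std::uint64_t> PerPart;
  unsigned NumGenerated;
};

// Sketch: generate a preheader value for each of UF parts. If the value is
// known to be uniform across parts, generate it for part 0 only and reuse it;
// otherwise every part needs its own instance.
Materialized
materialize(bool UniformAcrossUF, unsigned UF,
            const std::function<std::uint64_t(unsigned)> &Gen) {
  Materialized M{std::vector<std::uint64_t>(UF), 0};
  for (unsigned Part = 0; Part < UF; ++Part) {
    if (UniformAcrossUF && Part > 0) {
      M.PerPart[Part] = M.PerPart[0]; // reuse the single generated instance
      continue;
    }
    M.PerPart[Part] = Gen(Part); // distinct (or first) instance
    ++M.NumGenerated;
  }
  return M;
}
```

With UF=4, a generator that ignores its part argument (like the mul-by-32;sub sequence) is invoked once when marked uniform, while a generator that depends on the part (like a per-part multiple of vscale) must still be invoked four times.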

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll:404
+; CHECK-ORDERED-TF-NEXT:    [[TMP9:%.*]] = select i1 [[TMP8]], i64 [[TMP7]], i64 0
 ; CHECK-ORDERED-TF-NEXT:    [[ACTIVE_LANE_MASK_ENTRY:%.*]] = call <vscale x 8 x i1> @llvm.get.active.lane.mask.nxv8i1.i64(i64 0, i64 [[N]])
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
----------------
Fine - it suffices to have one instance of this call-active-lane-mask defined in the preheader, rather than four replicas.

================
Comment at: llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll:406
 ; CHECK-ORDERED-TF-NEXT:    br label [[VECTOR_BODY:%.*]]
 ; CHECK-ORDERED-TF:       vector.body:
 ; CHECK-ORDERED-TF-NEXT:    [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
----------------
Fine - there are no real changes inside vector.body, just a misalignment in the diff.

================
Comment at: llvm/test/Transforms/LoopVectorize/scalable-first-order-recurrence.ll:205
 ; CHECK-VF4UF2: %[[SPLAT_INS1:.*]] = insertelement <vscale x 4 x i32> poison, i32 %x, i64 0
 ; CHECK-VF4UF2: %[[SPLAT1:.*]] = shufflevector <vscale x 4 x i32> %[[SPLAT_INS1]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
 ; ; CHECK-VF4UF2: vector.body
----------------
Fine - it suffices to have a single instance of this identical broadcasting {insert; shuffle} sequence in the preheader.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153696/new/

https://reviews.llvm.org/D153696