[PATCH] D148841: [LV] Use SCEV for uniformity analysis across VF
Ayal Zaks via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed May 24 02:14:51 PDT 2023
Ayal added a subscriber: simoll.
Ayal added a comment.
+@simoll.
This hopefully also impacts which VPlans are built across the possible VF range, e.g., preferring VF=2 over VF=4 when the former has more uniforms than the latter (but none are fully invariant, which is all that is detected today, so both VFs are covered by the same VPlan); although this seems undetected by the current tests?
Adding various nits.
================
Comment at: llvm/include/llvm/Analysis/LoopAccessAnalysis.h:591-592
- /// Returns true if the value V is uniform within the loop.
- bool isUniform(Value *V) const;
+ /// Returns true if the value V is uniform within the loop. If \p VF is
+ /// provided, check if \p V is uniform across \p VF.
+ bool isUniform(Value *V, std::optional<ElementCount> VF = std::nullopt) const;
----------------
================
Comment at: llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h:351
+ /// Returns true if the value V is uniform within the loop. If \p VF is
+ /// provided, check if \p V is uniform across \p VF.
+ bool isUniform(Value *V, std::optional<ElementCount> VF = std::nullopt) const;
----------------
ditto
================
Comment at: llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h:354-356
/// A uniform memory op is a load or store which accesses the same memory
- /// location on all lanes.
- bool isUniformMemOp(Instruction &I) const;
+ /// location on all lanes. If \p VF is provided, check if \p I is uniform
+ /// across \p VF.
----------------
nit: w/o VF, uniformity falls back to loop invariance.
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2543
+namespace {
+/// Rewriter to replace AddRecs for TheLoop with new AddRecs where the step is
+/// multiplied by StepMultiplier and Offset * Step is added. Also checks if all
----------------
"Rewriter is designed to build the SCEVs for each of the VF lanes in the expected vectorized loop, which can then be compared to detect their uniformity.
This is done by replacing the AddRec SCEVs of the original scalar loop with new AddRecs ..."
Should the name "SCEVAddRecRewriter" convey what it's for?
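For illustration, a minimal sketch of such a rewriter (class and member names here are hypothetical, not taken from the patch); it rewrites {Start,+,Step}<TheLoop> for lane Offset under a vectorized step of StepMultiplier into {Start + Offset*Step,+,StepMultiplier*Step}:

#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include <cassert>
using namespace llvm;

namespace {
class LaneRewriter : public SCEVRewriteVisitor<LaneRewriter> {
  const Loop *TheLoop;
  const SCEV *StepMultiplier;
  const SCEV *Offset;

public:
  /// Set when a UDiv is seen; its absence allows an early bail-out.
  bool FoundUDiv = false;
  /// Set when an expression that may vary per iteration is encountered.
  bool CannotAnalyze = false;

  LaneRewriter(ScalarEvolution &SE, const Loop *TheLoop,
               const SCEV *StepMultiplier, const SCEV *Offset)
      : SCEVRewriteVisitor(SE), TheLoop(TheLoop),
        StepMultiplier(StepMultiplier), Offset(Offset) {}

  const SCEV *visitAddRecExpr(const SCEVAddRecExpr *Expr) {
    assert(Expr->getLoop() == TheLoop &&
           "addrec outside of TheLoop must be invariant");
    const SCEV *Step = Expr->getStepRecurrence(SE);
    const SCEV *NewStart =
        SE.getAddExpr(Expr->getStart(), SE.getMulExpr(Offset, Step));
    const SCEV *NewStep = SE.getMulExpr(StepMultiplier, Step);
    // Conservatively drop wrap flags for the widened step.
    return SE.getAddRecExpr(NewStart, NewStep, TheLoop, SCEV::FlagAnyWrap);
  }

  const SCEV *visitUDivExpr(const SCEVUDivExpr *Expr) {
    FoundUDiv = true;
    return SCEVRewriteVisitor<LaneRewriter>::visitUDivExpr(Expr);
  }

  const SCEV *visitUnknown(const SCEVUnknown *Expr) {
    if (!SE.isLoopInvariant(Expr, TheLoop))
      CannotAnalyze = true; // the value could vary across iterations
    return Expr;
  }

  bool canAnalyze() const { return FoundUDiv && !CannotAnalyze; }
};
} // namespace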
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2561
+ /// Set to true if the expression contains an UDiv expression. This is used to
+ /// limit the number of analyzed expressions for compile-time reasons.
+ bool FoundUDiv = false;
----------------
nit: explain why a non-loop-invariant uniform value is expected to involve a UDiv. Would an SDiv also work?
The saving here comes from potentially bailing out after building a single (UDiv-free) expression for FirstLane, without building expressions for the other lanes, rather than from saving the cost of building an expression itself.
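For illustration (an assumed example, not one of the added tests): in the loop below the load address is not loop-invariant, yet lanes 2k and 2k+1 of a VF=2 vector read the same element, and SCEV expresses the lane-collapsing index i/2 as a UDiv (the added _lshr and _and tests suggest iv>>1 and iv&-2 are likewise modeled via UDiv); an expression containing no UDiv can therefore be dismissed after rewriting only the first lane:

void foo(long *A, long *B, unsigned long N) {
  for (unsigned long i = 0; i < N; ++i)
    B[i] = A[i / 2] + 42; // uniform across VF=2, but not loop-invariant
}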
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2572
+ assert(Expr->getLoop() == TheLoop &&
+ "addrec outside of TheLoop must be invariant");
+ // Build a new AddRec by multiplying the step by StepMultiplier and
----------------
nit: right, the point of the error message is that such addrecs should have been checked (and filtered out) earlier?
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2600
+ // The value could vary across iterations.
+ CannotAnalyze = true;
+ return S;
----------------
nit: return a SCEVCouldNotCompute instead? That is SCEV's inherent 'CannotAnalyze'.
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2610
+
+ bool canAnalyze() const { return FoundUDiv && !CannotAnalyze; }
+};
----------------
nit: could return SCEVCouldNotCompute (or null) at the end if FoundUDiv is false.
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2631
+ // Rewrite AddRecs in TheLoop to step by VF and create 2 expressions, one
+ // computing the value of the first and one for the last vector lane. Consider
+ // V uniform if the SCEV expressions are equal.
----------------
Update the comment; fold LastLane into the loop as one additional iteration (lane VF-1), and drop the max(1, VF-2).
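A rough sketch of the restructuring this suggests, building on the hypothetical LaneRewriter above (a fixed VF and an integer-typed SCEV, e.g. a GEP index, are assumed):

#include "llvm/Support/TypeSize.h" // ElementCount

static bool isUniformAcrossVF(const SCEV *S, ScalarEvolution &SE,
                              const Loop *TheLoop, ElementCount VF) {
  if (VF.isScalable())
    return false;
  unsigned FixedVF = VF.getFixedValue();
  const SCEV *VFExpr = SE.getConstant(S->getType(), FixedVF);

  LaneRewriter FirstLane(SE, TheLoop, VFExpr, SE.getZero(S->getType()));
  const SCEV *FirstLaneExpr = FirstLane.visit(S);
  if (!FirstLane.canAnalyze())
    return false;

  // Fold the former LastLane handling into the loop: lanes 1..VF-1.
  for (unsigned Lane = 1; Lane != FixedVF; ++Lane) {
    LaneRewriter LaneI(SE, TheLoop, VFExpr,
                       SE.getConstant(S->getType(), Lane));
    const SCEV *LaneExpr = LaneI.visit(S);
    // SCEVs are uniqued, so pointer equality suffices.
    if (!LaneI.canAnalyze() || LaneExpr != FirstLaneExpr)
      return false;
  }
  return true;
}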
================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2649
+ auto *IthLaneExpr = IthLaneRewriter.visit(S);
+ if (!IthLaneRewriter.canAnalyze() || FirstLaneExpr != IthLaneExpr)
+ return false;
----------------
nit: would it suffice to check that the SCEVs are equal, or is the canAnalyze() check also needed here?
================
Comment at: llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll:427
+; CHECK-NEXT: [[TMP39]] = add <4 x i32> [[VEC_PHI6]], [[TMP35]]
+; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i64> [[STEP_ADD2]], <i64 4, i64 4, i64 4, i64 4>
----------------
note: VF=4 is mandated; the previous UF=2 decision with 8 loads is now UF=4 with 4 (uniform) loads & broadcasts.
The load from test_base[i/8] could further fold two 'parts' of VF=4 together, as it is uniform across VF=8, i.e., across VF=4,UF=2.
Worth leaving behind some **assume**, if not folding directly? I.e., record the largest VF for which uniformity was detected, even if a smaller VF is selected (sketched below).
Worth optimizing the analysis across VFs, i.e., if a value is known not to be uniform for VF, avoid checking 2*VF? OTOH, if a value is known to be uniform for VF, check only two SCEVs for 2*VF?
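Purely illustrative sketch of the 'record the largest uniform VF' idea (no such hook exists in the patch; it only assumes the isUniform(V, VF) interface added here):

#include "llvm/Analysis/LoopAccessAnalysis.h"
#include <optional>
using namespace llvm;

static std::optional<ElementCount>
getWidestUniformVF(Value *V, unsigned MaxWidth, const LoopAccessInfo &LAI) {
  std::optional<ElementCount> Widest;
  for (unsigned Width = 2; Width <= MaxWidth; Width *= 2) {
    ElementCount VF = ElementCount::getFixed(Width);
    if (LAI.isUniform(V, VF))
      Widest = VF; // remember even if a narrower VF is eventually selected
  }
  return Widest;
}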
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_and.ll:78
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT: [[TMP3:%.*]] = load i64, ptr [[TMP2]], align 8
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP3]], i64 0
----------------
note: this load from A[iv & -2] is now recognized as uniform across VF=2.
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_div_urem.ll:6
; for (iv = 0 ; ; iv += 1) B[iv] = A[(iv/2)%3];
define void @ld_div2_urem3_1(ptr noalias %A, ptr noalias %B) {
----------------
note: load from A[(iv/2)%3] is rightly not recognized as uniform for VF=8.
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_div_urem.ll:294
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[TMP1:%.*]] = udiv <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; CHECK-NEXT: [[TMP2:%.*]] = urem <8 x i64> [[TMP1]], <i64 3, i64 3, i64 3, i64 3, i64 3, i64 3, i64 3, i64 3>
-; CHECK-NEXT: [[TMP3:%.*]] = extractelement <8 x i64> [[TMP2]], i32 0
-; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP3]]
-; CHECK-NEXT: [[TMP5:%.*]] = extractelement <8 x i64> [[TMP2]], i32 1
-; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP5]]
-; CHECK-NEXT: [[TMP7:%.*]] = extractelement <8 x i64> [[TMP2]], i32 2
-; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP7]]
-; CHECK-NEXT: [[TMP9:%.*]] = extractelement <8 x i64> [[TMP2]], i32 3
-; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP9]]
-; CHECK-NEXT: [[TMP11:%.*]] = extractelement <8 x i64> [[TMP2]], i32 4
-; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP11]]
-; CHECK-NEXT: [[TMP13:%.*]] = extractelement <8 x i64> [[TMP2]], i32 5
-; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP13]]
-; CHECK-NEXT: [[TMP15:%.*]] = extractelement <8 x i64> [[TMP2]], i32 6
-; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP15]]
-; CHECK-NEXT: [[TMP17:%.*]] = extractelement <8 x i64> [[TMP2]], i32 7
-; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP17]]
-; CHECK-NEXT: [[TMP19:%.*]] = load i64, ptr [[TMP4]], align 8
-; CHECK-NEXT: [[TMP20:%.*]] = load i64, ptr [[TMP6]], align 8
-; CHECK-NEXT: [[TMP21:%.*]] = load i64, ptr [[TMP8]], align 8
-; CHECK-NEXT: [[TMP22:%.*]] = load i64, ptr [[TMP10]], align 8
-; CHECK-NEXT: [[TMP23:%.*]] = load i64, ptr [[TMP12]], align 8
-; CHECK-NEXT: [[TMP24:%.*]] = load i64, ptr [[TMP14]], align 8
-; CHECK-NEXT: [[TMP25:%.*]] = load i64, ptr [[TMP16]], align 8
-; CHECK-NEXT: [[TMP26:%.*]] = load i64, ptr [[TMP18]], align 8
-; CHECK-NEXT: [[TMP27:%.*]] = insertelement <8 x i64> poison, i64 [[TMP19]], i32 0
-; CHECK-NEXT: [[TMP28:%.*]] = insertelement <8 x i64> [[TMP27]], i64 [[TMP20]], i32 1
-; CHECK-NEXT: [[TMP29:%.*]] = insertelement <8 x i64> [[TMP28]], i64 [[TMP21]], i32 2
-; CHECK-NEXT: [[TMP30:%.*]] = insertelement <8 x i64> [[TMP29]], i64 [[TMP22]], i32 3
-; CHECK-NEXT: [[TMP31:%.*]] = insertelement <8 x i64> [[TMP30]], i64 [[TMP23]], i32 4
-; CHECK-NEXT: [[TMP32:%.*]] = insertelement <8 x i64> [[TMP31]], i64 [[TMP24]], i32 5
-; CHECK-NEXT: [[TMP33:%.*]] = insertelement <8 x i64> [[TMP32]], i64 [[TMP25]], i32 6
-; CHECK-NEXT: [[TMP34:%.*]] = insertelement <8 x i64> [[TMP33]], i64 [[TMP26]], i32 7
-; CHECK-NEXT: [[TMP35:%.*]] = add nsw <8 x i64> [[TMP34]], <i64 42, i64 42, i64 42, i64 42, i64 42, i64 42, i64 42, i64 42>
-; CHECK-NEXT: [[TMP36:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]
-; CHECK-NEXT: [[TMP37:%.*]] = getelementptr inbounds i64, ptr [[TMP36]], i32 0
-; CHECK-NEXT: store <8 x i64> [[TMP35]], ptr [[TMP37]], align 8
+; CHECK-NEXT: [[TMP1:%.*]] = udiv i64 [[TMP0]], 8
+; CHECK-NEXT: [[TMP2:%.*]] = urem i64 [[TMP1]], 3
----------------
note: this load from A[(iv / 8) % 3] is now recognized as uniform for VF=8.
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_lshr.ll:78
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT: [[TMP3:%.*]] = load i64, ptr [[TMP2]], align 8
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP3]], i64 0
----------------
note: this load from A[iv >> 1] is now recognized as uniform for VF=2.
Worth also checking that it is not considered uniform for VF=4?
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_lshr.ll:136
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT: [[TMP1:%.*]] = lshr <2 x i64> [[VEC_IND]], <i64 2, i64 2>
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i64> [[TMP1]], i32 0
-; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP2]]
-; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x i64> [[TMP1]], i32 1
-; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP4]]
-; CHECK-NEXT: [[TMP6:%.*]] = load i64, ptr [[TMP3]], align 8
-; CHECK-NEXT: [[TMP7:%.*]] = load i64, ptr [[TMP5]], align 8
-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i64> poison, i64 [[TMP6]], i32 0
-; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i64> [[TMP8]], i64 [[TMP7]], i32 1
-; CHECK-NEXT: [[TMP10:%.*]] = add nsw <2 x i64> [[TMP9]], <i64 42, i64 42>
-; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]
-; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[TMP11]], i32 0
-; CHECK-NEXT: store <2 x i64> [[TMP10]], ptr [[TMP12]], align 8
+; CHECK-NEXT: [[TMP1:%.*]] = lshr i64 [[TMP0]], 2
+; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP1]]
----------------
note: load from A[iv>>2] is recognized as uniform for VF=2; uniformity should also hold for VF=4.
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_lshr.ll:462
; for (iv = 1 ; ; iv += 1) B[iv] = A[iv>>1] + 42;
define void @ld_lshr1_step1_start1_ind1(ptr noalias %A, ptr noalias %B) {
----------------
note: load from A[iv>>1] with iv starting at 1, i.e., A[(1+i)>>1], is not recognized as uniform for VF=2 due to the misaligned start.
================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction2.ll:83
; for (iv = 0, iv2 = 0 ; ; iv += 1, iv2 += 1) B[iv] = A[iv/2 + iv2/2] + 42;
define void @ld_div2_step1_start0_ind2(ptr noalias %A, ptr noalias %B) {
----------------
note: load from A[iv/2 + iv2/2], i.e., A[2*(iv/2)], is recognized as uniform for VF=2, but should not be for VF > 2 (e.g., for VF=4 the lanes read A[4k], A[4k], A[4k+2], A[4k+2]).
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D148841/new/
https://reviews.llvm.org/D148841