[PATCH] D148841: [LV] Use SCEV for uniformity analysis across VF

Florian Hahn via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun May 28 12:04:05 PDT 2023


fhahn added inline comments.


================
Comment at: llvm/include/llvm/Analysis/LoopAccessAnalysis.h:591-592
 
-  /// Returns true if the value V is uniform within the loop.
-  bool isUniform(Value *V) const;
+  /// Returns true if the value V is uniform within the loop. If \p VF is
+  /// provided, check if \p V is uniform across \p VF.
+  bool isUniform(Value *V, std::optional<ElementCount> VF = std::nullopt) const;
----------------
Ayal wrote:
> 
Clarified, thanks!


================
Comment at: llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h:351
+  /// Returns true if the value V is uniform within the loop. If \p VF is
+  /// provided, check if \p V is uniform across \p VF.
+  bool isUniform(Value *V, std::optional<ElementCount> VF = std::nullopt) const;
----------------
Ayal wrote:
> ditto
updated, thanks!


================
Comment at: llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h:354-356
   /// A uniform memory op is a load or store which accesses the same memory
-  /// location on all lanes.
-  bool isUniformMemOp(Instruction &I) const;
+  /// location on all lanes.  If \p VF is provided, check if \p I is uniform
+  /// across \p VF.
----------------
Ayal wrote:
> nit: w/o VF, uniformity falls back to loop invariance.
Updated with similar wording to isUniform, thanks!


================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2543
+namespace {
+/// Rewriter to replace AddRecs for TheLoop with new AddRecs where the step is
+/// multiplied by StepMultiplier and Offset * Step is added. Also checks if all
----------------
Ayal wrote:
> "Rewriter is designed to build the SCEVs for each of the VF lanes in the expected vectorized loop, which can then be compared to detect their uniformity.
> This is done by replacing the AddRec SCEVs of the original scalar loop with new AddRecs ..."
> 
> Should the name "SCEVAddRecRewriter" convey what it's for?
Adjusted the comment, thanks! 

Updated name to `SCEVAddRecForUniformityRewriter`


================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2561
+  /// Set to true if the expression contains an UDiv expression. This is used to
+  /// limit the number of analyzed expressions for compile-time reasons.
+  bool FoundUDiv = false;
----------------
Ayal wrote:
> nit: explain why a non-loop-invariant uniform value is expected to involve a UDiv. Would an SDiv also work?
> This saves time by potentially bailing out after building a single (UDiv-free) expression for FirstLane, w/o building another expression for another lane, rather than saving in building an expression itself.
Expanded the comment, thanks!


================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2572
+    assert(Expr->getLoop() == TheLoop &&
+           "addrec outside of TheLoop must be invariant");
+    // Build a new AddRec by multiplying the step by StepMultiplier and
----------------
Ayal wrote:
> nit: right, point (of error message) is that such addrec's should have been checked earlier?
Extended message to try to make this clear.


================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2600
+    // The value could vary across iterations.
+    CannotAnalyze = true;
+    return S;
----------------
Ayal wrote:
> nit: return a SCEVCouldNotCompute instead, SCEV's inherent 'CannotAnalyze'?
Unfortunately that doesn't work without additional changes, as the returned value may be used by the rewriter to construct the parent SCEV.


================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2610
+
+  bool canAnalyze() const { return FoundUDiv && !CannotAnalyze; }
+};
----------------
Ayal wrote:
> nit: can return SCEVCouldNotCompute (or null) at the end if not FoundUDiv.
Unfortunately that doesn't work without additional changes, as the returned value may be used by the rewriter to construct the parent SCEV.



================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2631
+  // Rewrite AddRecs in TheLoop to step by VF and create 2 expressions, one
+  // computing the value of the first and one for the last vector lane. Consider
+  // V uniform if the SCEV expressions are equal.
----------------
Ayal wrote:
> Update comment, fold LastLane into an additional VF-1 iteration of the loop, drop the max(1,VF-2).
Simplified the code, thanks!


================
Comment at: llvm/lib/Analysis/LoopAccessAnalysis.cpp:2649
+      auto *IthLaneExpr = IthLaneRewriter.visit(S);
+      if (!IthLaneRewriter.canAnalyze() || FirstLaneExpr != IthLaneExpr)
+        return false;
----------------
Ayal wrote:
> nit: suffice to check that SCEVs are equal?
Simplified, thanks!


================
Comment at: llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll:427
+; CHECK-NEXT:    [[TMP39]] = add <4 x i32> [[VEC_PHI6]], [[TMP35]]
+; CHECK-NEXT:    [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 16
+; CHECK-NEXT:    [[VEC_IND_NEXT]] = add <4 x i64> [[STEP_ADD2]], <i64 4, i64 4, i64 4, i64 4>
----------------
Ayal wrote:
> note: VF=4 is mandated, previous UF=2 decision with 8 loads is now UF=4 with 4 (uniform) loads & broadcasts.
> 
> The load from test_base[i/8] could further fold two 'parts' of VF=4 together, as it is uniform across VF=8 / VF=4,UF=2.
> Worth leaving behind some **assume**, if not folding directly? I.e., record the largest VF for which uniformity was detected, even if a smaller VF is selected.
> Worth optimizing the analysis across VF's, i.e., if a value is known not to be uniform for VF avoid checking for 2*VF? OTOH, if a value is known to be uniform for VF, check only two SCEVs for 2*VF?
> The load from test_base[i/8] could further fold two 'parts' of VF=4 together, as it is uniform across VF=8 / VF=4,UF=2.

I think that would be good as follow-up.

> Worth optimizing the analysis across VF's, i.e., if a value is known not to be uniform for VF avoid checking for 2*VF

I was thinking about evaluating something like that as follow-up optimization. WDYT?


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_and.ll:78
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[TMP2]], align 8
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP3]], i64 0
----------------
Ayal wrote:
> note: this load from A[iv & -2] is now recognized as uniform across VF=2.
Added as a comment, thanks!


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_div_urem.ll:6
 
 ; for (iv = 0 ; ; iv += 1) B[iv] = A[(iv/2)%3];
 define void @ld_div2_urem3_1(ptr noalias %A, ptr noalias %B) {
----------------
Ayal wrote:
> note: load from A[(iv/2)%3] rightfully not recognized as uniform for VF=8.
Added comment, thanks!


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_div_urem.ll:294
 ; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[TMP1:%.*]] = udiv <8 x i64> [[VEC_IND]], <i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8, i64 8>
-; CHECK-NEXT:    [[TMP2:%.*]] = urem <8 x i64> [[TMP1]], <i64 3, i64 3, i64 3, i64 3, i64 3, i64 3, i64 3, i64 3>
-; CHECK-NEXT:    [[TMP3:%.*]] = extractelement <8 x i64> [[TMP2]], i32 0
-; CHECK-NEXT:    [[TMP4:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP3]]
-; CHECK-NEXT:    [[TMP5:%.*]] = extractelement <8 x i64> [[TMP2]], i32 1
-; CHECK-NEXT:    [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP5]]
-; CHECK-NEXT:    [[TMP7:%.*]] = extractelement <8 x i64> [[TMP2]], i32 2
-; CHECK-NEXT:    [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP7]]
-; CHECK-NEXT:    [[TMP9:%.*]] = extractelement <8 x i64> [[TMP2]], i32 3
-; CHECK-NEXT:    [[TMP10:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP9]]
-; CHECK-NEXT:    [[TMP11:%.*]] = extractelement <8 x i64> [[TMP2]], i32 4
-; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP11]]
-; CHECK-NEXT:    [[TMP13:%.*]] = extractelement <8 x i64> [[TMP2]], i32 5
-; CHECK-NEXT:    [[TMP14:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP13]]
-; CHECK-NEXT:    [[TMP15:%.*]] = extractelement <8 x i64> [[TMP2]], i32 6
-; CHECK-NEXT:    [[TMP16:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP15]]
-; CHECK-NEXT:    [[TMP17:%.*]] = extractelement <8 x i64> [[TMP2]], i32 7
-; CHECK-NEXT:    [[TMP18:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP17]]
-; CHECK-NEXT:    [[TMP19:%.*]] = load i64, ptr [[TMP4]], align 8
-; CHECK-NEXT:    [[TMP20:%.*]] = load i64, ptr [[TMP6]], align 8
-; CHECK-NEXT:    [[TMP21:%.*]] = load i64, ptr [[TMP8]], align 8
-; CHECK-NEXT:    [[TMP22:%.*]] = load i64, ptr [[TMP10]], align 8
-; CHECK-NEXT:    [[TMP23:%.*]] = load i64, ptr [[TMP12]], align 8
-; CHECK-NEXT:    [[TMP24:%.*]] = load i64, ptr [[TMP14]], align 8
-; CHECK-NEXT:    [[TMP25:%.*]] = load i64, ptr [[TMP16]], align 8
-; CHECK-NEXT:    [[TMP26:%.*]] = load i64, ptr [[TMP18]], align 8
-; CHECK-NEXT:    [[TMP27:%.*]] = insertelement <8 x i64> poison, i64 [[TMP19]], i32 0
-; CHECK-NEXT:    [[TMP28:%.*]] = insertelement <8 x i64> [[TMP27]], i64 [[TMP20]], i32 1
-; CHECK-NEXT:    [[TMP29:%.*]] = insertelement <8 x i64> [[TMP28]], i64 [[TMP21]], i32 2
-; CHECK-NEXT:    [[TMP30:%.*]] = insertelement <8 x i64> [[TMP29]], i64 [[TMP22]], i32 3
-; CHECK-NEXT:    [[TMP31:%.*]] = insertelement <8 x i64> [[TMP30]], i64 [[TMP23]], i32 4
-; CHECK-NEXT:    [[TMP32:%.*]] = insertelement <8 x i64> [[TMP31]], i64 [[TMP24]], i32 5
-; CHECK-NEXT:    [[TMP33:%.*]] = insertelement <8 x i64> [[TMP32]], i64 [[TMP25]], i32 6
-; CHECK-NEXT:    [[TMP34:%.*]] = insertelement <8 x i64> [[TMP33]], i64 [[TMP26]], i32 7
-; CHECK-NEXT:    [[TMP35:%.*]] = add nsw <8 x i64> [[TMP34]], <i64 42, i64 42, i64 42, i64 42, i64 42, i64 42, i64 42, i64 42>
-; CHECK-NEXT:    [[TMP36:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]
-; CHECK-NEXT:    [[TMP37:%.*]] = getelementptr inbounds i64, ptr [[TMP36]], i32 0
-; CHECK-NEXT:    store <8 x i64> [[TMP35]], ptr [[TMP37]], align 8
+; CHECK-NEXT:    [[TMP1:%.*]] = udiv i64 [[TMP0]], 8
+; CHECK-NEXT:    [[TMP2:%.*]] = urem i64 [[TMP1]], 3
----------------
Ayal wrote:
> note: this load from A[(iv / 8) % 3] is now recognized as uniform for VF=8.
Added comment, thanks!


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_lshr.ll:78
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP1]]
+; CHECK-NEXT:    [[TMP3:%.*]] = load i64, ptr [[TMP2]], align 8
+; CHECK-NEXT:    [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i64> poison, i64 [[TMP3]], i64 0
----------------
Ayal wrote:
> note: this load from A[iv >> 1] is now recognized as uniform for VF=2.
> 
> Check that it is not considered uniform for VF=4?
Added check lines for VF=4 as well, separately.


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_lshr.ll:136
 ; CHECK-NEXT:    [[TMP0:%.*]] = add i64 [[INDEX]], 0
-; CHECK-NEXT:    [[TMP1:%.*]] = lshr <2 x i64> [[VEC_IND]], <i64 2, i64 2>
-; CHECK-NEXT:    [[TMP2:%.*]] = extractelement <2 x i64> [[TMP1]], i32 0
-; CHECK-NEXT:    [[TMP3:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP2]]
-; CHECK-NEXT:    [[TMP4:%.*]] = extractelement <2 x i64> [[TMP1]], i32 1
-; CHECK-NEXT:    [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP4]]
-; CHECK-NEXT:    [[TMP6:%.*]] = load i64, ptr [[TMP3]], align 8
-; CHECK-NEXT:    [[TMP7:%.*]] = load i64, ptr [[TMP5]], align 8
-; CHECK-NEXT:    [[TMP8:%.*]] = insertelement <2 x i64> poison, i64 [[TMP6]], i32 0
-; CHECK-NEXT:    [[TMP9:%.*]] = insertelement <2 x i64> [[TMP8]], i64 [[TMP7]], i32 1
-; CHECK-NEXT:    [[TMP10:%.*]] = add nsw <2 x i64> [[TMP9]], <i64 42, i64 42>
-; CHECK-NEXT:    [[TMP11:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[TMP0]]
-; CHECK-NEXT:    [[TMP12:%.*]] = getelementptr inbounds i64, ptr [[TMP11]], i32 0
-; CHECK-NEXT:    store <2 x i64> [[TMP10]], ptr [[TMP12]], align 8
+; CHECK-NEXT:    [[TMP1:%.*]] = lshr i64 [[TMP0]], 2
+; CHECK-NEXT:    [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[TMP1]]
----------------
Ayal wrote:
> note: load from A[iv>>2] recognized as uniform for VF=2, should also hold for VF=4.
added comment, thanks!


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction1_lshr.ll:462
 
 ; for (iv = 1 ; ; iv += 1) B[iv] = A[iv>>1] + 42;
 define void @ld_lshr1_step1_start1_ind1(ptr noalias %A, ptr noalias %B) {
----------------
Ayal wrote:
> note: load from A[1+i>>1] not recognized as uniform for VF=2 due to alignment.
Added note, thanks!


================
Comment at: llvm/test/Transforms/LoopVectorize/uniform_across_vf_induction2.ll:83
 
 ; for (iv = 0, iv2 = 0 ; ; iv += 1, iv2 += 1) B[iv] = A[iv/2 + iv2/2] + 42;
 define void @ld_div2_step1_start0_ind2(ptr noalias %A, ptr noalias %B) {
----------------
Ayal wrote:
> note: load from A[iv/2 + iv2/2] i.e. A[2*(iv/2)] recognized as uniform for VF=2, but should not for VF > 2.
added note, thanks!


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D148841/new/

https://reviews.llvm.org/D148841


