[PATCH] D67905: [LV] Vectorizer should adjust trip count in profile information

Wed Nov 20 03:50:50 PST 2019

Ayal added a comment.

Adding a few comments. Would be good to generalize and apply also to loop unroll (and jam).

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3989
+                                           OrigFallThroughCount) ||
+      OrigFallThroughCount == 0)
+    return;
----------------
OrigFallThroughCount can still be either the exit count or the continue-to-next-iteration count, according to the code below. Wait to test whether it's zero until we know what it stands for?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4001
+  if (!IsTrueBackEdgeOrigLoop)
+    std::swap(OrigTakenCount, OrigFallThroughCount);
+
----------------
Better use distinct names, e.g., OrigExitCount and OrigBackedgeTakenCount, than continue to call them Taken and FallThrough. Perhaps use Weight instead of Count, to denote total profile frequencies, as the latter is used elsewhere to denote the actual per-invocation TripCount.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4004
+  // Calculate number of iterations in original scalar loop.
+  // Note: Uses of OrigIterCount bellow should not be simplified as it will
+  // produce a different value. In other words: (A mod N) * B != (A*B) mod N
----------------
bel[l]ow

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4006
+  // produce a different value. In other words: (A mod N) * B != (A*B) mod N
+  const uint64_t OrigIterCount = OrigTakenCount / OrigFallThroughCount + 1;
+  // Calculate number of iterations in vector loop.
----------------
How about "OrigAverageTripCount"?

Explanation about its computation:
OrigAverageTripCount = (number of times header block was executed) / (number of times header was reached from pre-header == number of times latch exited)
 == (OrigTakenCount + OrigFallThroughCount) / OrigFallThroughCount
 == OrigTakenCount / OrigFallThroughCount + 1.
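
To make the derivation above concrete, here is a minimal sketch of the computation as a standalone helper. The function name and parameter names are hypothetical (they follow the renaming suggested earlier, not the patch itself), and the assert stands in for whatever zero check the patch ends up using.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch: estimates the average trip
// count of a loop from the two branch weights on its latch.
//   BackedgeTakenWeight: times the latch branched back to the header.
//   ExitWeight:          times the latch left the loop (once per invocation).
uint64_t estimateAverageTripCount(uint64_t BackedgeTakenWeight,
                                  uint64_t ExitWeight) {
  assert(ExitWeight != 0 && "profile says the loop was never exited");
  // Header executions = BackedgeTakenWeight + ExitWeight, so
  // (BackedgeTakenWeight + ExitWeight) / ExitWeight
  //   == BackedgeTakenWeight / ExitWeight + 1, also under integer division.
  return BackedgeTakenWeight / ExitWeight + 1;
}
```

For example, weights of 990 (backedge) and 10 (exit) describe 10 invocations that executed the header 1000 times in total, giving an average trip count of 100.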

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4008
+  // Calculate number of iterations in vector loop.
+  uint64_t VecIterCount = (OrigIterCount / (VF * UF));
+  // Calculate number of iterations for prolog/epilog loop.
----------------
How about VecAverageTripCount = OrigAverageTripCount / (VF * UF);

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4012
+  // Adjust number of iterations in vector and epilog loops if epilog
+  // iterations executed as part of the main loop.
+  if (PEIterCount != 0 && Cost->foldTailByMasking()) {
----------------
Just to clarify, maintaining branch frequencies through optimizations is best-effort and imprecise: a total weight that is not a multiple of VF*UF implies that the trip count of at least one invocation was not a multiple of VF*UF, not necessarily all of them. Only the sum of the trip counts is available, not their distribution.

Setting PEIterCount = 0 and VecAverageTripCount = round(OrigAverageTripCount / (VF*UF)) when Cost->foldTailByMasking() is probably the best that can be done. The former is redundant given that it applies to dead code, and the latter should perhaps apply to all cases, in general.
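
A rough sketch of the rounded division, in case it helps. Both helper names are hypothetical and not taken from the patch; rounding halves up is an assumption, and overflow of the intermediate sum is ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: unsigned division rounded to nearest, halves round up.
// Assumes Numerator + Denominator / 2 does not overflow.
uint64_t divideRoundToNearest(uint64_t Numerator, uint64_t Denominator) {
  assert(Denominator != 0 && "division by zero");
  return (Numerator + Denominator / 2) / Denominator;
}

// Hypothetical helper: average trip count of the vector loop, given the
// average trip count of the original scalar loop.
uint64_t estimateVecAverageTripCount(uint64_t OrigAverageTripCount,
                                     unsigned VF, unsigned UF) {
  return divideRoundToNearest(OrigAverageTripCount,
                              static_cast<uint64_t>(VF) * UF);
}
```

With VF=4 and UF=1, an original average of 10 gives 3 rather than the truncated 2, which is closer to what a masked vector loop actually executes.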

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4016
+    PEIterCount = 0;
+  }
+
----------------
There's also the special case of requiresScalarEpilogue() where 0 < PEIterCount <= VF*UF for each invocation of the loop, and hence the average is also strictly positive FWIW. But best keep the approximation general instead of trying to improve it, given general lack of information.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4023
+    VecTakenCount = (VecIterCount - 1) * OrigFallThroughCount;
+    VecFallThrough = OrigFallThroughCount;
+  }
----------------
This assumes the number of times the vector loop will be reached is equal to the number of times the original scalar loop was reached (OrigFallThroughCount). This holds if Cost->foldTailByMasking(), but otherwise invocations whose trip count < VF*UF will bypass the vector loop (and also == VF*UF if requiresScalarEpilogue()), plus other run-time guards.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:4032
+    PEFallThroughCount = OrigFallThroughCount;
+  }
+
----------------
Similar to the above comment, invocations whose trip count is a multiple of VF*UF will bypass the scalar remainder loop (with neither foldTailByMasking nor requiresScalarEpilogue), so in general PEFallThroughCount <= OrigFallThroughCount.
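
To illustrate why the entry counts depend on the distribution of trip counts and not just their sum, here is a small standalone model. It is only a sketch: the struct, function, and parameter names are hypothetical, Step stands for VF*UF, every trip count is assumed to be at least 1, and the other run-time guards (e.g. memory checks) are ignored.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type: how many invocations enter each loop.
struct EntryCounts {
  uint64_t VectorLoop = 0;
  uint64_t ScalarRemainder = 0;
};

// Hypothetical model, not LLVM code: given the trip count of every invocation
// of the original loop, count how many invocations reach the vector loop and
// how many reach the scalar remainder loop. Step is VF * UF.
EntryCounts countLoopEntries(const std::vector<uint64_t> &TripCounts,
                             uint64_t Step, bool FoldTailByMasking,
                             bool RequiresScalarEpilogue) {
  EntryCounts Result;
  for (uint64_t TC : TripCounts) {
    if (FoldTailByMasking) {
      // The masked vector loop handles everything; the remainder is dead.
      ++Result.VectorLoop;
      continue;
    }
    // Short invocations bypass the vector loop; with a required scalar
    // epilogue, TC == Step is bypassed as well.
    bool BypassVector = RequiresScalarEpilogue ? TC <= Step : TC < Step;
    if (!BypassVector)
      ++Result.VectorLoop;
    // The remainder runs whenever iterations are left over, and always
    // when a scalar epilogue is required.
    if (RequiresScalarEpilogue || TC % Step != 0)
      ++Result.ScalarRemainder;
  }
  return Result;
}
```

For trip counts {2, 4, 8, 10} and Step 4, this model has 3 of the 4 invocations entering the vector loop and 2 entering the remainder, so neither count equals the original entry count of 4.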

================
Comment at: llvm/test/Transforms/LoopVectorize/check-prof-info.ll:3
+; RUN: opt  -passes="print<block-freq>,loop-vectorize" -force-vector-width=4 -force-vector-interleave=1 -S < %s |  FileCheck %s
+; RUN: opt  -passes="print<block-freq>,loop-vectorize" -prefer-predicate-over-epilog=true -force-vector-width=4 -force-vector-interleave=1 -S < %s |  FileCheck %s -check-prefix=CHECK-MASKED
+
----------------
May want to also check with UF>1.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D67905/new/

https://reviews.llvm.org/D67905