[PATCH] D109368: [LV] Don't vectorize if we can prove RT + vector cost >= scalar cost.

Mon Sep 13 03:26:24 PDT 2021

ebrevnov added a comment.

I realized that doing similar thing in the cost model requires a bit of preparational work (about 6 patches). Due to that I think it may be reasonable to land this first. Please find my comments inlined. One thing I would like to ask you. In order to simplify merging with the mentioned 6 patches would be nice if you rebase your work on D109443 <https://reviews.llvm.org/D109443> (hopefully can be landed quickly) and take D109444 <https://reviews.llvm.org/D109444> (except line 10182) as part of this change.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8187
   // Check if it is profitable to vectorize with runtime checks.
+  if (!ForceVectorization && SelectedVF.Width.getKnownMinValue() > 1) {
+    if (auto ExpectedTC = getSmallBestKnownTC(*PSE.getSE(), OrigLoop)) {
----------------
Would it be better to use !SelectedVF.Width.isScalar() instead. That will make it obvious that RTCost should be applied to vector loops only.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8188
+  if (!ForceVectorization && SelectedVF.Width.getKnownMinValue() > 1) {
+    if (auto ExpectedTC = getSmallBestKnownTC(*PSE.getSE(), OrigLoop)) {
+      InstructionCost RTCost = Checks.getCost(CM);
----------------
In general, I would like us to agree on the strategy to follow when number of iterations is not known at compile time.
Currently, I see the following inconsistency in use of getSmallBestKnownTC. On the one hand, it may return upper bound estimate (SE.getSmallConstantMaxTripCount) which may greatly overestimate real trip count.  On the over hand, it returns None if SE.getSmallConstantMaxTripCount didn't manage to get an estimate and we will end up skipping the check entirely.
I see 3 possible ways to follow in the case of unknow trip count:
1) Don't check RT overhead if trip count is unknown at compile time. This is most conservative solution.
2) Assume maximum possible number of iterations. In practice that will essentially give the same result as 1) (because RTCost / double(*ExpectedTC) would be less than 1 ). This is essentially what this patch does. In addition we would need to take max value for None as well
3) Assume some fixed reasonable average number of iterations (may be extended to a more complex heuristic in future).

While in theory 3) is more general and may give better estimate it mat require panful tuning. Given that I would vote for 1) at this moment as the most conservative.

Note also regardless of chosen way to go we should decide how to categorize trip count deduced from profile. I personally think we should consider it same way as known trip count.

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8191
+      // The total scalar cost is ScalarCost * ExpectedTC and the total vector
+      // cost is (VectorCost / Width) * ExpectedTC. To avoid dividing by a small
+      // number, we multiply ScalarCost * Width instead. To avoid multiplying
----------------
In general case , total vector cost is "PrologCost + VectorCost*(ExpectedTC/Width) + EpilogCost, where (ExpectedTC/Width) is an integer division. RTCost is just a part of prolog cost. I think it is worth mentioning. Then we should explain why/how it reduces to "RTCost + VectorCost* (ExpectedTC/Width)". Thus we end up with the check "ScalarCost * ExpectedTC <= RTCost + VectorCost* (ExpectedTC/Width)". Division by  ExpectedTC (using FP) makes sense to me and should give acceptable accuracy. Multiplication by 'Width' may give up to 'Width - 1' error. For that reason I would avoid doing multiplication by 'Width'. Thus we end up with "ScalarCost <= RTCost/(double)ExpectedTC + VectorCost/Width".

There are still two cases not taken into account (foldTailByMasking() and equiresScalarEpilogue()) but I think it's OK for this type of check.

Makes sense?

================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8200
+                          << ScalarCost << ") <= runtime check + vector cost ("
+                          << (RTCost / double(*ExpectedTC) + SelectedVF.Cost)
+                          << ")\n");
----------------
I would suggest keeping result in a variable to avoid coping the (possibly non-trivial) formula.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109368/new/

https://reviews.llvm.org/D109368