[PATCH] D109368: [LV] Don't vectorize if we can prove RT + vector cost >= scalar cost.

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Sep 29 01:49:43 PDT 2021


dmgreen added inline comments.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:449
+                      ProfileSummaryInfo *PSI, GeneratedRTChecks &RTChecks,
+                      ElementCount MinProfTripCount)
+      : OrigLoop(OrigLoop), PSE(PSE), LI(LI), DT(DT), TLI(TLI), TTI(TTI),
----------------
fhahn wrote:
> ebrevnov wrote:
> > dmgreen wrote:
> > > Can this have a default value, to prevent the need for multiple constructors?
> > Since 'minimum profitable trip count' is part of VectorizationFactor and InnerLoopVectorizer should know both VecWidth and MinProfTripCount I would suggest passing that information as single VectorizationFactor argument.
> > Can this have a default value, to prevent the need for multiple constructors?
> 
> unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.
> 
> > Since 'minimum profitable trip count' is part of VectorizationFactor and InnerLoopVectorizer should know both VecWidth and MinProfTripCount I would suggest passing that information as single VectorizationFactor argument.
> 
> That would be good, but unfortunately I think the epilogue vectorizer instantiation only has access to an ElementCount for now :( It can be threaded through as a follow-up.
> unfortunately that's not possible with the currently available constructors. But I made it a required argument and updated the callers to avoid the extra constructor.

I was expecting it to use `= ElementCount()`, but this sounds OK too. Is the InnerLoopUnroller value deliberately 1, or would passing zero be better? I imagine it doesn't make much difference in practice.


================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:8226
+    //  The total cost of the vector loop is
+    //    RtC + VecC * (TC / VF) + EpiC
+    //  where
----------------
fhahn wrote:
> ebrevnov wrote:
> > dmgreen wrote:
> > > This should be `RtC + VecC * floor(TC / VF) + EpiC` or `RtC + VecC * ceil(TC / VF)` when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless it makes the assumption that those round to the same thing.
> > > 
> > > It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).
> > > 
> > > And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.
> > > This should be `RtC + VecC * floor(TC / VF) + EpiC` or `RtC + VecC * ceil(TC / VF)` when folding the tail (assuming there is no epilogue then).
> > Since we already compute an upper estimate for MinTC1, it doesn't seem necessary to do an additional adjustment when folding the tail. But we probably want/need an adjustment for the "requiresScalarEpilogue" case.
> > This should be RtC + VecC * floor(TC / VF) + EpiC or RtC + VecC * ceil(TC / VF) when folding the tail (assuming there is no epilogue then). That makes the math more difficult unless it makes the assumption that those round to the same thing.
> 
> I added a statement below about the fact that the computations are performed on doubles and later rounded up, giving an upper bound estimate as @ebrevnov suggested. Do you think that's sufficient?
> 
> 
> > It is also assuming that the runtime checks do not fail, otherwise a probability factor would need to be added. (Probably fine to assume, but the more runtime checks there are, the less likely they are to all succeed).
> 
> Yep, that's a fundamental assumption at the moment. Unfortunately I cannot think of a good way to estimate the probability of the checks passing. If we assign a fixed probability per runtime check, we will likely end up with a hard limit like the one we have at the moment, just expressed differently.
> 
> The main motivation of `MinTC2` below is to limit the impact of a (large) number of runtime checks: to avoid evaluating expensive runtime checks for short-running loops (at the moment we allow failing runtime checks to increase total runtime by up to 10%, but this could also be lower).
> 
> While there might be additional cases where failing runtime checks cause an increase in runtime, the same problem already exists even with the hard-coded limit we have at the moment.
> 
> We could also change the way we emit runtime checks slightly and break them up across multiple blocks with earlier exits, to increase the chances that we do not have to evaluate all runtime checks if some fail.
> 
> > And there can be other costs for vectorizations. The "check N is less than MinProfTripCount" isn't free and there can be other inefficiencies from vector loops.
> 
> Agreed, this can be an unfortunate side effect. Again, this is a problem we are already hitting, and this patch will add a few more vectorized loops. But I think in general the impact of this patch on the number of vectorized loops should be relatively small (for SPEC2006/SPEC2017/MultiSource, ~1% more loops are vectorized). And unfortunately I think there's not much we can do to avoid this check in general.
> 
> One follow-up I think that becomes important is to make sure that we try to use PGO to detect cases where we create dead vector loops and skip vectorizing them.
As far as I understand (correct me if I'm wrong!) we are essentially changing from code that looked like:
```
if (N >= VF) {
  if (!runtimechecks)
    goto scalar loop
  vector loop; n -= VF
}
scalar loop
```
To the same code, but with a different initial guard value and potentially more runtime checks in place:
```
if (N >= MinProfitableTripCount) {
  if (!runtimechecks)
    goto scalar loop
  vector loop; n -= VF
}
scalar loop
```

That means that if we under-estimate MinProfitableTripCount we go into the runtime checks/vector loop, potentially executing a lot of expensive runtime checks where it is not profitable.
If we _over_-estimate MinProfitableTripCount then at runtime we will not execute the vector code, falling back to the scalar loop. So we have generated larger, less efficient code whose vector part never executes, even when it would have been profitable to do so.

So we end up in the unfortunate place where either over or under estimating the cost can lead to inefficiencies.

I'm not too worried about the details here. They sound fine for the most part, so long as they are close enough. I'm more worried about the cost of the runtime checks being over-estimated because they are costed before being simplified. I think that is where the worst regressions I am seeing from this patch come from: loops where vector code was previously generated and executed are now skipped over. Unfortunately, loops with lowish trip counts are common in a lot of code :)

The code in LoopVectorizationCostModel::isMoreProfitable already talks about the cost in terms of `PerIterationCost*ceil(TripCount/VF)` vs `PerIterationCost*floor(TC/VF)` though, and I would recommend describing things in the same way here, explaining that `RtC + VecC * (TC / VF) + EpiC` is a simplification of that. 


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109368/new/

https://reviews.llvm.org/D109368


