[PATCH] D115713: [LV] Don't apply "TinyTripCountVectorThreshold" for loops with compile time known TC.

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sun Nov 6 13:21:30 PST 2022


I've been taking a look at the example that is getting a lot worse. There are certainly some issues with the code generation being non-optimal, but even after a lot of optimizations it looks like it will always be worse than the scalar version. There is a lot of predications and fairly efficient scalar instructions like BFI, which makes accurate cost modelling difficult. There's a lot of setup and spilling too, which is going to hurt for small trip counts. I think for MVE it would make sense to have a way for the target to put a limit on the minumum trip count.
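
To illustrate the kind of limit I mean, here is a small standalone sketch (the names and the threshold value are made up for illustration; this is not the existing LLVM API, only an analogue of what a hook like getMinTripCountTailFoldingThreshold does) of gating the decision on a per-target minimum trip count:

  // Standalone sketch: gate vectorization of a loop with a known-small trip
  // count on a per-target minimum. Hypothetical names, illustrative threshold.
  #include <cstdio>
  #include <optional>

  struct TargetInfo {
    // Below this trip count the target would rather stay scalar.
    unsigned MinVectorTripCount = 8; // illustrative value only
  };

  static bool shouldConsiderVectorizing(std::optional<unsigned> ConstTripCount,
                                        const TargetInfo &TTI) {
    if (!ConstTripCount)
      return true; // unknown trip count: leave it to the cost model
    return *ConstTripCount >= TTI.MinVectorTripCount;
  }

  int main() {
    TargetInfo TTI;
    std::printf("TC=4:  %d\n", shouldConsiderVectorizing(4u, TTI));  // 0
    std::printf("TC=16: %d\n", shouldConsiderVectorizing(16u, TTI)); // 1
  }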

I think @fhahn also mentioned that he had some AArch64 examples where the same is true. I'm not sure how widely useful this would be in general. The vectorizer's handling of small trip count loops is not amazing, considering that many such loops will already have been fully unrolled. It doesn't come up a huge amount, and a lot of the cost modelling currently assumes any extra setup costs will be dominated by the loop.
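
For example (a generic illustration, not the test case from this review), a loop like the one below has such a small constant trip count that it is usually fully unrolled long before the loop vectorizer gets to make a decision:

  // Tiny constant trip count; at -O2/-O3 this is typically fully unrolled
  // (and possibly SLP-vectorized) rather than loop-vectorized.
  void add4(int *__restrict a, const int *__restrict b) {
    for (int i = 0; i < 4; ++i)
      a[i] += b[i];
  }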



================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:9870-9875
+      // If the trip count is a compile-time constant, there is no extra
+      // overhead from vectorizing with a scalar epilogue. Let the cost model
+      // decide on profitability in that case.
+      auto TC = SE.getSmallConstantTripCount(L);
+      if (!TC) {
+        if (*ExpectedTC <= TTI->getMinTripCountTailFoldingThreshold()) {
----------------
ebrevnov wrote:
> dmgreen wrote:
> > ebrevnov wrote:
> > > paulwalker-arm wrote:
> > > > I don't quite understand this change. The whole point of `getMinTripCountTailFoldingThreshold()` was to give targets control over this behaviour based on their understanding of how the cost model has been implemented.
> > > > 
> > > > Admittedly this was in part due to the immaturity of the cost modelling, but this change essentially removes that flexibility, to the point where there's no value in keeping `getMinTripCountTailFoldingThreshold()`?
> > > > 
> > > > If your previous patches improve the cost model in this regard then I'd rather `getMinTripCountTailFoldingThreshold()` be removed.  That said, @dtemirbulatov can you help here in ascertaining if this option is still required based on this new patch series?
> > > I think you are missing one thing. Currently we only allow vectorization of short trip count loops if there is no run-time overhead and the tail can be folded. The introduction of getMinTripCountTailFoldingThreshold put additional limitations on vectorization of such loops. This change enables vectorization of short trip count loops with a scalar epilogue ONLY if the trip count is a compile-time constant. getMinTripCountTailFoldingThreshold is still applicable if we decide to vectorize by folding the tail.
> > > 
> > > In other words, this change enables exactly the same vectorization strategy for loops with a small constant trip count as for loops with an unknown or large trip count.
> > > 
> > > It's a question whether the getMinTripCountTailFoldingThreshold threshold should be taken into account if we decide to fold the tail at step 5) for a short trip count loop. Possibly yes... I think this is your original concern, right?
> > Are you assuming that the loop will be unrolled after vectorization, and that unrolling will leave no extra overhead? Scalable vector loops cannot (currently) be unrolled.
> Actually no. Even if we keep a vector loop which runs just 1 or 2 iterations, I don't see any additional overhead compared to the scalar loop. At least not at the IR level. Is there any overhead specific to MVE that is not visible at the IR level?
getMinTripCountTailFoldingThreshold was added for SVE; it is not (currently) used by MVE.

There will be a certain amount of overhead just from having a second loop, let alone all the extra overhead from the gubbins around it: https://godbolt.org/z/s8xEf7d6K. None of that is particularly MVE-specific, although as MVE is designed for small, energy-efficient cores it may feel the overheads more than other architectures would.
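
Roughly, for a known trip count of 7 and VF=4, the output is structurally equivalent to the hand-written sketch below (this only shows the shape of the second loop; it is not the actual code from the godbolt link, and the real IR also carries the minimum-iteration checks and vector setup):

  // Sketch of vector body + scalar epilogue for trip count 7, VF=4.
  void saxpy7(float *__restrict y, const float *__restrict x, float a) {
    int i = 0;
    // Vector body: runs exactly once for TC=7, VF=4.
    for (; i + 4 <= 7; i += 4)
      for (int j = 0; j < 4; ++j)   // stands in for one 4-wide vector operation
        y[i + j] += a * x[i + j];
    // Scalar epilogue: the second loop, handling the remaining 3 elements.
    for (; i < 7; ++i)
      y[i] += a * x[i];
  }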


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D115713/new/

https://reviews.llvm.org/D115713
