[llvm] [AArch64][LoopVectorize] Enable tail-folding on neoverse-v2 (PR #135357)
David Green via llvm-commits
llvm-commits at lists.llvm.org
Mon Apr 14 00:24:46 PDT 2025
davemgreen wrote:
I should have given more details, but there are quite a few things going on here, and some of them could have changed since we last looked. As far as I understand, the vectorizer still decides very early whether to tail fold, and if we return true from preferPredicateOverEpilogue we essentially force the vectorizer to predicate every loop it sees. For AArch64 I believe this currently still means forcing scalable vectorization (as masked loads/stores are not given a cheap cost for fixed-length vectors), and it forces the interleave factor to 1, as it is difficult to predicate loops well that are also unrolled.
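To make that concrete, here is a minimal hand-written sketch of the shape tail folding produces, using SVE ACLE intrinsics (the saxpy-style loop and the function name are just illustrative, not vectorizer output):

```c
#include <arm_sve.h>
#include <stdint.h>

// Fully tail-folded: one predicated loop covers all n elements,
// so there is no scalar or vector epilogue.
void saxpy_folded(float *restrict y, const float *restrict x,
                  float a, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);         // active lanes: i + lane < n
    svfloat32_t vx = svld1(pg, &x[i]);         // masked load
    svfloat32_t vy = svld1(pg, &y[i]);
    svst1(pg, &y[i], svmla_x(pg, vy, vx, a));  // y[i] += a * x[i], masked
  }
}
```

Every iteration recomputes the governing predicate and retires only one vector's worth of work, which is where the interleave-factor-1 restriction bites for high trip counts.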
Recently, the tuning for Neoverse V2 has been pushed in the opposite direction: preferring fixed-width over scalable vectors when the costs are equal (#95819), and allowing larger vector bodies to make use of all the vector pipelines available on the V2 (#100385).
Tail predication has some efficiency benefits of its own, especially for low-trip-count loops that are called often, but it makes it difficult to get the most out of the hardware for loops with high trip counts. Saturating 4 vector pipelines sometimes requires interleaving, and making sure that the predication does not become a bottleneck. So whilst this might help on certain benchmarks, it can hurt in other domains like ML, HPC and DSP. (We know, for example, that x264 has certain low-trip-count loops that can be helped by forcing the vectorizer to pick a lower vectorization factor.) Currently there are heuristics to disable tail folding for reductions, small loops and a few other cases, but AFAIU the problem isn't really predication+reductions or predication+small loops; it is that interleaving can be so important for performance in these loops.
The way GCC approaches this is to generate a fast unpredicated loop with some interleaving, and to use a predicated remainder to handle the tail (sketched below). In VPlan this would mean generating multiple vplans, with and without predication, and costing them against one another. So long as it had a way to detect bottlenecks in the loop, the vectorizer should then be able to produce the big unpredicated vector body with a predicated remainder where that is beneficial, and otherwise choose to tail fold where that is more efficient. This requires the loop vectorizer not to opt into tail predication so early, which might still require some quite major surgery.
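For contrast, a rough sketch of that GCC-style shape, under the same illustrative assumptions as the example above:

```c
#include <arm_sve.h>
#include <stdint.h>

void saxpy_unrolled(float *restrict y, const float *restrict x,
                    float a, int64_t n) {
  const int64_t vl = (int64_t)svcntw();
  const svbool_t all = svptrue_b32();
  int64_t i = 0;
  // Unpredicated main body, interleaved x2: two independent chains and
  // no per-iteration predicate generation.
  for (; i + 2 * vl <= n; i += 2 * vl) {
    svst1(all, &y[i],
          svmla_x(all, svld1(all, &y[i]), svld1(all, &x[i]), a));
    svst1(all, &y[i + vl],
          svmla_x(all, svld1(all, &y[i + vl]), svld1(all, &x[i + vl]), a));
  }
  // Predicated remainder mops up the final (< 2*VL) elements.
  for (; i < n; i += vl) {
    svbool_t pg = svwhilelt_b32(i, n);
    svst1(pg, &y[i], svmla_x(pg, svld1(pg, &y[i]), svld1(pg, &x[i]), a));
  }
}
```

The main body can saturate multiple vector pipes, at the cost of a second loop and some code size, while low-trip-count calls mostly run the predicated remainder.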
So maybe the tuning is just right for the V2 and this doesn't conflict with #95819 and #100385, but I worry that it will currently make some cases better and some (important, high-trip-count) cases worse, limiting top-end performance when we want things to go as fast as they can.
https://github.com/llvm/llvm-project/pull/135357