[PATCH] D130618: [AArch64][LoopVectorize] Enable tail-folding by default for SVE

David Sherwood via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 3 05:34:19 PDT 2022


david-arm added a comment.

In D130618#3696283 <https://reviews.llvm.org/D130618#3696283>, @dmgreen wrote:

> Can you explain more about why reductions are a problem for certain cpus? What about the cortex-a510? And if it is being disabled for all these cpus, should it be disabled for -mcpu=generic too? I'm not sure why we would disable sve reductions though - first-order recurrences make more sense, but that might be something that is better done in general, not per-subtarget.
>
> And is tail-folding expected to be beneficial in general? As far as I can see it might currently be losing the interleaving, which can be important for performance. And it should ideally not be altering the NEON codegen, if that could be preferable. Is this currently one option that alters both scalable and fixed-width vectorization?

Hi @dmgreen, through benchmarking and performance analysis we discovered that on some cores (neoverse-v1, x2, a64fx) using tail-folding with reductions is significantly worse (e.g. 80-100% slower on some loops!) than using unpredicated vector loops. This was observed consistently across a range of loops, benchmarks and CPUs, although we don't know exactly why. Our best guess is that it's down to the chain of loop-carried dependencies in such loops, i.e. reduction PHI + scalar IV + loop predicate PHI. So it's absolutely critical that we avoid significant regressions for benchmarks that contain reductions or first-order recurrences, and this patch is a compromise: if you don't specify the CPU then we follow the architectural intent and always tail-fold for all loops, but when targeting CPUs with this issue we disable tail-folding for such loops.
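
To make that dependency chain concrete, here is a minimal sketch in SVE ACLE intrinsics of roughly the shape a tail-folded sum reduction takes (the function and names are purely illustrative, not the vectoriser's actual output):

  #include <arm_sve.h>
  #include <stdint.h>

  // Illustrative only: the rough shape of a tail-folded sum reduction.
  // Every iteration carries three values around the loop: the reduction
  // accumulator, the scalar induction variable and the governing predicate.
  float sum_tail_folded(const float *a, int64_t n) {
    svfloat32_t acc = svdup_n_f32(0.0f);       // reduction PHI
    int64_t i = 0;                             // scalar IV
    svbool_t pg = svwhilelt_b32(i, n);         // loop predicate PHI
    while (svptest_any(svptrue_b32(), pg)) {
      svfloat32_t v = svld1_f32(pg, &a[i]);    // masked load, no scalar tail
      acc = svadd_f32_m(pg, acc, v);           // predicated add
      i += svcntw();                           // step by a whole vector
      pg = svwhilelt_b32(i, n);                // compute the next predicate
    }
    return svaddv_f32(svptrue_b32(), acc);     // final horizontal reduction
  }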

In general, tail-folding is beneficial for reducing code size and mopping up the scalar tail, as well as following the intent of the architecture. For example, x264 in SPEC2k17 sees 6-7% performance improvements on neoverse-v1 CPUs due to the low trip counts in its hot loops.
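
By way of contrast, here is an illustrative sketch (again in ACLE intrinsics, with hypothetical names) of vectorisation without tail-folding: an unpredicated main loop plus a scalar epilogue. With a low trip count most of the work can land in the scalar tail, which is exactly what tail-folding eliminates:

  #include <arm_sve.h>
  #include <stdint.h>

  // Illustrative shape of vectorisation *without* tail-folding. With e.g.
  // n = 20 and 16 floats per vector on a 512-bit machine, 4 of the 20
  // iterations run in the scalar tail.
  void saxpy_unfolded(float *restrict y, const float *restrict x,
                      float a, int64_t n) {
    svbool_t all = svptrue_b32();
    int64_t vl = (int64_t)svcntw();             // elements per vector
    int64_t i = 0;
    for (; i + vl <= n; i += vl) {              // unpredicated main loop
      svfloat32_t vy = svld1_f32(all, &y[i]);
      svfloat32_t vx = svld1_f32(all, &x[i]);
      svst1_f32(all, &y[i], svmla_n_f32_m(all, vy, vx, a));  // y += a * x
    }
    for (; i < n; ++i)                          // scalar tail
      y[i] += a * x[i];
  }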

With regards to interleaving, the fundamental problem lies with how we do tail-folding in the loop vectoriser: we are forced to decide whether or not to tail-fold before we've calculated any loop costs, so the decision cannot be driven by the cost model. Enabling tail-folding then has knock-on effects, because suddenly your loops become predicated and the costs change accordingly. For example, NEON does not support masked interleaved memory accesses, so enabling tail-folding leads to prohibitively high fixed-width VF costs. At the same time the loop vectoriser does not yet support vectorising interleaved memory accesses for scalable vectors, so we end up in a situation where the vectoriser decides not to vectorise at all! Whereas if we don't enable tail-folding we will vectorise using a fixed-width VF and use NEON's ld2/st2/etc instructions, which is often still faster than a scalar loop (see the example below).

In the long term we would like to change the loop vectoriser to consider a matrix of costs, with vectorisation style on one axis and VF on the other, and then choose the best entry in that matrix. But this is a non-trivial piece of work, so in the short term we opted for this temporary solution.
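
For reference, here is a hypothetical example of the kind of interleave-group loop described above; without tail-folding NEON handles it with ld2/st2, but with tail-folding neither NEON nor scalable vectors can currently take it:

  #include <stddef.h>

  // A stride-2 interleave group. Without tail-folding NEON vectorises this
  // with ld2/st2; with tail-folding it would need masked interleaved
  // accesses, which NEON lacks, and the vectoriser cannot yet create
  // interleaved accesses for scalable vectors either.
  void scale_complex(float *data, float re, float im, size_t n) {
    for (size_t i = 0; i < n; ++i) {
      data[2 * i + 0] *= re;   // real lane of the ld2/st2 pair
      data[2 * i + 1] *= im;   // imaginary lane
    }
  }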


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130618/new/

https://reviews.llvm.org/D130618


