[PATCH] D130618: [AArch64][LoopVectorize] Enable tail-folding by default for SVE

Dave Green via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 4 07:25:02 PDT 2022


dmgreen added a comment.

> Hi @dmgreen, through benchmarking and performance analysis we discovered that on some cores (neoverse-v1, x2, a64fx) if you use tail-folding with reductions the performance is significantly worse (i.e. 80-100% slower on some loops!) than using unpredicated vector loops. This was observed consistently across a range of loops, benchmarks and CPUs, although we don't know exactly why. Our best guess is that it's to do with the chain of loop-carried dependencies in the loops, i.e. reduction PHI + scalar IV + loop predicate PHI. So it's absolutely critical that we avoid significant regressions for benchmarks that contain reductions or first-order recurrences, and this patch is a sort of compromise. If you don't specify the CPU then we will follow architectural intent and always tail-fold for all loops, but when targeting CPUs with this issue we disable tail-folding for such loops.
>
> In general, tail-folding is beneficial for reducing code size and mopping up the scalar tail, as well as following the intentions of the architecture. For example, x264 in SPEC2k17 sees 6-7% performance improvements on neoverse-v1 CPUs due to the low trip counts in hot loops.
>
> With regards to interleaving, the fundamental problem lies with how we do tail-folding in the loop vectoriser, which forces us to decide whether or not to use tail-folding before we've calculated any loop costs. Enabling tail-folding has consequences because suddenly your loops become predicated and the costs change accordingly. For example, NEON does not support masked interleaved memory accesses, so enabling tail-folding leads to insane fixed-width VF costs. At the same time the loop vectoriser does not support vectorising interleaved memory accesses for scalable vectors either, so we end up in a situation where the vectoriser decides not to vectorise at all! Whereas if we don't enable tail-folding we will vectorise using a fixed-width VF and use NEON's ld2/st2/etc instructions, which is often still faster than a scalar loop. Ultimately, in the long term we would like to change the loop vectoriser to consider a matrix of costs, with vectorisation style on one axis and VF on the other, and then choose the optimal cost in that matrix. But this is a non-trivial piece of work, so in the short term we opted for this temporary solution.
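
Hello. For concreteness, the kind of loop I take the reduction concern to be about is roughly the following (a minimal sketch of my own, not one of the loops from the benchmarks mentioned above):

  // A simple fadd reduction. The accumulator and the induction variable are
  // already loop-carried; tail-folding adds the loop predicate as a third
  // loop-carried value (reduction PHI + scalar IV + loop predicate PHI).
  float sum_f32(const float *a, long n) {
    float s = 0.0f;
    for (long i = 0; i < n; ++i)
      s += a[i];
    return s;
  }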

That sounds like the loops need unrolling - that's common in order to get the best ILP out of a core, especially for in-order cores, but it is important for out-of-order cores too. Something like the Neoverse-N2 has 2 FP/ASIMD pipes, and the Neoverse-V1 has 4; to keep them busy you need to unroll small loops. That's not limited to reductions either - it applies to any small loop as far as I understand. This is the reason we use a MaxInterleaveFactor of at least 2 for all cores in LLVM, and some set it higher. I don't think that changes with SVE: the vectoriser still needs to unroll the loop body, preferably with a predicated epilogue.
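
To give a concrete picture of what I mean by unrolling (a hand-written sketch, not what the vectoriser literally emits - it gets the same effect by interleaving with IC >= 2): splitting the reduction across independent accumulators breaks the single serial chain of adds, so more than one FP/ASIMD pipe can be kept busy.

  // Two independent partial sums; the adds no longer form one serial
  // dependency chain. A compiler needs reassociation (e.g. -ffast-math)
  // to be allowed to do this to floating-point code itself.
  float sum_f32_unrolled(const float *a, long n) {
    float s0 = 0.0f, s1 = 0.0f;
    long i = 0;
    for (; i + 1 < n; i += 2) {
      s0 += a[i];
      s1 += a[i + 1];
    }
    for (; i < n; ++i) // scalar remainder
      s0 += a[i];
    return s0 + s1;
  }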

It is true that sometimes this extra unrolling is unhelpful for loops with small trip counts, but I was hoping that the predicated epilogue would help with that. I thought that was the plan? What happened to making an unrolled/unpredicated body with a predicated remainder? There will be loops that are large enough that we do not unroll; those probably make sense to predicate, so long as the performance is good (it will depend on the core), and the same goes for -Os. For MVE it certainly made sense once we had worked through fixing all the performance degradations we could find, but those are very different cores which pay more heavily for code-structure inefficiencies, and they (almost) never need the unrolling.
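
To spell out the structure I was expecting, in scalar form (the vectoriser would use a wide, unpredicated, interleaved vector body here, and a predicated SVE iteration in place of the scalar remainder loop - the chunk size of 8 is just for illustration):

  void add_with_remainder(float *dst, const float *src, long n) {
    long i = 0;
    // Main body: unpredicated and unrolled, e.g. 2 x VL elements per
    // iteration once vectorised and interleaved.
    for (; i + 8 <= n; i += 8)
      for (long j = 0; j < 8; ++j)
        dst[i + j] += src[i + j];
    // Remainder: the part that would become the predicated (tail-folded)
    // epilogue, executing at most one more vector iteration.
    for (; i < n; ++i)
      dst[i] += src[i];
  }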

For AArch64 the interleaving count will be based on the cost of the loop, among other things. Maybe the decision to tail-fold should be based on the cost of the loop too? Does tail-folding really mean we cannot unroll?
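
For what it's worth, the two don't look mutually exclusive to me: hand-written SVE code can be both tail-folded and unrolled, with each unrolled part carrying its own predicate. An ACLE sketch of my own (assuming arm_sve.h and an SVE-enabled compile; whether the vectoriser can be taught to produce this shape is exactly the question):

  #include <arm_sve.h>

  // Tail-folded and unrolled by two: each half of the body has its own
  // whilelt predicate, so the final partial iterations are simply masked off.
  void add_tail_folded_unrolled(float *dst, const float *src, long n) {
    long vl = (long)svcntw(); // 32-bit lanes per vector
    for (long i = 0; i < n; i += 2 * vl) {
      svbool_t p0 = svwhilelt_b32(i, n);
      svbool_t p1 = svwhilelt_b32(i + vl, n);
      svfloat32_t s0 = svld1_f32(p0, src + i);
      svfloat32_t s1 = svld1_f32(p1, src + i + vl);
      svfloat32_t d0 = svld1_f32(p0, dst + i);
      svfloat32_t d1 = svld1_f32(p1, dst + i + vl);
      svst1_f32(p0, dst + i, svadd_f32_m(p0, d0, s0));
      svst1_f32(p1, dst + i + vl, svadd_f32_m(p1, d1, s1));
    }
  }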


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D130618/new/

https://reviews.llvm.org/D130618


