[PATCH] D128342: [LoopVectorize] Disable tail-folding when masked interleaved accesses are unavailable

Tue Jun 28 08:37:32 PDT 2022

david-arm added a comment.

In D128342#3615721 <https://reviews.llvm.org/D128342#3615721>, @dmgreen wrote:

> In terms of MVE - A VLD2/VLD4 cannot be predicated so in that regards we do not support "MaskedInterleavedAccesses". There is code in canTailPredicateLoop that attempts to get that right. Any other interleaving group width will be emulated with a gather/scatter though, which can happily be masked.
> https://godbolt.org/z/KzvEqz439
>
> For SVE my understanding is that LD2/LD3/LD4 can be predicated, and other widths (and current codegen as interleaving is not yet supported) will use gather/scatter which can be masked. In the long run they may have MaskedInterleavedAccesses returning true.
>
> Is the problem that we are trying to use one variable for both Neon and SVE vectorization, where SVE prefers folding the tail, and NEON will need not to?

Hi @dmgreen, when I first discovered this issue it gave me a headache! The fundamental problem here is that we don't create a matrix of vplan costs with vectorisation style (tail-folding, vector loop + scalar epilogue, etc.) as one axis and VF as the other. This means we cannot look at all possible permutations and choose the lowest cost combination, so our only option is to decide up-front whether or not to use tail-folding before we've calculated any costs.

We currently don't have support for interleaving with SVE using ld2/st2/etc because we cannot rely upon shufflevector in the same way that fixed-width vectorisation does. This means that if we choose to tail-fold a loop with masked interleaved accesses the costs for both NEON and SVE will be so high that we will not vectorise at all. Whereas if we didn't choose to tail-fold we'd actually end up vectorising the interleaved memory accesses using NEON's ld2/st3, which is still faster than a scalar loop. So what I'm trying to do here is avoid causing regressions when enabling tail-folding for any target that does not support masked interleaved memory accesses. If at some point in the future we can support them for SVE then useMaskedInterleavedAccesses will return true.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128342/new/

https://reviews.llvm.org/D128342