[llvm] [LV][AArch64]: Utilise SVE ld4/st4 instructions via auto-vectorisation (PR #89018)

Sun Apr 21 12:27:33 PDT 2024

davemgreen wrote:

The loop vectorizer will produce costs via getInterleavedMemoryOpCost so should be fine as far as I understand. If there are no combines later on (either uncosted in instcombine or costed in vector-combine) that work with vector.interleave/vector.deinterleave then they can break the canonical patterns that the backend is expecting to generate ld2/ld4 from. I'm hoping that if we can move to interleave/deinterleave, that should fix some of the problems we have at the moment.

I have recently been adding costs for the existing shuffles we find for fixed length vectors, in an attempt to reduce the number of times we break apart the load+shuffle (or store+shuffle), and have to either attempt to repair it or fall back to worse generation in the backend. I would say that in general costing for single-instructions is fine, two instructions making a pattern (like shuffle(load) or store(shuffle)) are do-able but start to get unreliable, and three-instruction plus becomes difficult to cost well.

https://github.com/llvm/llvm-project/pull/89018