[PATCH] D120912: [AArch64][SVE] Convert gather/scatter with a stride of 2 to contiguous loads/stores

Thu Mar 10 06:05:20 PST 2022

kmclaughlin added a comment.

Hi @rscottmanley, thank you for taking a look at this patch.

In D120912#3364546 <https://reviews.llvm.org/D120912#3364546>, @rscottmanley wrote:

> - 64b scatters being as fast/faster than either contiguous stores or st2 is amazing if true, but is that always going to be true for all SVE targets? I'm guessing some sort of "preferScatter()" or "preferGather()" on a per target basis will (eventually?) be needed for this, but I don't have access to different SVE capable chips.

We expect the contiguous sequence to be faster on Neoverse-v1 for 32b/64b gathers and 64b scatters with a stride of two, so those are the cases we're currently optimising for. You're right that it makes sense to tune this per subtarget, and I've now added that to the patch.

In D120912#3364546 <https://reviews.llvm.org/D120912#3364546>, @rscottmanley wrote:

> - I understand why the contiguous sequence is needed over ld2/st2, but what about when you know it is safe to use ld2/st2? If the contiguous sequence you have here is indeed faster, would existing combines that match to ld2/st2 be combined to this instead?
> - Should this combine apply to the "odd" elements which are also stride2 { 1, 3, 5, 7...}?
> - Does this change still allow a match to ld2/st2 when all the elements are accessed by a **pair** of gathers/scatters?

This approach is largely a stop-gap until we have proper (de)interleaving in the loop vectoriser, or until the InterleavedAccess pass supports scalable vectors. We would expect those transformations to use an appropriate cost-model to decide whether to use gathers/scatters, ld2/st2 or explicit (de)interleaving intrinsics. All of these transformations would happen before legalisation, so if a gather reaches this point, that is either because it wasn't safe to use ld2/st2 or because the cost-model argued against it. That makes this a last-resort DAGCombine fold to improve code quality. Until the passes mentioned above have proper (de)interleaving support, this mechanism will at least improve performance on the subtargets where it is enabled.
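
For anyone following along, here is a scalar C++ model of the equivalence the fold relies on: a stride-2 gather produces the same lanes as two contiguous loads followed by an unzip of the even elements. This is only a sketch, not the code in the patch. The fixed lane count `kLanes` and the helper names are made up for illustration, and the model assumes all `2 * kLanes` elements are safe to read, which a real lowering would have to guarantee with predication.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed lane count; real SVE vectors are scalable.
constexpr std::size_t kLanes = 4;
using Vec = std::array<long, kLanes>;

// Reference semantics: lane i reads base[2 * i] (a gather with stride two).
Vec gatherStride2(const long *base) {
  Vec result{};
  for (std::size_t i = 0; i < kLanes; ++i)
    result[i] = base[2 * i];
  return result;
}

// One contiguous load of kLanes elements.
Vec loadContiguous(const long *base) {
  Vec result{};
  for (std::size_t i = 0; i < kLanes; ++i)
    result[i] = base[i];
  return result;
}

// Even-indexed elements of the concatenation lo:hi (uzp1-style).
Vec unzipEven(const Vec &lo, const Vec &hi) {
  Vec result{};
  for (std::size_t i = 0; i < kLanes; ++i) {
    std::size_t j = 2 * i;
    result[i] = j < kLanes ? lo[j] : hi[j - kLanes];
  }
  return result;
}

// Assumption: base[0 .. 2 * kLanes - 1] is readable.
Vec gatherViaContiguous(const long *base) {
  return unzipEven(loadContiguous(base), loadContiguous(base + kLanes));
}
```

The "odd" case { 1, 3, 5, 7... } from the question above would be the same model with `base + 1` as the starting address, subject to the same readability assumption.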

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120912/new/

https://reviews.llvm.org/D120912