[PATCH] D120912: [AArch64][SVE] Convert gather/scatter with a stride of 2 to contiguous loads/stores

Mon Mar 7 09:59:57 PST 2022

rscottmanley added a comment.

I find the performance claims interesting. I've asked SVE hw engineers before which is faster (gather vs. load and shuffle vs. ld2), and the answer was essentially "depends on your loop". If the shuffles use up ports that could otherwise be used for computation, it seems like this would be a loser. If that analysis is correct, then IMO this decision should be made in LV and the backend should honor the gather. However, if it stays in the backend, I still have some comments:

- 64b scatters being as fast as or faster than either contiguous stores or st2 is amazing if true, but is that always going to be true for all SVE targets? I'm guessing some sort of "preferScatter()" or "preferGather()" on a per-target basis will (eventually?) be needed for this, but I don't have access to different SVE-capable chips.

- I understand why the contiguous sequence is needed over ld2/st2, but what about when you know it is safe to use ld2/st2? If the contiguous sequence you have here is indeed faster, should the existing combines that match ld2/st2 produce this sequence instead?

- Should this combine also apply to the "odd" elements, which are stride 2 as well { 1, 3, 5, 7... }?

- Does this change still allow a match to ld2/st2 when all the elements are accessed by a **pair** of gathers/scatters?
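
To make the lane arithmetic behind the last few questions concrete, here is a minimal scalar sketch in Python. It is only a model of the element selection, not the patch's lowering and not real SVE code: all function names (`gather_stride2`, `contiguous_load`, `uzp1`, `uzp2`, `ld2_model`) are hypothetical, the vector length of 4 lanes is an assumption, and predication and memory-safety concerns are ignored. It shows that the even lanes { 0, 2, 4, 6 } and the odd lanes { 1, 3, 5, 7 } can both be recovered from the same two contiguous loads, and that a pair of stride-2 gathers covers the same elements as one de-interleaving load.

```python
def gather_stride2(mem, start, lanes):
    """Model of a stride-2 gather: lane i reads mem[start + 2*i]."""
    return [mem[start + 2 * i] for i in range(lanes)]


def contiguous_load(mem, start, lanes):
    """Model of a contiguous vector load of `lanes` elements."""
    return mem[start:start + lanes]


def uzp1(a, b):
    """Even-indexed elements of the concatenation a:b (UZP1-style)."""
    return (a + b)[0::2]


def uzp2(a, b):
    """Odd-indexed elements of the concatenation a:b (UZP2-style)."""
    return (a + b)[1::2]


def ld2_model(mem, start, lanes):
    """Model of a de-interleaving structure load: returns (evens, odds)."""
    block = mem[start:start + 2 * lanes]
    return block[0::2], block[1::2]


LANES = 4                      # assumed vector length in elements
mem = list(range(100, 116))    # dummy memory contents

# Two contiguous loads followed by an unzip. Note that this reads
# mem[0..7], whereas the even gather alone only touches mem[0], [2],
# [4], [6]; a real lowering has to account for that extra access.
lo = contiguous_load(mem, 0, LANES)
hi = contiguous_load(mem, LANES, LANES)

even_from_contig = uzp1(lo, hi)
odd_from_contig = uzp2(lo, hi)

even_gather = gather_stride2(mem, 0, LANES)   # lanes {0, 2, 4, 6}
odd_gather = gather_stride2(mem, 1, LANES)    # lanes {1, 3, 5, 7}

ld2_even, ld2_odd = ld2_model(mem, 0, LANES)
```

In this model the even and odd gathers together touch exactly the elements an ld2 would, which is the case the last bullet asks about. Whether any of the three forms is faster is a per-target question the sketch says nothing about.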

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D120912/new/

https://reviews.llvm.org/D120912