[llvm] [LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (PR #95819)

Wed Jun 19 02:10:41 PDT 2024

sjoerdmeijer wrote:

> So here are my thoughts. It would be very useful to have some performance data attached to this PR for benchmarks such as SPEC and/or real world applications demonstrating the problems. As far as I understand it, TSVC is not really meant for benchmarking and is in fact meant to showcase the breadth of a compiler's vectorisation capabilities. Although I do understand it's useful for highlighting potential issues. Like @paulwalker-arm said, it's not about blindly favouring SVE in the face of overwhelming data, but more about making the best choices based on data.
> 
> Here are my thoughts:
> 
> 1. If there is a problem with the SVE inc instructions then there is probably a 1-3 line fix for that. Would that fix most of the issues? It would be good to know and if it's proven to be an issue we should do that regardless.
> 2. If there is a problem with poor codegen when interleaving with SVE due to the calculation of load addresses then we should definitely fix that, but it's not really a fundamental reason why SVE is worse than NEON. It seems to me that's a problem with the compiler, rather than a problem with the architecture and we should look into it.
> 3. It is true that NEON has instructions such as LDP and STP that SVE2 does not have but those mostly come into play when interleaving (which GCC isn't doing in the example above). What this patch is doing is effectively disabling SVE for, I assume, the majority of loops, even if we're not interleaving. For example, many of the loops in the SPEC Fortran benchmarks are too large for interleaving.
> 
> Having said that, I definitely understand the LDP, STP problem, but I agree with @paulwalker-arm that if possible it would be nice to see this taken account of with better cost modelling. This is just a suggestion, but you could tweak this patch as follows:
> 
> 1. Instead of choosing upfront whether we prefer a fixed or scalable VF in the event of a tie, we could instead record the tie for later use.
> 2. After invoking `selectInterleaveCount` from LoopVectorize.cpp we can then call a version of this new TTI hook that passes in the Loop pointer, the fixed-width VF and the chosen interleave count. At that point we can choose between the tie at a point in the code where we are able to make a more informed choice.
> 3. In your new hook you can have code like this:
> 
> ```
> bool preferFixedIfEqualToScalable(const Loop *L, ElementCount VF, unsigned IC) {
>   if (IC == 1) // Don't expect to use LDP or STP
>     return false;
> 
>   for (BasicBlock *B : ...)
>     for (Instruction *I : ...)
>       if ((isa<LoadInst> || isa<StoreInst>) && preferNeonForInterleavedLoadStore(I, VF, IC))
>         return true;
> 
>   return false;
> }
> ```
> 
> where `preferNeonForInterleavedLoadStore` can check whether or not it's possible to use LDP or STP here.
> 
> What do you think?

I think this is excellent. Thanks for writing this all up! This is pretty much what I now had in mind too, so yes, let's go for this approach.

To be clear, I don't care about TSVC. It's well appreciated it's synthetic, and I can argue that TSVC is as synthetic as x264 from SPEC INT, which is the app that is most amenable to vectorisation in the suite because real x264 implementations are not implemented like that. But the way I look at this is getting the basics right. Small loops do exist, and also for example, if you create a reproducer, unpredictable and bad performance is really not helping, and that's an understatement. I already mentioned my performance data: SPEC INT + FP are unaffected, and I see 2x improvements in TSVC, and some other benchmark. I will get additional data for the llvm test-suite. But it sounded like @davemgreen did a sweep too.

I will now go ahead and change this patch as suggested. Please do shout if there are still concerns. 
I do want to note that I am intending to keep #95819 because I only want to do this for the V2. Our new hook `preferFixedIfEqualToScalable` will first check  `TTI.preferFixedIfEqualToScalable()` and early exits, because I can't do the benchmarking on (most) other targets. 

https://github.com/llvm/llvm-project/pull/95819