[llvm] [LV][AArch64] Prefer Fixed over Scalable if cost-model is equal (Neoverse V2) (PR #95819)

Wed Jun 19 00:21:16 PDT 2024

david-arm wrote:

So here are my thoughts. It would be very useful to have some performance data attached to this PR for benchmarks such as SPEC and/or real world applications demonstrating the problems. As far as I understand it, TSVC is not really meant for benchmarking and is in fact meant to showcase the breadth of a compiler's vectorisation capabilities. Although I do understand it's useful for highlighting potential issues. Like @paulwalker-arm said, it's not about blindly favouring SVE in the face of overwhelming data, but more about making the best choices based on data.

Here are my thoughts:

1. If there is a problem with the SVE inc instructions then there is probably a 1-3 line fix for that. Would that fix most of the issues? It would be good to know and if it's proven to be an issue we should do that regardless.
2. If there is a problem with poor codegen when interleaving with SVE due to the calculation of load addresses then we should definitely fix that, but it's not really a fundamental reason why SVE is worse than NEON. It seems to me that's a problem with the compiler, rather than a problem with the architecture and we should look into it.
3. It is true that NEON has instructions such as LDP and STP that SVE2 does not have but those mostly come into play when interleaving (which GCC isn't doing in the example above). What this patch is doing is effectively disabling SVE for, I assume, the majority of loops, even if we're not interleaving. For example, many of the loops in the SPEC Fortran benchmarks are too large for interleaving.

Having said that, I definitely understand the LDP, STP problem, but I agree with @paulwalker-arm that if possible it would be nice to see this taken account of with better cost modelling. This is just a suggestion, but you could tweak this patch as follows:

1. Instead of choosing upfront whether we prefer a fixed or scalable VF in the event of a tie, we could instead record the tie for later use.
2. After invoking `selectInterleaveCount` from LoopVectorize.cpp we can then call a version of this new TTI hook that passes in the Loop pointer, the fixed-width VF and the chosen interleave count. At that point we can choose between the tie at a point in the code where we are able to make a more informed choice.
3. In your new hook you can have code like this:

```
bool preferFixedIfEqualToScalable(const Loop *L, ElementCount VF, unsigned IC) {
  if (IC == 1) // Don't expect to use LDP or STP
    return false;

  for (BasicBlock *B : ...)
    for (Instruction *I : ...)
      if ((isa<LoadInst> || isa<StoreInst>) && preferNeonForInterleavedLoadStore(I, VF, IC))
        return true;

  return false;
}
```

where `preferNeonForInterleavedLoadStore` can check whether or not it's possible to use LDP or STP here.

What do you think?

https://github.com/llvm/llvm-project/pull/95819