[llvm] [LV] Add on extra cost for scalarising math calls in vector loops (PR #158611)

David Sherwood via llvm-commits llvm-commits at lists.llvm.org
Tue Sep 16 02:12:04 PDT 2025


david-arm wrote:

> The costs of calls is always a bit difficult. The fmovs are not exactly free but sometimes close to it, and the rest is not far away from the existing 4*call+scalarization overhead.
> 
> I guess the general problem is that any call without a vector calling convention will cause spilling of v/z vector registers if they need to be live across it.

Yeah, you raise a good point - the costs do seem very high, but I set them this high mainly because it was the only way to avoid significant regressions in benchmarks like wrf in future, when I plan to lower the cost of the 128-bit masked loads and stores (currently a cost of 18 for VF=4). It's due to exactly the problem you described about spilling and filling (or rematerialising) SVE predicate and v/z registers. It wasn't obvious to me that we could ever improve the generated code without changing the ABI for libm math routines. In many loops the only thing currently preventing a terrible choice of vectorisation is the very high cost of the fixed-width masked loads and stores. The generated code in such loops (such as wrf) is often so bad that we end up spilling (or rematerialising) 4 or 5 SVE predicate or NEON/SVE vector registers around every function call, and there are several calls in a single loop. What I see is that the more work is done in the loop, the worse it gets, so it's not as if the cost gets amortised. I suppose a more accurate way of costing would be to guess how the backend intends to schedule and allocate registers before and after the call, but I assume that would be fragile and computationally expensive.

When trying out some of my hand-written micro-benchmarks (as well as wrf, etc.) I couldn't see any performance benefit to vectorisation when scalarising math calls, and it also significantly increased code size. I do want to improve the cost model for fixed-width masked loads and stores because the current costs simply don't reflect the generated code, but I'm currently held hostage by this math call scalarisation issue and it's difficult to know how else to proceed.

I'm open to other ideas on how to progress! One of the other problems I've noticed is that a math call can be predicated in C code like this example:

```c
  for (int i = 0; i < n; i++) {
    if (cond[i] > 0.3)
      dst[i] += expf(src[i] * src[i]);
  }
```

When building with -ffast-math, the scalar version keeps the call inside an if-block, but the loop vectoriser if-converts the loop (because expf is safe to speculatively execute). Then, in the backend, we scalarise the vector intrinsic form of expf into individual scalar calls. If the condition is only triggered half of the time, the scalar loop is always going to be faster. The trouble here is that without PGO data we don't know whether we should flatten the loop or not. In such cases I could increase the call scalarisation cost further to prevent vectorisation, which would allow me to drop the cost for loops with unconditional math calls.

https://github.com/llvm/llvm-project/pull/158611
