[llvm] [AArch64][SVE] Enable max vector bandwidth for SVE (PR #109671)

Yuta Mukai via llvm-commits llvm-commits at lists.llvm.org
Wed Jul 16 00:47:41 PDT 2025


ytmukai wrote:

We aim to enable this change by incorporating conservative criteria to mitigate the risk of performance degradation. Through several analyses, we identified two scenarios that could potentially lead to performance issues:

- Cases requiring mask packing/unpacking
- Cases where vector packing/unpacking causes bottlenecks in the computation pipeline

By enabling vector width maximization only when these conditions are not met, we believe we can avoid many performance regressions. We welcome any feedback or advice.

Details of the analysis are as follows:

Regarding mask packing/unpacking, the SPEC cam4_r benchmark was found to be affected by this issue. Extracting the loop that regressed, we have:

https://godbolt.org/z/jshbPoheT

```c
// Extracted from cam4_r vert_interp routine
void f(int n, float *restrict a, double *b, double *c, double *d) {
  for (int i = 0; i < n; i++) {
    if ((b[i] < c[i]) & (c[i] < d[i]))
      a[i] = n;
  }
}
```

When the loop is vectorized at `vscale x 4`, packing instructions are required to reuse the mask created from the double comparisons (at `vscale x 2`) for the float store (at `vscale x 4`). However, the cost model does not account for these instructions, so the cost is underestimated. (For type conversions, the cost of packing/unpacking instructions is folded into the cost of the conversion instructions themselves.)
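To make the unmodeled work concrete, here is a scalar sketch (our own illustration, not compiler output) of the packing step: the two half-width masks produced by the double comparisons must be combined into one full-width mask before the float store can use it. Lane counts and the function name are illustrative.

```c
#include <stdbool.h>

/* Scalar model of the predicate packing the vectorizer must emit when a
 * mask built from double comparisons (vscale x 2 lanes) is reused for a
 * float store (vscale x 4 lanes): the two half-width masks are
 * concatenated into one full-width mask. This extra step is what the
 * cost model fails to charge for. */
void pack_masks(const bool *lo, const bool *hi, bool *full, int half) {
  for (int i = 0; i < half; i++) {
    full[i] = lo[i];         /* mask from the first half of the doubles  */
    full[half + i] = hi[i];  /* mask from the second half of the doubles */
  }
}
```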

The actual execution results on Neoverse V2 with `n=512` are as follows (compiled with `-mcpu=fujitsu-monaka` to prevent ASIMD from being selected):

|                                       | Selected VF | Cost Model         | #Cycles | #Instructions |
|---------------------------------------|-------------|--------------------|--------:|--------------:|
| Default                               |  vscale x 2 | Cost=9 (2.2/lane)  |     512 |          2463 |
| -mllvm -vectorizer-maximize-bandwidth |  vscale x 4 | Cost=14 (1.8/lane) |     589 |          2529 |

Although the cost model suggests `vscale x 4` is superior, actual execution shows increased time and instruction count. With `vscale x 4`, 30% of the executed instructions are mask packing.

Excluding loops with potential mask packing/unpacking requirements, such as those containing if statements, can prevent performance degradation from incorrect VF selection.
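The first criterion could be sketched as a simple filter like the following (a sketch of our own; the function and parameter names are hypothetical, not LLVM's actual API): bandwidth maximization is skipped when a loop guards memory accesses with a condition and also mixes element widths, since the predicate would then need packing/unpacking that the cost model does not charge for.

```c
/* Hypothetical filter for the first criterion: allow bandwidth
 * maximization only when no predicate created at one element width
 * would have to be packed/unpacked for use at another. */
int allow_maximize_bandwidth(int smallest_type_bits,
                             int widest_type_bits,
                             int has_conditional_access) {
  int mixed_widths = smallest_type_bits != widest_type_bits;
  /* Conditional access + mixed widths implies unmodeled mask packing. */
  return !(has_conditional_access && mixed_widths);
}
```

For the cam4_r loop above (float stores guarded by a mask built from double comparisons), `allow_maximize_bandwidth(32, 64, 1)` would return 0 and the default VF would be kept.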

Additionally, we found that performance can degrade even without an increase in instruction count. The following loop is an example:

https://godbolt.org/z/Gr6rEevbP

```c
void f(int n, short *restrict a, long *b, double *c) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
  }
}
```

The execution results on Neoverse V2 with `n=512` are as follows:

|                                       | Selected VF | Cost Model         | #Cycles | #Instructions | #Loads/Stores | #Vector Ops |
|---------------------------------------|-------------|--------------------|--------:|--------------:|--------------:|------------:|
| Default                               |  vscale x 2 | Cost=8 (2.0/lane)  |     286 |          2207 |           384 |         384 |
| -mllvm -vectorizer-maximize-bandwidth |  vscale x 8 | Cost=26 (1.6/lane) |     451 |          1951 |           288 |         480 |

In this case, packing/unpacking instructions are considered in the cost model, and the actual instruction count for `vscale x 8` is lower. However, the increased pressure that packing/unpacking places on the vector computation pipeline becomes a bottleneck, resulting in performance degradation. To avoid this, a larger VF could be adopted only when the vector-computation cost improves. (Ideally, the cost model would also represent pipeline types.)
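The second criterion could be sketched as follows (our own formulation, not LLVM code; names are hypothetical): adopt a wider VF only when the per-lane cost of the vector-computation portion, excluding loads and stores, does not increase, so that pack/unpack operations cannot silently saturate the vector pipeline.

```c
/* Hypothetical check for the second criterion: compare per-lane
 * vector-op cost (wide_cost / wide_lanes vs. narrow_cost / narrow_lanes)
 * using cross-multiplication to avoid division. Returns nonzero when
 * the wider VF should be preferred. */
int prefer_wider_vf(double narrow_vecop_cost, int narrow_lanes,
                    double wide_vecop_cost, int wide_lanes) {
  return wide_vecop_cost * narrow_lanes <= narrow_vecop_cost * wide_lanes;
}
```

With made-up numbers for the loop above, say a vector-op cost of 4 at `vscale x 2` versus 20 at `vscale x 8`, the per-lane cost rises from 2.0 to 2.5 and the wider VF would be rejected.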

We tested the performance by randomly selecting from the following types for each variable in the loop above: `char`, `short`, `int`, `long`, `__fp16`, `_Float16`, `float`, and `double`. With `-vectorizer-maximize-bandwidth` enabled, out of 100 cases, execution time increased by more than 10% in 18 cases and decreased by more than 10% in 14 cases. By limiting to cases where vector computation costs do not increase, the number of cases with a 10% increase dropped to 3, while those with a 10% decrease remained at 12.


https://github.com/llvm/llvm-project/pull/109671
