[llvm] [LV] Add a flag to conservatively choose a larger vector factor when maximizing bandwidth (PR #156012)
Yuta Mukai via llvm-commits
llvm-commits at lists.llvm.org
Thu Sep 4 06:13:07 PDT 2025
ytmukai wrote:
> This sounds like the cost of those wide vector instructions is calculated incorrectly, not considering the cost of the extra instructions. Could this not be fixed directly in the cost model?
Even when an instruction-count-based cost calculation is accurate, selecting the option with the lower modeled cost can still degrade overall performance if it increases the load on a bottleneck pipeline. The loop below illustrates this.
https://godbolt.org/z/h6n34qfEr
```c
void f(int n, short *restrict a, long *b, double *c) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
  }
}
```
With SVE, enabling `-vectorizer-maximize-bandwidth` causes the vectorizer to select `vscale x 8`. The corresponding cost calculation and generated instruction sequence are shown below:
```
.LBB0_9:
lsl x12, x11, #3
ld1d { z0.d }, p0/z, [x2, x11, lsl #3]
ld1d { z4.d }, p0/z, [x3, x11, lsl #3]
add x13, x2, x12
add x12, x3, x12
ldr z1, [x13, #1, mul vl]
ldr z2, [x13, #2, mul vl]
ldr z3, [x13, #3, mul vl]
scvtf z0.d, p0/m, z0.d
ldr z5, [x12, #3, mul vl]
ldr z6, [x12, #2, mul vl]
ldr z7, [x12, #1, mul vl]
scvtf z3.d, p0/m, z3.d
scvtf z2.d, p0/m, z2.d
scvtf z1.d, p0/m, z1.d
fadd z0.d, z4.d, z0.d
fadd z1.d, z7.d, z1.d
fadd z2.d, z6.d, z2.d
fadd z3.d, z5.d, z3.d
fcvtzs z0.d, p0/m, z0.d
fcvtzs z3.d, p0/m, z3.d
fcvtzs z2.d, p0/m, z2.d
fcvtzs z1.d, p0/m, z1.d
uzp1 z2.s, z2.s, z3.s
uzp1 z0.s, z0.s, z1.s
uzp1 z0.h, z0.h, z2.h
st1h { z0.h }, p1, [x1, x11, lsl #1]
inch x11
cmp x9, x11
b.ne .LBB0_9
Cost of 1 for VF vscale x 8: induction instruction %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
Cost of 1 for VF vscale x 8: exit condition instruction %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
Cost of 4 for VF vscale x 8: WIDEN ir<%0> = load vp<%6>
Cost of 4 for VF vscale x 8: WIDEN-CAST ir<%conv> = sitofp ir<%0> to double
Cost of 4 for VF vscale x 8: WIDEN ir<%1> = load vp<%7>
Cost of 4 for VF vscale x 8: WIDEN ir<%add> = fadd ir<%1>, ir<%conv>
Cost of 7 for VF vscale x 8: WIDEN-CAST ir<%conv3> = fptosi ir<%add> to i16
Cost of 1 for VF vscale x 8: WIDEN store vp<%8>, ir<%conv3>
Cost for VF vscale x 8: 26 (Estimated cost per lane: 1.6)
(Zero-cost entries are omitted)
```
The cost of the pack instructions is accounted for in the `fptosi` cost (calculated as `7 = fcvtzs x 4 + uzp1 x 3`). The other costs are also consistent with their respective instruction counts. In actual measurements, the number of executed instructions is indeed lower than with the `vscale x 2` version.
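For reference, the small standalone sketch below re-derives the totals from the dump. The tuning vscale of 2 is my inference from the two reports (26/16 ≈ 1.6 and 8/4 = 2.0), not something printed by the vectorizer:

```c
#include <stdio.h>

int main(void) {
  /* Per-recipe costs listed in the VF = vscale x 8 dump above. */
  int costs[] = {1, 1, 4, 4, 4, 4, 7, 1};
  int total = 0;
  for (unsigned i = 0; i < sizeof(costs) / sizeof(costs[0]); i++)
    total += costs[i];

  int assumed_vscale = 2;          /* assumption inferred from the dumps */
  int lanes = assumed_vscale * 8;  /* VF = vscale x 8 */
  printf("total = %d, per lane = %.3f\n", total, (double)total / lanes);
  /* prints: total = 26, per lane = 1.625, matching the reported 26 and ~1.6 */
  return 0;
}
```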
However, on Neoverse V2, the execution time increased (measured with n=512). I believe the reason for this, as shown in the table below, is that the workload balance shifted and increased the load on the vector operation pipeline.
| | Selected VF | Cost Model | #Cycles | #Instructions | #Loads/Stores | #Vector Ops |
|---------------------------------------|-------------|--------------------|--------:|--------------:|--------------:|------------:|
| Default | vscale x 2 | Cost=8 (2.0/lane) | 286 | 2207 | 384 | 384 |
| -mllvm -vectorizer-maximize-bandwidth | vscale x 8 | Cost=26 (1.6/lane) | 451 | 1951 | 288 | 480 |
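To make the pipeline-balance argument concrete, here is a minimal throughput-bound sketch using the counts from the table. The per-pipeline issue widths are hypothetical placeholders, not Neoverse V2 figures, and real behavior also depends on latencies and dependencies:

```c
#include <stdio.h>

/* Minimal sketch: the loop can run no faster than its busiest pipeline
 * group allows, regardless of the total instruction count. */
int main(void) {
  struct { const char *name; int vec_ops, ld_st; } cfg[] = {
      {"default (vscale x 2)",            384, 384}, /* counts from the table */
      {"maximize-bandwidth (vscale x 8)", 480, 288},
  };
  const int vec_pipes = 2, ldst_pipes = 2; /* hypothetical issue widths */

  for (int i = 0; i < 2; i++) {
    double vec_bound = (double)cfg[i].vec_ops / vec_pipes;
    double mem_bound = (double)cfg[i].ld_st / ldst_pipes;
    double bound = vec_bound > mem_bound ? vec_bound : mem_bound;
    printf("%-33s lower bound ~%.0f cycles (vec %.0f, ld/st %.0f)\n",
           cfg[i].name, bound, vec_bound, mem_bound);
  }
  return 0;
}
```

With these placeholder widths the bound rises from 192 to 240 cycles even though the total instruction count falls, which matches the direction of the measured regression (the measured gap is larger, but the point is that shifting work onto the vector pipes can outweigh executing fewer instructions).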
https://github.com/llvm/llvm-project/pull/156012