[llvm] [LV] Add a flag to conservatively choose a larger vector factor when maximizing bandwidth (PR #156012)
Yuta Mukai via llvm-commits
llvm-commits at lists.llvm.org
Thu Sep 4 06:13:07 PDT 2025
ytmukai wrote:
> This sounds like the cost of those wide vector instructions is calculated incorrectly, not considering the cost of the extra instructions. Could this not be fixed directly in the cost model?
Even when an instruction-count-based cost calculation is accurate, selecting the option with the lower modeled cost can still degrade overall performance if it increases the load on a bottleneck pipeline. The loop below illustrates this.
https://godbolt.org/z/h6n34qfEr
```c
void f(int n, short *restrict a, long *b, double *c) {
  for (int i = 0; i < n; i++) {
    a[i] = b[i] + c[i];
  }
}
```
With SVE, enabling `-vectorizer-maximize-bandwidth` causes the vectorizer to select `vscale x 8`. The corresponding cost calculation and generated instruction sequence are shown below:
```
.LBB0_9:
lsl x12, x11, #3
ld1d { z0.d }, p0/z, [x2, x11, lsl #3]
ld1d { z4.d }, p0/z, [x3, x11, lsl #3]
add x13, x2, x12
add x12, x3, x12
ldr z1, [x13, #1, mul vl]
ldr z2, [x13, #2, mul vl]
ldr z3, [x13, #3, mul vl]
scvtf z0.d, p0/m, z0.d
ldr z5, [x12, #3, mul vl]
ldr z6, [x12, #2, mul vl]
ldr z7, [x12, #1, mul vl]
scvtf z3.d, p0/m, z3.d
scvtf z2.d, p0/m, z2.d
scvtf z1.d, p0/m, z1.d
fadd z0.d, z4.d, z0.d
fadd z1.d, z7.d, z1.d
fadd z2.d, z6.d, z2.d
fadd z3.d, z5.d, z3.d
fcvtzs z0.d, p0/m, z0.d
fcvtzs z3.d, p0/m, z3.d
fcvtzs z2.d, p0/m, z2.d
fcvtzs z1.d, p0/m, z1.d
uzp1 z2.s, z2.s, z3.s
uzp1 z0.s, z0.s, z1.s
uzp1 z0.h, z0.h, z2.h
st1h { z0.h }, p1, [x1, x11, lsl #1]
inch x11
cmp x9, x11
b.ne .LBB0_9
Cost of 1 for VF vscale x 8: induction instruction %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
Cost of 1 for VF vscale x 8: exit condition instruction %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
Cost of 4 for VF vscale x 8: WIDEN ir<%0> = load vp<%6>
Cost of 4 for VF vscale x 8: WIDEN-CAST ir<%conv> = sitofp ir<%0> to double
Cost of 4 for VF vscale x 8: WIDEN ir<%1> = load vp<%7>
Cost of 4 for VF vscale x 8: WIDEN ir<%add> = fadd ir<%1>, ir<%conv>
Cost of 7 for VF vscale x 8: WIDEN-CAST ir<%conv3> = fptosi ir<%add> to i16
Cost of 1 for VF vscale x 8: WIDEN store vp<%8>, ir<%conv3>
Cost for VF vscale x 8: 26 (Estimated cost per lane: 1.6)
(Zero-cost entries are omitted)
```
The cost of the pack instructions is accounted for in the `fptosi` cost (calculated as `7 = fcvtzs x 4 + uzp1 x 3`). The other costs are also consistent with their respective instruction counts. In actual measurements, the number of executed instructions is indeed lower than with the `vscale x 2` version.
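For reference, the small standalone sketch below re-derives the totals from the dump. The tuning vscale of 2 is my inference from the two reports (26/16 ≈ 1.6 and 8/4 = 2.0), not something printed by the vectorizer:

```c
#include <stdio.h>

int main(void) {
  /* Per-recipe costs listed in the VF = vscale x 8 dump above. */
  int costs[] = {1, 1, 4, 4, 4, 4, 7, 1};
  int total = 0;
  for (unsigned i = 0; i < sizeof(costs) / sizeof(costs[0]); i++)
    total += costs[i];

  int assumed_vscale = 2;          /* assumption inferred from the dumps */
  int lanes = assumed_vscale * 8;  /* VF = vscale x 8 */
  printf("total = %d, per lane = %.3f\n", total, (double)total / lanes);
  /* prints: total = 26, per lane = 1.625, matching the reported 26 and ~1.6 */
  return 0;
}
```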
However, on Neoverse V2, the execution time increased (measured with n=512). I believe the reason for this, as shown in the table below, is that the workload balance shifted and increased the load on the vector operation pipeline.
| | Selected VF | Cost Model | #Cycles | #Instructions | #Loads/Stores | #Vector Ops |
|---------------------------------------|-------------|--------------------|--------:|--------------:|--------------:|------------:|
| Default | vscale x 2 | Cost=8 (2.0/lane) | 286 | 2207 | 384 | 384 |
| -mllvm -vectorizer-maximize-bandwidth | vscale x 8 | Cost=26 (1.6/lane) | 451 | 1951 | 288 | 480 |
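To make the pipeline-balance argument concrete, here is a minimal throughput-bound sketch using the counts from the table. The per-pipeline issue widths are hypothetical placeholders, not Neoverse V2 figures, and real behavior also depends on latencies and dependencies:

```c
#include <stdio.h>

/* Minimal sketch: the loop can run no faster than its busiest pipeline
 * group allows, regardless of the total instruction count. */
int main(void) {
  struct { const char *name; int vec_ops, ld_st; } cfg[] = {
      {"default (vscale x 2)",            384, 384}, /* counts from the table */
      {"maximize-bandwidth (vscale x 8)", 480, 288},
  };
  const int vec_pipes = 2, ldst_pipes = 2; /* hypothetical issue widths */

  for (int i = 0; i < 2; i++) {
    double vec_bound = (double)cfg[i].vec_ops / vec_pipes;
    double mem_bound = (double)cfg[i].ld_st / ldst_pipes;
    double bound = vec_bound > mem_bound ? vec_bound : mem_bound;
    printf("%-33s lower bound ~%.0f cycles (vec %.0f, ld/st %.0f)\n",
           cfg[i].name, bound, vec_bound, mem_bound);
  }
  return 0;
}
```

With these placeholder widths the bound rises from 192 to 240 cycles even though the total instruction count falls, which matches the direction of the measured regression (the measured gap is larger, but the point is that shifting work onto the vector pipes can outweigh executing fewer instructions).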
https://github.com/llvm/llvm-project/pull/156012