[llvm] [VPlan] Add the cost of spills when considering register pressure (PR #179646)
John Brawn via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 4 04:31:26 PST 2026
john-brawn-arm wrote:
The motivation for doing this is that I'm looking at enabling shouldConsiderVectorizationRegPressure on Arm Cortex-M CPUs with MVE, and the current behaviour makes things significantly worse in some cases because it prevents vectorization even when vectorizing would be beneficial. I've been looking specifically at the code we generate for CMSIS-DSP (https://github.com/ARM-software/CMSIS-DSP) on Cortex-M55. If I enable consideration of vectorization register pressure, then with the current behaviour the change in throughput is:
| Function | Change |
| -------- | ------ |
| Filtering/arm_conv_partial_q31 | 3.64% |
| Filtering/arm_conv_q31 | 1.37% |
| Filtering/arm_correlate_q31 | -1.15% |
| Filtering/arm_fir_decimate_f32 | -0.92% |
| Filtering/arm_fir_decimate_q31 | -56.92% |
| Filtering/arm_fir_f32_16taps | 187.17% |
| Filtering/arm_fir_f32_4taps | 34.33% |
| Filtering/arm_fir_f32_8taps | 350.95% |
| Filtering/arm_fir_q31_16taps | -4.88% |
| Filtering/arm_fir_q31_4taps | 0.54% |
| Filtering/arm_fir_q31_8taps | -9.04% |
| Matrix/arm_mat_vec_mult_f16 | -52.49% |
| Matrix/arm_mat_vec_mult_f32 | -48.69% |
| Quaternion/arm_quaternion2rotation_f32 | -4.66% |
| Transform/arm_cfft_f16 | -7.94% |
| Transform/arm_rfft_fast_f16 | -0.39% |
| Transform/arm_rfft_fast_f32 | -8.86% |
With this PR the change in throughput is:
| Function | Change |
| -------- | ------ |
| Filtering/arm_fir_f32_16taps | 187.17% |
| Filtering/arm_fir_f32_4taps | 34.33% |
| Filtering/arm_fir_f32_8taps | 350.95% |
| Filtering/arm_fir_q31_16taps | -4.88% |
| Filtering/arm_fir_q31_4taps | 0.54% |
| Filtering/arm_fir_q31_8taps | -9.04% |
| Quaternion/arm_quaternion2rotation_f32 | -4.66% |
The remaining regressions are due to the relative costs of interleave vs gather/scatter vs scalarize being wrong in some cases, which I'll be looking at next.
I've also checked llvm-test-suite on Neoverse-V2 (AWS Graviton 4), where useMaxBandwidth is enabled for scalable vectors and so register pressure calculation is used, and there's zero change in code generation.
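To illustrate the general idea in the PR title, here is a minimal standalone sketch contrasting two policies: rejecting a vectorization factor outright when its register pressure exceeds the register budget, versus keeping it as a candidate and adding an estimated spill cost. This is not the code in the PR and uses no LLVM APIs; `Candidate`, `pickWithHardLimit`, `pickWithSpillCost`, `costWithSpills` and the flat per-register spill cost are all hypothetical names and simplifications chosen for the example.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical model of one vectorization candidate (not an LLVM type).
struct Candidate {
  unsigned VF;          // vectorization factor (elements per iteration)
  uint64_t LoopCost;    // estimated cost of one iteration, ignoring spills
  unsigned MaxLiveRegs; // peak number of simultaneously live registers
};

// Compare cost per element without division: CostA/VFA < CostB/VFB.
static bool isCheaperPerLane(uint64_t CostA, unsigned VFA,
                             uint64_t CostB, unsigned VFB) {
  return CostA * VFB < CostB * VFA;
}

// Assumed spill model: every register over budget costs a fixed amount.
uint64_t costWithSpills(const Candidate &C, unsigned NumRegs,
                        uint64_t SpillCostPerReg) {
  unsigned Excess = C.MaxLiveRegs > NumRegs ? C.MaxLiveRegs - NumRegs : 0;
  return C.LoopCost + uint64_t(Excess) * SpillCostPerReg;
}

// Policy 1: discard any candidate whose pressure exceeds the budget.
std::optional<unsigned> pickWithHardLimit(const std::vector<Candidate> &Cs,
                                          unsigned NumRegs) {
  const Candidate *Best = nullptr;
  for (const Candidate &C : Cs) {
    if (C.MaxLiveRegs > NumRegs)
      continue; // rejected regardless of how cheap it is otherwise
    if (!Best ||
        isCheaperPerLane(C.LoopCost, C.VF, Best->LoopCost, Best->VF))
      Best = &C;
  }
  if (!Best)
    return std::nullopt;
  return Best->VF;
}

// Policy 2: keep every candidate, but charge it for the expected spills.
std::optional<unsigned> pickWithSpillCost(const std::vector<Candidate> &Cs,
                                          unsigned NumRegs,
                                          uint64_t SpillCostPerReg) {
  const Candidate *Best = nullptr;
  uint64_t BestCost = 0;
  for (const Candidate &C : Cs) {
    uint64_t Cost = costWithSpills(C, NumRegs, SpillCostPerReg);
    if (!Best || isCheaperPerLane(Cost, C.VF, BestCost, Best->VF)) {
      Best = &C;
      BestCost = Cost;
    }
  }
  if (!Best)
    return std::nullopt;
  return Best->VF;
}
```

With made-up candidates `{VF=1, cost 10, 2 live regs}` and `{VF=4, cost 16, 10 live regs}` and a budget of 8 registers, the hard limit keeps only VF=1, while the spill-cost policy still picks VF=4 when spills are cheap (a per-register cost of 4 gives 24 per iteration, or 6 per element, against 10 for VF=1) and falls back to VF=1 when they are expensive.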
https://github.com/llvm/llvm-project/pull/179646