[llvm] [LoopUnroll] Introduce parallel reduction phis when unrolling. (PR #149470)

Sun Jul 20 04:37:58 PDT 2025

fhahn wrote:

> There are also cases where we can interleave, but not unroll, so it might make sense to do both. For example:
> 
> ```
> void f(int *p, int *sum) {
>   #pragma clang loop vectorize_width(1) interleave_count(4)
>   for (int i = 0; i < 10000; ++i) {
>     *sum += p[i];
>   }
> }
> ```

Yep, currently LoopVectorize's interleaving will always take precedence, as it runs before partial/runtime unrolling. The patch here doesn't change any cost decisions yet; it just improves throughput in cases where we have already decided to unroll (either partial or runtime unrolling; when fully unrolling, SLPVectorizer or the backend should handle reassociation to improve throughput).
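To illustrate the throughput point, here is a rough source-level sketch of what parallel reduction phis buy when a reduction loop is unrolled by 4. It is only an analogy for the IR-level transform, not the patch's actual output: the function names `sum_serial` and `sum_parallel` are made up for this example, the unroll factor of 4 is arbitrary, and `unsigned` is used instead of `int` so that reassociating the adds is well-defined in C.

```c
#include <stddef.h>

/* Unrolled by 4 with a single accumulator: each add depends on the
 * previous one, so the loop is bound by the latency of the add chain. */
unsigned sum_serial(const unsigned *p, size_t n) {
  unsigned s = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s += p[i];
    s += p[i + 1];
    s += p[i + 2];
    s += p[i + 3];
  }
  for (; i < n; ++i) /* remainder iterations */
    s += p[i];
  return s;
}

/* Unrolled by 4 with four independent accumulators (the source-level
 * analogue of one reduction phi per unrolled iteration). The four add
 * chains can execute in parallel and are combined once after the loop. */
unsigned sum_parallel(const unsigned *p, size_t n) {
  unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += p[i];
    s1 += p[i + 1];
    s2 += p[i + 2];
    s3 += p[i + 3];
  }
  unsigned s = (s0 + s1) + (s2 + s3);
  for (; i < n; ++i) /* remainder iterations */
    s += p[i];
  return s;
}
```

Both functions return the same value for any input, since unsigned addition is associative and commutative; only the dependency structure inside the loop differs.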

With this capability in the unroller, some loops can become profitable to runtime/partially unroll on some platforms. https://github.com/llvm/llvm-project/pull/149699 enables partial/runtime unrolling for Apple CPUs for loops with reductions.

For the loop above, I think we could partially unroll it (and introduce parallel reduction phis). Partial unrolling is disabled for most AArch64 CPUs at the moment, but the loop does get unrolled (if vectorization is disabled) for cortex-a55, for example: https://clang.godbolt.org/z/4xcvYorPe

https://github.com/llvm/llvm-project/pull/149470