[llvm] [LoopUnroll] Introduce parallel reduction phis when unrolling. (PR #149470)
Florian Hahn via llvm-commits
llvm-commits at lists.llvm.org
Sun Jul 20 04:37:58 PDT 2025
fhahn wrote:
> There are also cases where we can interleave, but not unroll, so it might make sense to do both. For example:
>
> ```
> void f(int *p, int *sum) {
> #pragma clang loop vectorize_width(1) interleave_count(4)
> for (int i = 0; i < 10000; ++i) {
> *sum += p[i];
> }
> }
> ```
Yep, currently LoopVectorize's interleaving will always take precedence, as it runs before partial/runtime unrolling. The patch here doesn't change any cost decisions yet; it just improves throughput when we have already decided to unroll (either partial or runtime unrolling; when fully unrolling, SLPVectorizer/the backend should handle reassociation to improve throughput).
With this capability in the unroller, some loops can become profitable to runtime/partially unroll on some platforms. https://github.com/llvm/llvm-project/pull/149699 enables partial/runtime unrolling for Apple CPUs for loops with reductions.
For the loop above, I think we could partially unroll it (and introduce parallel reduction phis). That is currently disabled for most AArch64 CPUs, but the loop does get unrolled (if vectorization is disabled) for cortex-a55, for example: https://clang.godbolt.org/z/4xcvYorPe
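As a rough source-level illustration of what parallel reduction phis buy us when unrolling by 4: the sketch below contrasts a single accumulator with four independent ones. This is only an analogy written in C, not the IR the patch produces; the function names `sum_serial` and `sum_parallel` are made up for illustration, and `unsigned` is used (instead of `int` as in the quoted loop) so that reassociating the adds stays well-defined on wraparound.

```c
#include <stddef.h>

/* Unrolled by 4 with one accumulator: every add depends on the
   previous one, so the adds form a single serial chain. */
unsigned sum_serial(const unsigned *p, size_t n) {
  unsigned sum = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    sum += p[i];
    sum += p[i + 1];
    sum += p[i + 2];
    sum += p[i + 3];
  }
  for (; i < n; ++i) /* remainder iterations */
    sum += p[i];
  return sum;
}

/* Unrolled by 4 with four accumulators (the source-level analogue of
   one reduction phi per unrolled copy): the four adds in each
   iteration are independent and are only combined after the loop. */
unsigned sum_parallel(const unsigned *p, size_t n) {
  unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    s0 += p[i];
    s1 += p[i + 1];
    s2 += p[i + 2];
    s3 += p[i + 3];
  }
  unsigned sum = (s0 + s1) + (s2 + s3); /* combine partial sums */
  for (; i < n; ++i) /* remainder iterations */
    sum += p[i];
  return sum;
}
```

Both functions return the same value for any input, since unsigned addition is associative and commutative modulo 2^N; the second form just shortens the loop-carried dependency chain.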
https://github.com/llvm/llvm-project/pull/149470