[llvm] [SDAG] Add partial_reduce_sumla node (PR #141267)

Thu Oct 2 01:48:52 PDT 2025

sdesmalen-arm wrote:

> Sorry for the out-of-the-blue questions... Is there a plan to handle partial reductions when vector sizes do not exactly match a hardware instruction (e.g. UDOT)? Right now, it looks like such cases do not match and fall back to the ladder algorithm. Do you (folks @arm) think that it may be worth lowering them to, e.g., a sequence of UDOTs instead? An example is `@llvm.experimental.vector.partial.reduce.add.v2i32.v16i32`. How does the upper layer that generates partial reductions (VPlan?) choose the vector sizes? It has to know the exact CPU variant and the IR becomes tied to that variant, is that right?

At the moment the choice of VF is guided by the cost-model. I've recently shared #158641 to try to take different kinds of type legalisation into account. If a VF would result in inefficient codegen, the cost-model should prevent that VF from being chosen.

The case you mentioned requires some kind of widening, but I don't think we'd normally end up with a v2i32.v16i32, because the loop vectorizer will try to generate a partial reduction with the same scale factor as the extend. For example, if the extend is i8 -> i32 it would have a scale factor of 4, so the vector being reduced would always have 4 times as many lanes as the accumulator/phi node. That means we'd end up with either v4i32.v16i32 or v2i32.v8i32. The latter would already result in a single `udot`.
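To make the lane arithmetic concrete, here is a rough Python model (not LLVM code). The function names `accumulator_lanes` and `udot_model` are hypothetical, and `udot_model` only assumes the usual 4-way dot-product behaviour where each i32 accumulator lane absorbs four adjacent i8 products:

```python
MASK32 = 0xFFFFFFFF  # model i32 wrap-around


def accumulator_lanes(input_lanes: int, src_bits: int, dst_bits: int) -> int:
    """Lanes in the accumulator when the partial reduction uses the
    same scale factor as the extend (e.g. i8 -> i32 gives 4)."""
    scale = dst_bits // src_bits
    assert input_lanes % scale == 0, "input lanes must be a multiple of the scale factor"
    return input_lanes // scale


def udot_model(acc, a, b):
    """Hypothetical reference model of a 4-way unsigned dot product:
    lane i of the result is acc[i] plus the sum of four adjacent
    products a[4*i+j] * b[4*i+j]."""
    assert len(a) == len(b) == 4 * len(acc)
    return [
        (acc[i] + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))) & MASK32
        for i in range(len(acc))
    ]
```

With an i8 -> i32 extend, `accumulator_lanes(16, 8, 32)` and `accumulator_lanes(8, 8, 32)` give the two accumulator widths mentioned above, and `udot_model` covers each of those shapes in one call.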

If we did need to handle v2i32.v16i32, we could always do a better job generating code for it. For example, we could let ISel implement the following widening: https://godbolt.org/z/aPzdz48Mr
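As a sketch of what one such widening could look like for the `v2i32.v16i32` case from the question (a Python model under my own assumptions, not necessarily what the linked example does): pad the accumulator with zeros to four lanes, do one 4-way dot product, then fold the upper half into the lower half. This relies on the partial reduction only needing the result lanes to sum to the same total, not on which input lanes land in which result lane. The names `udot_model` and `widened_partial_reduce` are hypothetical:

```python
MASK32 = 0xFFFFFFFF  # model i32 wrap-around


def udot_model(acc, a, b):
    """Hypothetical 4-way dot-product model: each i32 lane of acc
    absorbs four adjacent i8 products."""
    assert len(a) == len(b) == 4 * len(acc)
    return [
        (acc[i] + sum(a[4 * i + j] * b[4 * i + j] for j in range(4))) & MASK32
        for i in range(len(acc))
    ]


def widened_partial_reduce(acc2, a16, b16):
    """Reduce 16 products into a 2-lane accumulator by widening.

    Assumed strategy: widen the accumulator with zero lanes,
    do a single 4-lane dot product, then add the high half to the low half.
    """
    assert len(acc2) == 2 and len(a16) == len(b16) == 16
    wide = udot_model(list(acc2) + [0, 0], a16, b16)
    return [(wide[0] + wide[2]) & MASK32, (wide[1] + wide[3]) & MASK32]
```

The extra cost over the natively supported shapes in this model is one add of the two halves, which is the kind of thing the cost-model would need to account for.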

https://github.com/llvm/llvm-project/pull/141267