[llvm] Add `llvm.vector.partial.reduce.fadd` intrinsic (PR #159776)

Wed Sep 24 06:56:34 PDT 2025

================
@@ -20614,6 +20614,48 @@ performance, and an out-of-loop phase to calculate the final scalar result.
 By avoiding the introduction of new ordering constraints, these intrinsics
 enhance the ability to leverage a target's accumulation instructions.
 
+'``llvm.vector.partial.reduce.fadd.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <4 x f32> @llvm.vector.partial.reduce.fadd.v4f32.v8f32(<4 x f32> %a, <8 x f32> %b)
+      declare <vscale x 4 x f32> @llvm.vector.partial.reduce.fadd.nxv4f32.nxv8f32(<vscale x 4 x f32> %a, <vscale x 8 x f32> %b)
+
+Overview:
+"""""""""
+
+The '``llvm.vector.partial.reduce.fadd.*``' intrinsics reduce the
+concatenation of the two vector arguments down to the number of elements of the
+result vector type.
+
+Arguments:
+""""""""""
+
+The first argument is a floating-point vector with the same type as the result.
+
+The second argument is a vector with a length that is a known integer multiple
+of the result's type, while maintaining the same element type.
+
+Semantics:
+""""""""""
+
+Other than the reduction operator (e.g. fadd) the way in which the concatenated
+arguments is reduced is entirely unspecified. By their nature these intrinsics
+are not expected to be useful in isolation but instead implement the first phase
+of an overall reduction operation.
+
+The typical use case is loop vectorization where reductions are split into an
+in-loop phase, where maintaining an unordered vector result is important for
+performance, and an out-of-loop phase to calculate the final scalar result.
+
+By avoiding the introduction of new ordering constraints, these intrinsics
+enhance the ability to leverage a target's accumulation instructions.
----------------
dheaton-arm wrote:

So, given the point @paulwalker-arm raised here: https://github.com/llvm/llvm-project/pull/159776#pullrequestreview-3262719626

I'm inclined to think it makes the most sense to keep the partial reduction intrinsics consistent with each other. That leaves two options: change the semantics of `llvm.vector.partial.reduce.add`, which in my opinion is out of scope for this PR, or match its semantics for this intrinsic (at least for now). With the second option, both intrinsics could be changed in the same way later if we decide to.

Separately, I think this intrinsic still needs a note about reassociation, since a lowering to `fdot`, for example, could give a slightly different result than the `fpext->fmul->fadd` sequence. If we're content that the `reassoc` flag isn't suited to this use case, though, I'm happy to revert the change requiring it, per the retracted feedback.
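
To make the reassociation concern concrete, here is a minimal sketch in Python rather than IR. It is an illustration only: the function names are hypothetical, Python doubles stand in for the IR float types, and neither function is claimed to match any actual lowering. Both reduce an 8-element input into a 4-element accumulator, which the "entirely unspecified" wording would appear to permit, yet the final scalar sums differ.

```python
def reduce_strided(acc, vec):
    """Hypothetical lowering A: lane i accumulates vec[i], then vec[i + n]."""
    n = len(acc)
    return [(acc[i] + vec[i]) + vec[i + n] for i in range(n)]


def reduce_pairwise(acc, vec):
    """Hypothetical lowering B: lane i accumulates the pre-summed adjacent pair."""
    return [acc[i] + (vec[2 * i] + vec[2 * i + 1]) for i in range(len(acc))]


def final_sum(lanes):
    """Out-of-loop phase: a plain left-to-right scalar reduction."""
    total = 0.0
    for x in lanes:
        total += x
    return total


acc = [0.0, 0.0, 0.0, 0.0]
# 1e16 + 1.0 rounds back to 1e16 in double precision, so the 1.0 is lost
# whenever it is added to 1e16 before the -1e16 cancels it.
vec = [1e16, 1.0, 0.0, 0.0, -1e16, 0.0, 0.0, 0.0]

strided_total = final_sum(reduce_strided(acc, vec))
pairwise_total = final_sum(reduce_pairwise(acc, vec))
print(strided_total, pairwise_total)
```

The two groupings are equally valid partial reductions of the same input, so any wording about reassociation would need to cover this kind of divergence and not only the `fdot` case.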

https://github.com/llvm/llvm-project/pull/159776