[llvm] Add `llvm.vector.partial.reduce.fadd` intrinsic (PR #159776)

Mon Sep 22 07:47:30 PDT 2025

================
@@ -20614,6 +20614,48 @@ performance, and an out-of-loop phase to calculate the final scalar result.
 By avoiding the introduction of new ordering constraints, these intrinsics
 enhance the ability to leverage a target's accumulation instructions.
 
+'``llvm.vector.partial.reduce.fadd.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <4 x f32> @llvm.vector.partial.reduce.fadd.v4f32.v8f32(<4 x f32> %a, <8 x f32> %b)
+      declare <vscale x 4 x f32> @llvm.vector.partial.reduce.fadd.nxv4f32.nxv8f32(<vscale x 4 x f32> %a, <vscale x 8 x f32> %b)
+
+Overview:
+"""""""""
+
+The '``llvm.vector.partial.reduce.fadd.*``' intrinsics reduce the
+concatenation of the two vector arguments down to the number of elements of the
+result vector type.
+
+Arguments:
+""""""""""
+
+The first argument is a floating-point vector with the same type as the result.
+
+The second argument is a vector with a length that is a known integer multiple
+of the result's type, while maintaining the same element type.
+
+Semantics:
+""""""""""
+
+Other than the reduction operator (e.g. fadd) the way in which the concatenated
+arguments is reduced is entirely unspecified. By their nature these intrinsics
+are not expected to be useful in isolation but instead implement the first phase
+of an overall reduction operation.
+
+The typical use case is loop vectorization where reductions are split into an
+in-loop phase, where maintaining an unordered vector result is important for
+performance, and an out-of-loop phase to calculate the final scalar result.
+
+By avoiding the introduction of new ordering constraints, these intrinsics
+enhance the ability to leverage a target's accumulation instructions.
----------------
paulwalker-arm wrote:

Using the intrinsic "is" the optimisation. Today LoopVectorize has to settle for a poor IR representation of its vector reductions, and that "does" hinder optimisation. The new intrinsic lets LoopVectorize express its true intent without introducing unnecessary requirements. Once the IR is fully expressive, there is nothing stopping later transformations from applying ordering restrictions if there is a benefit in doing so. I'm just not sure such ordered variants can perform well across all targets, whereas the unordered variant is easily implemented as normal fadd or reduce.fadd operations.
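
To illustrate the "easily implemented" point, here is a minimal sketch of one legal expansion for the `v4f32.v8f32` case, using only shuffles and ordinary fadds. The function name is made up for the example, and the choice of which lanes get added together is arbitrary, since the semantics leave that unspecified:

```llvm
; Hypothetical expansion of a partial fadd reduction of <8 x float> into a
; <4 x float> accumulator. Any other pairing of lanes would be equally valid.
define <4 x float> @partial_reduce_fadd_expanded(<4 x float> %acc, <8 x float> %in) {
  ; Split the wide input into its low and high halves.
  %lo = shufflevector <8 x float> %in, <8 x float> poison, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %hi = shufflevector <8 x float> %in, <8 x float> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
  ; Fold the halves together, then into the accumulator, lane by lane.
  %sum = fadd reassoc <4 x float> %lo, %hi
  %res = fadd reassoc <4 x float> %acc, %sum
  ret <4 x float> %res
}
```

A target with a suitable accumulation instruction is free to pick a different lane pairing, which is the whole point of leaving the order unspecified.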

That said, I do think the intrinsic should accept fast-math flags like `llvm.vector.reduce.fadd` does, except that here we would mandate (and verify) the presence of `reassoc`. That would make the intrinsic's intent clearer and give us the option of adding a variant that uses a defined order in the future.
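
As a sketch of what I have in mind (this is the suggestion, not what the patch currently implements), a call site would look like the following, with the verifier rejecting the call if `reassoc` were missing. The wrapper function name is only for illustration:

```llvm
declare <4 x float> @llvm.vector.partial.reduce.fadd.v4f32.v8f32(<4 x float>, <8 x float>)

define <4 x float> @accumulate(<4 x float> %acc, <8 x float> %in) {
  ; Under this suggestion `reassoc` would be mandatory on the call.
  %res = call reassoc <4 x float> @llvm.vector.partial.reduce.fadd.v4f32.v8f32(<4 x float> %acc, <8 x float> %in)
  ret <4 x float> %res
}
```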

https://github.com/llvm/llvm-project/pull/159776