[llvm] [IR][LangRef] Add partial reduction add intrinsic (PR #94499)

Thu Jun 6 02:01:17 PDT 2024

================
@@ -19209,6 +19209,35 @@ will be on any later loop iteration.
 This intrinsic will only return 0 if the input count is also 0. A non-zero input
 count will produce a non-zero result.
 
+'``llvm.experimental.vector.partial.reduce.add.*``' Intrinsic
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Syntax:
+"""""""
+This is an overloaded intrinsic.
+
+::
+
+      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v2i32.v8i32(<8 x i32> %in)
+      declare <4 x i32> @llvm.experimental.vector.partial.reduce.add.v4i32.v16i32(<16 x i32> %in)
+      declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv2i32.nxv8i32(<vscale x 8 x i32> %in)
+      declare <vscale x 4 x i32> @llvm.experimental.vector.partial.reduce.add.nxv4i32.nxv16i32(<vscale x 16 x i32> %in)
+
+Overview:
+"""""""""
+
+The '``llvm.vector.experimental.partial.reduce.add.*``' intrinsics do an integer
+``ADD`` reduction of subvectors within a vector, returning each scalar result as
+a lane within a vector. The return type is a vector type with an
+element-type of the vector input and a width a factor of the vector input
+(typically either half or quarter).
----------------
davemgreen wrote:

I haven't been involved in defining these intrinsics internally, but I have thought about how they might work before. I'm not sure whether it is better to have a generic partial reduction like this or something more specific to dotprod that includes the zext/sext and mul. Both have advantages and disadvantages. The more instructions there are, the harder they are to cost-model well, but the more can be done with them.

But it would seem that we should be defining _how_ these are expected to reduce the inputs into the output lanes. Otherwise the definition is a bit wishy-washy in a way that can make them more difficult to use than necessary. I would expect them to perform pair-wise reductions, and it might be simpler if they were limited to powers of 2 so that they can deinterleave in steps.
https://godbolt.org/z/G737aj1n6

The codegen that currently exists doesn't seem to do that though.
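
To make the ambiguity concrete, here is a rough Python model of two lane mappings that both fit the current wording. It is only a sketch: the function names are made up for illustration, and the strided variant is just one other mapping an implementation could plausibly pick, not a description of what the patch's codegen does.

```python
def reduce_pairwise(vec, out_lanes):
    """Adjacent grouping: output lane i sums a contiguous run of input lanes."""
    assert len(vec) % out_lanes == 0
    group = len(vec) // out_lanes
    return [sum(vec[i * group:(i + 1) * group]) for i in range(out_lanes)]


def reduce_strided(vec, out_lanes):
    """Strided grouping: output lane i sums input lanes i, i + out_lanes, ..."""
    assert len(vec) % out_lanes == 0
    return [sum(vec[i::out_lanes]) for i in range(out_lanes)]


def reduce_deinterleave_steps(vec, out_lanes):
    """Pairwise grouping built from repeated deinterleave-and-add halving steps.

    Assumes the input/output lane ratio is a power of 2.
    """
    ratio = len(vec) // out_lanes
    assert len(vec) % out_lanes == 0 and ratio & (ratio - 1) == 0
    while len(vec) > out_lanes:
        evens, odds = vec[0::2], vec[1::2]  # deinterleave into two halves
        vec = [a + b for a, b in zip(evens, odds)]  # add adjacent pairs
    return vec


lanes = [1, 2, 3, 4, 5, 6, 7, 8]
print(reduce_pairwise(lanes, 4))            # [3, 7, 11, 15]
print(reduce_strided(lanes, 4))             # [6, 8, 10, 12]
print(reduce_deinterleave_steps(lanes, 4))  # [3, 7, 11, 15]
```

Both mappings give the same total once the result is fully reduced, so an accumulate-then-reduce loop would not notice the difference, but anything that reads individual output lanes would. Limiting the ratio to a power of 2 is what lets the pairwise form be computed as a sequence of halving steps.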

https://github.com/llvm/llvm-project/pull/94499