[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics
Sander De Smalen via llvm-dev
llvm-dev at lists.llvm.org
Thu Apr 4 09:07:28 PDT 2019
The reason for the asymmetry, and for requiring an explicit start-value operand, is to be able to describe strict reductions that must preserve the same associativity as a scalarized reduction. For example:
%res = call float @llvm.experimental.vector.reduce.fadd(float %start, <4 x float> <float %elt0, float %elt1, float %elt2, float %elt3>)
describes the following reduction:
%res = (((%start + %elt0) + %elt1) + %elt2) + %elt3
Whereas the two-instruction sequence:
%tmp = call float @llvm.experimental.vector.reduce.fadd(<4 x float> <float %elt0, float %elt1, float %elt2, float %elt3>)
%res = fadd float %start, %tmp
describes:
%res = %start + (((%elt0 + %elt1) + %elt2) + %elt3)
These two expressions are not the same, which is why the start operand is needed in the intrinsic itself. For fast-math (specifically the 'reassoc' property) the compiler is free to reassociate the expression, so a separate start/accumulator operand isn't needed.
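The difference between the two associations is observable with ordinary IEEE-754 doubles. A minimal Python sketch (illustrative only; the element values are chosen here to make rounding visible, and are not from the original example):

```python
import functools

# Elements chosen so that intermediate rounding makes the two
# associations produce different results in double precision.
elts = [1e16, 1.0, -1e16, 1.0]
start = 1.0

# Strict reduction: (((start + elt0) + elt1) + elt2) + elt3
strict = functools.reduce(lambda acc, x: acc + x, elts, start)

# Reassociated form: start + (((elt0 + elt1) + elt2) + elt3)
reassoc = start + functools.reduce(lambda acc, x: acc + x, elts)

# strict folds the 1.0 start value into 1e16 first, where it is
# rounded away; reassoc adds it after the large terms cancel.
print(strict, reassoc)  # 1.0 2.0
```

This is exactly why the strict intrinsic must carry the start value itself: splitting the reduction into a separate fadd changes which additions round together.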
On 04/04/2019, 16:44, "David Greene" <dag at cray.com> wrote:
Sander De Smalen via llvm-dev <llvm-dev at lists.llvm.org> writes:
> This means that for example:
> %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)
> does not result in %res being 'undef', but rather a reduction of <4 x
> float> %v. The definitions of these intrinsics differ from their
> corresponding SelectionDAG nodes which explicitly split out a
> non-strict VECREDUCE_FADD that explicitly does not take a start-value
> operand, and a VECREDUCE_STRICT_FADD which does.
This seems very strange to me. What was the rationale for ignoring the
first argument? What was the rationale for the first argument existing
at all? Because that's how SVE reductions work? The asymmetry with
llvm.experimental.vector.reduce.add is odd.
> [Option B] Having separate ordered and unordered intrinsics (https://reviews.llvm.org/D60262).
> declare float @llvm.experimental.vector.reduce.ordered.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
> declare float @llvm.experimental.vector.reduce.unordered.fadd.f32.v4f32(<4 x float> %vec)
> This will mean that the behaviour is explicit from the intrinsic and
> the use of 'fast' or 'reassoc' on the call has no effect on how that
> intrinsic is lowered. The ordered reduction intrinsic will take a
> scalar start-value operand, where the unordered reduction intrinsic
> will only take a vector operand.
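The contract of the two proposed intrinsics can be modeled roughly as follows (a Python sketch under the assumption that the unordered form may pick any association, here a pairwise tree; function names are illustrative, not LLVM API):

```python
import functools

def reduce_ordered_fadd(start_value, vec):
    # Ordered intrinsic: strict left-to-right fold,
    # (((start + v0) + v1) + v2) + ...
    return functools.reduce(lambda acc, x: acc + x, vec, start_value)

def reduce_unordered_fadd(vec):
    # Unordered intrinsic: association is unspecified; a pairwise
    # tree reduction is one legal lowering among many.
    while len(vec) > 1:
        pairs = [vec[i] + vec[i + 1] for i in range(0, len(vec) - 1, 2)]
        if len(vec) % 2:
            pairs.append(vec[-1])
        vec = pairs
    return vec[0]
```

With this split, the semantics are fixed by which intrinsic the frontend emits, rather than by fast-math flags on the call site.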
This seems by far the better solution. I'd much rather have things be
explicit in the IR than implicit via flags that might accidentally get
dropped.
Again, the asymmetry between these (one with a start value and one
without) seems strange and arbitrary. Why do we need start values at
all? Is it really difficult for isel to match s + vector.reduce(v)?