[llvm-dev] [RFC] Changes to llvm.experimental.vector.reduce intrinsics

Robin Kruppe via llvm-dev llvm-dev at lists.llvm.org
Thu Apr 4 14:26:05 PDT 2019


Hi Sander,

thank you for pushing these intrinsics forward. I am not familiar
enough with the details to say whether anything else might be missing
for them to become less experimental, but the changes you propose all
look like clear-cut improvements to me.

On Thu, 4 Apr 2019 at 15:11, Sander De Smalen via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> Hi all,
>
>
>
> While working on a patch to improve codegen for fadd reductions on AArch64, I stumbled upon an outstanding issue with the experimental.vector.reduce.fadd/fmul intrinsics where the accumulator argument is ignored when fast-math is set on the intrinsic call. This behaviour is described in the LangRef (https://www.llvm.org/docs/LangRef.html#id1905) and is mentioned in https://bugs.llvm.org/show_bug.cgi?id=36734 and further discussed in D45336 and D59356.
>
>
>
> This means that for example:
>
>     %res = call fast float @llvm.experimental.vector.reduce.fadd.f32.v4f32(float undef, <4 x float> %v)
>
>
>
> does not result in %res being 'undef', but rather a reduction of <4 x float> %v. The definitions of these intrinsics differ from their corresponding SelectionDAG nodes, which explicitly split out a non-strict VECREDUCE_FADD that does not take a start-value operand and a VECREDUCE_STRICT_FADD that does.
>
>
>
> With the vector reduction intrinsics still experimental, I would like to propose to change this behaviour. I would also like to take this opportunity to ask what other changes are required, so that we can make an effort towards dropping the 'experimental' prefix at some point.

Big +1 for removing this pitfall. It's a very counterintuitive design
IMO and is partially responsible for some unnecessarily complex code
in the Rust compiler, so I'd be happy to see it go.

>
>
>
> Proposed change:
>
> ----------------------------
>
> In this RFC I propose changing the intrinsics for llvm.experimental.vector.reduce.fadd and llvm.experimental.vector.reduce.fmul (see options A and B). I also propose renaming the 'accumulator' operand to 'start value', because for fmul this is the start value of the reduction rather than a value into which the fmul reduction is accumulated.
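
The rename seems right to me: for fmul the operand seeds the product rather than accumulating into it. A rough sketch of the semantics (using the current intrinsic name; %s and %v are illustrative):

```llvm
; Ordered fmul reduction over <4 x float> %v with start value %s:
;   %r = (((%s * v0) * v1) * v2) * v3
%r = call float @llvm.experimental.vector.reduce.fmul.f32.v4f32(float %s, <4 x float> %v)
```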
>
>
>
> [Option A] Always using the start value operand in the reduction (https://reviews.llvm.org/D60261)
>
>
>
>   declare float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start_value, <4 x float> %vec)
>
>
>
> This means that if the start value is 'undef', the result will be undef and all code creating such a reduction will need to ensure it has a sensible start value (e.g. 0.0 for fadd, 1.0 for fmul). When using 'fast' or 'reassoc' on the call it will be implemented using an unordered reduction, otherwise it will be implemented with an ordered reduction. Note that a new intrinsic is required to capture the new semantics. In this proposal the intrinsic is prefixed with a 'v2' for the time being, with the expectation this will be dropped when we remove 'experimental' from the reduction intrinsics in the future.
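
If I'm reading Option A right, both forms would then honour the start value; a sketch using the proposed 'v2' name (operand names are illustrative):

```llvm
; Ordered (strict) reduction:
;   %res = (((%start + v0) + v1) + v2) + v3
%res = call float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start, <4 x float> %vec)

; With 'reassoc' (or 'fast') the reduction may be reassociated freely,
; but the start value still participates in the result.
%res2 = call reassoc float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float %start, <4 x float> %vec)
```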

I like that this option will allow some code (e.g., frontends,
constant folding, certain instcombines) to treat unordered reductions
as essentially the same as ordered ones without having to go out of its
way to cover two different intrinsics with slightly different argument
lists.

Cheers,
Robin

