[llvm-dev] RFC: Generic IR reductions

Tue Jan 31 12:19:00 PST 2017

Hi Amara,

We also had some discussions on the SVE side of reductions on the main
SVE thread, but this description is much more detailed than we had
before.

I don't want to discuss specifically about SVE, as the spec is not out
yet, but I think we can cover a lot of ground until very close to SVE
and do the final step when we get there.

On 31 January 2017 at 17:27, Amara Emerson via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>         1) As our vector lengths are unknown at compile time (and also not a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize them. SVE also contains strict ordering FP reductions, which cannot be done using the tree-reduction.

Not SVE specific, for example fast-math.

>         2) That we can use reduction intrinsics to implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false} holds. Without any ability to express reductions as intrinsics, we need a separate instruction for the same reasons as described for regular reductions.

We should leave this for later, as this is very SVE-specific.

>         3) Other, non-vectorizer, users or LLVM that may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that they're matched to efficient instructions later.

This is a minor point, since bloated IR is "efficient" if it collapses
into a small number of instructions, for example dozens of shuffles
into a ZIP.

The argument that the intrinsic is harder to destroy through
optimisation passes is the same as other cases of stiff rich semantics
vs. generic pattern matching, so orthogonal to this issue.

> We propose to introduce the following reduction intrinsics as a starting point:
> int_vector_reduce_add(vector_src)

Is this C intrinsic? Shouldn't an IR builtin be something like:

@llvm.vector.reduce.add(...) ?

> These intrinsics do not do any type promotion of the scalar result. Architectures like SVE which can do type promotion and reduction in a single instruction can pattern match the promote->reduce sequence.

Yup.

> ...
> int_vector_reduce_fmax(vector_src, i32 NoNaNs)

A large list, and probably doesn't even cover all SVE can do, let
alone other reductions.

Why not simplify this into something like:

  %sum = add <N x float>, <N x float> %a, <N x float> %b
  %red = @llvm.reduce(%sum, float %acc)
or
  %fast_red = @llvm.reduce(%sum)

For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

> We have multiple options for expressing vector predication in reductions:
>         1. The first is to simply add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.
>         2. The second option is to mask out inactive lanes in the input vector by using a select between the input, and a vector splat of the reduction's identity values, e.g. 0.0 for fp-add.
>
> We believe option 2 will be sufficient for predication capable architectures, while keeping the intrinsics definitions simple and minimal. If there are targets for which the identity value for a reduction is different, then we could use an IR constant to express this in a generic way.

I agree. I haven't followed Elena's work on the similar concept for
Intel, but I vaguely remember we reached a similar conclusion for
AVX512.

cheers,
--renato