[llvm-dev] RFC: Generic IR reductions

Tue Jan 31 15:16:57 PST 2017

+cc Simon, who's also interested in reductions for the any_true and
all_true predicate vectors.

On 31 January 2017 at 20:19, Renato Golin via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Hi Amara,
>
> We also had some discussions on the SVE side of reductions on the main
> SVE thread, but this description is much more detailed than we had
> before.
>
> I don't want to discuss SVE specifically, as the spec is not out
> yet, but I think we can cover a lot of ground until very close to SVE
> and do the final step when we get there.

The goal of this proposal is to agree on a new representation, so we
are looking at more than SVE and improving things for LLVM as a whole.

>
>
> On 31 January 2017 at 17:27, Amara Emerson via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>>         1) As our vector lengths are unknown at compile time (and also not a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize them. SVE also contains strict ordering FP reductions, which cannot be done using the tree-reduction.
>
> Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have
ordered reductions; I can't comment on that aspect. However, fast-math
vectorization is a case where you *don't* want ordered reductions, as
the relaxed FP semantics allow reassociation, so the conventional tree
reduction algorithm preserves the required semantics.
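
To make the ordered-versus-tree distinction concrete, here is a small
Python model. It is not LLVM code: the function names are made up for
illustration, and the tree version assumes a power-of-two lane count.
It shows that the two evaluation orders can give different answers for
the same input, which is why strict FP semantics need the ordered form
while fast-math can use the tree.

```python
# Illustrative model only; Python floats are IEEE doubles.

def ordered_reduce_fadd(acc, lanes):
    """Strictly in-order: ((acc + v0) + v1) + ..."""
    for x in lanes:
        acc = acc + x
    return acc

def tree_reduce_fadd(lanes):
    """Halve the vector each step, adding lane i to lane i + half."""
    lanes = list(lanes)
    while len(lanes) > 1:
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
    return lanes[0]

v = [2.0**53, 1.0, -(2.0**53), 1.0]
ordered = ordered_reduce_fadd(0.0, v)  # 1.0: the first 1.0 is rounded away
tree = tree_reduce_fadd(v)             # 2.0: the large values cancel first
```

Neither result is "wrong"; they are simply different roundings, and
only the ordered one matches what a scalar loop would compute without
fast-math.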

>
>>         2) That we can use reduction intrinsics to implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false} holds. Without any ability to express reductions as intrinsics, we need a separate instruction for the same reasons as described for regular reductions.
>
> We should leave this for later, as this is very SVE-specific.

As I stated earlier, this is going beyond SVE. I hope to achieve a
consensus between other targets as well, even if some don't yet have
efficient ways to handle it.

>
>>         3) Other, non-vectorizer, users of LLVM that may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that they're matched to efficient instructions later.
>
> This is a minor point, since bloated IR is "efficient" if it collapses
> into a small number of instructions, for example dozens of shuffles
> into a ZIP.

The hassle of generating reductions may well be at most a minor
motivator, but my point still stands. If a front-end wants the target
to be able to generate the best code for a reduction idiom, they must
generate a lot of IR for many-element vectors. You still have to pay
the price in bloated IR; see the tests changed as part of the AArch64
NEON patch.
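
As a rough sketch of what a front-end has to emit today, the Python
helper below (hypothetical, written only for this illustration) builds
the text of the shuffle/add tree for a power-of-two integer vector
named %v. The value names and the helper itself are assumptions, not
part of any patch.

```python
def emit_tree_reduce_add(width, elt="i32"):
    """Return IR-like text lines for a shuffle-based add reduction of %v."""
    if width < 2 or (width & (width - 1)) != 0:
        raise ValueError("this sketch handles power-of-two widths only")
    vty = f"<{width} x {elt}>"
    lines, cur, step = [], "%v", 0
    half = width // 2
    while half >= 1:
        # Lanes [0, half) read lanes [half, 2*half); the rest are don't-care.
        mask = [f"i32 {half + i}" if i < half else "i32 undef"
                for i in range(width)]
        lines.append(f"%shuf{step} = shufflevector {vty} {cur}, {vty} undef, "
                     f"<{width} x i32> <{', '.join(mask)}>")
        lines.append(f"%rdx{step} = add {vty} {cur}, %shuf{step}")
        cur, step, half = f"%rdx{step}", step + 1, half // 2
    lines.append(f"%res = extractelement {vty} {cur}, i32 0")
    return lines

ir4 = emit_tree_reduce_add(4)    # 5 instructions
ir16 = emit_tree_reduce_add(16)  # 9 instructions
```

The count is 2*log2(N) + 1 instructions, each shuffle carrying an
N-entry mask, versus a single call if a reduction intrinsic were
available.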

> The argument that the intrinsic is harder to destroy through
> optimisation passes is the same as other cases of stiff rich semantics
> vs. generic pattern matching, so orthogonal to this issue.
>
>
>> We propose to introduce the following reduction intrinsics as a starting point:
>> int_vector_reduce_add(vector_src)
>
> Is this a C intrinsic? Shouldn't an IR builtin be something like:
>
> @llvm.vector.reduce.add(...) ?

You're right, in IR assembly it would appear like that. The ones I
proposed were the tablegen syntax of the intrinsics definitions, so in
practice they would look like @llvm.vector.reduce.[operation] in IR
asm.
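
For anyone unfamiliar with the convention, the mapping from a tablegen
def name to the IR name is mechanical. A minimal Python sketch of it
follows; the helper name is made up, and it ignores any type suffix an
overloaded intrinsic would carry.

```python
def ir_name_from_tablegen(def_name):
    """Map e.g. 'int_vector_reduce_add' to 'llvm.vector.reduce.add'."""
    prefix = "int_"
    if not def_name.startswith(prefix):
        raise ValueError(f"not an intrinsic def name: {def_name}")
    # Replace the 'int_' prefix with 'llvm.' and remaining '_' with '.'.
    return "llvm." + def_name[len(prefix):].replace("_", ".")

add_name = ir_name_from_tablegen("int_vector_reduce_add")
fmax_name = ir_name_from_tablegen("int_vector_reduce_fmax")
```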

>
>> These intrinsics do not do any type promotion of the scalar result. Architectures like SVE which can do type promotion and reduction in a single instruction can pattern match the promote->reduce sequence.
>
> Yup.
>
>
>> ...
>> int_vector_reduce_fmax(vector_src, i32 NoNaNs)
>
> A large list, and probably doesn't even cover all SVE can do, let
> alone other reductions.
>
> Why not simplify this into something like:
>
>   %sum = add <N x float>, <N x float> %a, <N x float> %b
>   %red = @llvm.reduce(%sum, float %acc)
> or
>   %fast_red = @llvm.reduce(%sum)

Because the semantics of an operation would not depend solely on the
operand value types and operation, but on a chain of computations
forming the operands. If the input operand is a phi, you then have to
do potentially inter-block analysis. If it's a function parameter or
simply a load from memory then you're pretty much stuck and you can't
resolve the semantics.

During the dev meeting, a reductions proposal where the operation to
be performed was a kind of opcode was discussed, and rejected by the
community. I don't believe having many intrinsics would be a problem.

> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator
instructions like add, sub, mul. The vector versions of those are not
horizontal operations, they instead produce vector results.
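
A small Python model of the difference, with plain lists standing in
for vectors (the function names are illustrative, not proposed
intrinsics):

```python
def lanewise_max(a, b):
    """Vertical: two vectors in, one vector out, lane i from a[i] and b[i]."""
    return [x if x > y else y for x, y in zip(a, b)]

def reduce_max(v):
    """Horizontal: one vector in, one scalar out."""
    result = v[0]
    for x in v[1:]:
        if x > result:
            result = x
    return result

a = [1.0, 7.0, 3.0, 4.0]
b = [5.0, 2.0, 6.0, 0.0]
vertical = lanewise_max(a, b)  # [5.0, 7.0, 6.0, 4.0], still a vector
horizontal = reduce_max(a)     # 7.0, a scalar
```

The two have different operand counts and different result types, so
overloading one name for both would mean its meaning depends on how it
is called.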

>
>> We have multiple options for expressing vector predication in reductions:
>>         1. The first is to simply add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.
>>         2. The second option is to mask out inactive lanes in the input vector by using a select between the input, and a vector splat of the reduction's identity values, e.g. 0.0 for fp-add.
>>
>> We believe option 2 will be sufficient for predication capable architectures, while keeping the intrinsics definitions simple and minimal. If there are targets for which the identity value for a reduction is different, then we could use an IR constant to express this in a generic way.
>
> I agree. I haven't followed Elena's work on the similar concept for
> Intel, but I vaguely remember we reached a similar conclusion for
> AVX512.
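
To spell out option 2, here is a Python model of the select-then-reduce
form, using 0.0 as the fp-add identity as in the proposal text. The
helpers select, splat and reduce_fadd are stand-ins for the IR
constructs, not real APIs.

```python
def select(mask, on_true, on_false):
    """Per-lane select: take on_true[i] where mask[i] is set."""
    return [t if m else f for m, t, f in zip(mask, on_true, on_false)]

def splat(value, width):
    """Broadcast a scalar to every lane."""
    return [value] * width

def reduce_fadd(v):
    """Unpredicated in-order add reduction over all lanes."""
    acc = 0.0
    for x in v:
        acc += x
    return acc

def predicated_reduce_fadd(mask, v):
    # Inactive lanes are replaced by the identity, so they cannot
    # affect the sum; the reduction itself needs no predicate operand.
    return reduce_fadd(select(mask, v, splat(0.0, len(v))))

v = [1.5, 2.0, 4.0, 8.0]
mask = [True, False, True, False]
total = predicated_reduce_fadd(mask, v)  # 1.5 + 4.0 = 5.5
```

A target with predicated reductions would match the select feeding the
reduction; a target without predication just executes both as written.
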
>
> cheers,
> --renato