[llvm-dev] RFC: Generic IR reductions

Wed Feb 1 00:27:28 PST 2017

+Elena, as she has done a lot of work for AVX512, which has similar concepts.

On 31 January 2017 at 23:16, Amara Emerson <amara.emerson at gmail.com> wrote:
>> Not SVE specific, for example fast-math.
>
> Can you explain what you mean here? Other targets may well have
> ordered reductions, I can't comment on that aspect. However, fast-math
> vectorization is a case where you *don't* want ordered reductions, as
> the relaxed fp-contract means that the conventional tree reduction
> algorithm preserves the required semantics.

That's my point. Fast-math can change the target's semantics regarding
reductions, independently of scalable vectors, so it's worth
discussing in a more general case, i.e. in this thread.
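
To make the ordering question concrete, here is a minimal sketch in Python (not LLVM code; the names `ordered_reduce` and `tree_reduce` are made up for illustration). The first function folds the lanes strictly left to right from an accumulator, as scalar FP semantics require. The second halves the vector repeatedly, which is the conventional tree reduction and is only a legal replacement when reassociation is allowed, as under fast-math.

```python
def ordered_reduce(lanes, acc=0.0):
    """Strict in-order sum: ((acc + v0) + v1) + ... with no reassociation."""
    for v in lanes:
        acc = acc + v
    return acc


def tree_reduce(lanes):
    """Pairwise sum: add the upper half onto the lower half until one lane is left.

    Assumes a power-of-two lane count, as in a fixed-width SIMD register.
    """
    lanes = list(lanes)
    assert lanes and len(lanes) & (len(lanes) - 1) == 0
    while len(lanes) > 1:
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
    return lanes[0]


# Illustrative input chosen so that rounding makes the two orders disagree.
v = [1e16, 1.0, -1e16, 1.0]
print(ordered_reduce(v))  # 1.0
print(tree_reduce(v))     # 2.0
```

The two results differ because 1e16 + 1.0 rounds back to 1e16 in double precision, so the in-order sum loses one of the 1.0 terms while the tree pairs the large values first. Which answer is acceptable is a property of the fast-math flags, not of the vector length.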

Sorry for being terse.

> The hassle of generating reductions may well be at most a minor
> motivator, but my point still stands. If a front-end wants the target
> to be able to generate the best code for a reduction idiom, they must
> generate a lot of IR for many-element vectors. You still have to pay
> the price in bloated IR, see the tests changed as part of the AArch64
> NEON patch.

This is a completely orthogonal discussion, as I stated here:

>> The argument that the intrinsic is harder to destroy through
>> optimisation passes is the same as other cases of stiff rich semantics
>> vs. generic pattern matching, so orthogonal to this issue.

We have had that discussion multiple times, and the usual consensus is:
if it can be represented in plain IR, it must be. Adding multiple semantics for
the same concept, especially stiff ones like builtins, adds complexity
to the optimiser.

Regardless of the merits in this case, builtins should be introduced
only if there is no other way. So first we should discuss
adding it to IR with generic concepts, just like we did for
scatter/gather and strided access.

>> Why not simplify this into something like:
>>
>>   %sum = add <N x float>, <N x float> %a, <N x float> %b
>>   %red = @llvm.reduce(%sum, float %acc)
>> or
>>   %fast_red = @llvm.reduce(%sum)
>
> Because the semantics of an operation would not depend solely on the
> operand value types and operation, but on a chain of computations
> forming the operands. If the input operand is a phi, you then have to
> do potentially inter-block analysis. If it's a function parameter or
> simply a load from memory then you're pretty much stuck and you can't
> resolve the semantics.

I think you have just described the pattern matching algorithm,
meaning it's possible to write that in a sequence of IR instructions,
thus using add+reduce should work. Same with pointer types and other
reduction operations.

If the argument comes from a function parameter that is a non-strict
pointer to memory, then all bets are off anyway and the front-end
wouldn't be able to generate anything more specific, unless you're
using SIMD intrinsics, in which case this point is moot.
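
As a rough illustration of the matching described above, here is a toy sketch in Python. It is neither an existing LLVM API nor part of the proposal: the `Value` class, the `ELEMENTWISE` set and `reduction_kind` are all hypothetical. The idea is that a generic reduce takes its meaning from the instruction that produced its operand, looking through phis, and gives up when the operand is a parameter or a load.

```python
from dataclasses import dataclass, field


# Toy stand-in for an SSA value; this is not an LLVM data structure.
@dataclass
class Value:
    opcode: str                                   # "add", "minnum", "phi", "load", "param", ...
    operands: list = field(default_factory=list)  # the Values this one is computed from


# Element-wise vector operations a following reduce could take its meaning from.
ELEMENTWISE = {"add", "mul", "and", "or", "xor", "minnum", "maxnum"}


def reduction_kind(v):
    """Return the operation a reduce of `v` would perform, or None if unknown.

    Assumes the operand graph is acyclic (no loop-carried phis) to keep the sketch short.
    """
    if v.opcode in ELEMENTWISE:
        return v.opcode
    if v.opcode == "phi":
        # Look through the phi: every incoming value must resolve to the same operation.
        kinds = {reduction_kind(op) for op in v.operands}
        if len(kinds) == 1:
            return kinds.pop()  # may itself be None if no incoming value resolved
        return None
    # Function parameters, loads and anything else carry no operation to match.
    return None


a, b = Value("param"), Value("param")
total = Value("add", [a, b])
print(reduction_kind(total))                                        # add
print(reduction_kind(Value("phi", [total, Value("add", [b, a])])))  # add
print(reduction_kind(Value("phi", [total, Value("load")])))         # None
print(reduction_kind(a))                                            # None
```

The last two cases are the ones raised in the quoted objection: once the producer is hidden behind a load or a parameter, the sketch has nothing left to match on.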

> During the dev meeting, a reductions proposal where the operation to
> be performed was a kind of opcode was discussed, and rejected by the
> community.

Well, that was certainly a smaller group than the list. Design
decisions should not be taken off list, so we must have this
discussion on the list again, I'm afraid.

> I don't believe having many intrinsics would be a problem.

This is against every decision I remember. Saying it out loud in a
meeting is one thing; writing them down, implementing them, and bearing
the maintenance costs is another entirely.

That's why the consensus has to happen on the list.

>> For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?
>
> For the same reasons that we don't re-use the other binary operator
> instructions like add, sub, mul. The vector versions of those are not
> horizontal operations, they instead produce vector results.

Sorry, I meant min/max + reduce, just like above.

  %sum = add <N x float>, <N x float> %a, <N x float> %b
  %min = @llvm.minnum(<N x float> %sum)
  %red = @llvm.reduce(%min, float %acc)

cheers,
--renato