[llvm-dev] RFC: Generic IR reductions

Tue Jan 31 09:27:22 PST 2017

Hi all,

During the Nov 2016 dev meeting, we had a hackers’ lab session where we discussed some issues about loop idiom recognition, IR representation and cost modelling. I took an action to write up an RFC about introducing reduction intrinsics to LLVM to represent horizontal operations across vectors.

Vector reductions have been discussed in the past before, notably here: http://lists.llvm.org/pipermail/llvm-dev/2015-November/092379.html

This was in the context of recognizing the patterns for matching more complex cases like SAD & dot product. These discussions have centered around adding a new vector reduce instruction or annotating the reduction PHI to allow pattern recognition later at codegen. However, these motivations are different to the ones I'm giving here.

Current status
==============

Today scalar loop reductions are recognized in IR through the LoopUtils isReductionPHI utility, and this creates a recurrence descriptor after doing some analysis. The descriptor gives us information about what kind of reduction it is, e.g. int add, int or, fp add, fp mul, minmax etc)

The vectorizer generates vector versions of these as a final reducing step after the vector loop body, doing a tree reduction across all the lanes. This has multiple disadvantages in being complex and difficult to generate for other passes in LLVM and front-ends. The reduction using shuffles then have to be matched in the backend. For ARM’s Scalable Vector Extensions (SVE), there are some more fundamental reasons that requires us to move away from this approach;

	1) As our vector lengths are unknown at compile time (and also not a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize them. SVE also contains strict ordering FP reductions, which cannot be done using the tree-reduction.
	2) That we can use reduction intrinsics to implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false} holds. Without any ability to express reductions as intrinsics, we need a separate instruction for the same reasons as described for regular reductions.
	3) Other, non-vectorizer, users or LLVM that may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree shuffle pattern to ensure that they're matched to efficient instructions later.

Proposal
========

We propose to introduce the following reduction intrinsics as a starting point:
int_vector_reduce_add(vector_src)
int_vector_reduce_fadd(vector_src)
int_vector_reduce_and(vector_src)
int_vector_reduce_or(vector_src)
int_vector_reduce_eor(vector_src)
int_vector_reduce_mul(vector_src)

These intrinsics do not do any type promotion of the scalar result. Architectures like SVE which can do type promotion and reduction in a single instruction can pattern match the promote->reduce sequence.

And for minmax recurrence types, we have:
int_vector_reduce_smax(vector_src)
int_vector_reduce_smin(vector_src)
int_vector_reduce_umax(vector_src)
int_vector_reduce_umin(vector_src)
int_vector_reduce_fmin(vector_src, i32 NoNaNs)
int_vector_reduce_fmax(vector_src, i32 NoNaNs)

These reduction operations can be mapped from the recurrence kinds defined in LoopUtils, however any front-end or pass in LLVM may use them.

Predication
===========

We have multiple options for expressing vector predication in reductions:
	1. The first is to simply add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.
	2. The second option is to mask out inactive lanes in the input vector by using a select between the input, and a vector splat of the reduction's identity values, e.g. 0.0 for fp-add.

We believe option 2 will be sufficient for predication capable architectures, while keeping the intrinsics definitions simple and minimal. If there are targets for which the identity value for a reduction is different, then we could use an IR constant to express this in a generic way.

All the intrinsics above take a single vector operand and reduce to a scalar of the same type as the vector element.

Ordered reductions
===================

SVE introduces the capability to execute reductions while maintaining the sequential order, as a result it can vectorize floating point addition reductions without requiring fast-math to be enabled. This uses a different approach to the fast reductions; instead of doing a partial reduction in the vector loop body and a final reduce to scalar in the loop exit block, we reduce to a scalar on every iteration in the vector loop. This requires an extra scalar accumulator operand to be passed into the reduction intrinsic. We propose a generic intrinsics to handle this as well:

int_vector_reduce_ordered_fadd([float|double] %acc, <vec [float|double]> %input)

We can use the same select-on-input predication scheme with these as we do with the normal reductions.

Prototype patches
=================

I've developed some initial patches as a proof of concept that should allow targets to opt in to using these new reductions with fairly minimal effort. The patches enable the use of generic reduction intrinsics for the AArch64 target, and as far as I can tell don't affect codegen in any way. The AArch64 patch cleans up a lot of the old code which was written specifically to match various forms of the log2 shuffle pattern, and therefore the tests are significantly reduced in size.

The target independent changes introduce a new transitional API documented below, by moving some duplicated code into VectorUtils (to generate shuffle patterns for targets which don't opt-in). Some basic legalization is implemented, but only enough to support the cases required by AArch64, there may be some things I've missed that are exposed by other targets.

Transitional API
================

To support both the old-style tree reduction patterns and the new intrinsics for clients, we can use a generalization of the TTI call we introduce in our SVE patches.

We could provide something like:

bool TargetTransformInfo::useReductionIntrinsic(unsigned Opcode, Type *Ty,
                                                bool IsMaxOp, bool IsSigned,
                                                bool NoNaN)

In the prototype patch, I've created two separate ways to generate reductions from an IR generator's perspective, depending on whether the client has a reduction descriptor (obtained from a LoopUtils analysis of the loop's PHIs). If we want to generate from a descriptor:
Value *llvm::createTargetReduction(IRBuilder<> &Builder,
                                   const TargetTransformInfo *TTI,
                                   RecurrenceDescriptor &Desc, Value *Src,
                                   bool InOrder, bool NoNaN)

Or if no descriptor is given, then an alternative API call:

Value *llvm::createTargetReduction(IRBuilder<> &Builder,
                                   const TargetTransformInfo *TTI,
                                   unsigned Opcode, Value *Src) {

By default, these two interfaces would generate a tree reduction for targets which do not opt-in to the new intrinsic representation.

What does everyone think of this, do these new forms seem reasonable?

Amara

-------------- next part --------------
A non-text attachment was scrubbed...
Name: reductions-enable-aarch64.diff
Type: application/octet-stream
Size: 79807 bytes
Desc: reductions-enable-aarch64.diff
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/dff98280/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reductions-ir.diff
Type: application/octet-stream
Size: 33375 bytes
Desc: reductions-ir.diff
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170131/dff98280/attachment-0003.obj>