[llvm-dev] RFC: Generic IR reductions

Wed Feb 1 17:06:13 PST 2017

Thanks for the summary, some more comments inline.

On 1 February 2017 at 22:02, Renato Golin <renato.golin at linaro.org> wrote:
> On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
>> I think we are converging enough at the detail level, but having a big
>> difference in the opinions at the "vision" level.  :)
>
> Vision is always harder than engineering. :)
>
> So, I have copied a lot more people now, to make sure we get a wider
> audience. I want to give them a chance to chime in early enough.
> Folks, feel free to copy more people if you think they need to have an
> opinion.
>
> Here's a summary:
>
> Problem #1: Reductions are common, but:
>  * the code generated is large (log(lanes) operations + extracts) and
> will get "too large" with 1024/2048 bit vectors
>  * this may cause compile time slow downs (we're not sure, need data),
> and increased complexity in pattern-matching description for the
> back-end
>  * the semantics can be target-specific (ordered / non-ordered /
> non-binary reduction), but I don't know of any target that is not
> ordered/binary.
>  * constant propagation can hide the semantics half-way through, and
> the back-end loses the ability to match
Unless I'm mistaken I think the constant propagation issue was with
the proposal to split reduction into two stages, where the reduce step
depends on the instruction used to produce the input value. This is
unrelated to the current shuffle representation.
>
> Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
>  * can't generate binary extracts + ops (to mean unordered / aka
> fast), as we don't know the number of lanes
>  * we'll have to represent ordered reduction as a loop based on the
> (unknown at compile time) "compile time constant" that represents the
> vector length
>
> So, Amara's proposal converged to solve all those problems with a new
> intrinsic similar to:
>
>   %red = type-out @llvm.reduce.op.vec-type.type-out.ordered(<vec-ty> %vector)
>
> With:
>  * "op" one of { add, sub, mul, max, min, etc } x { '', 'f' } for
> integer / float
>  * vec-ty the input type, which can be a scalable type in the future
> (<n x N x type>).
>  * type-out not necessarily equal to the lane type in vec-ty, to allow
> for widening/shotening or FP to int conversions, it supported.
I didn't include this in my original proposal, although I'm not
particularly opposed to it. In the spirit of keeping the IR as simple
as possible, the widening/narrowing behaviour doesn't need to be
bundled into the reduction. A promotion can be a separate ext of the
input and I think it should be easily pattern matched.
>  * "ordered" some flag meaning FP precision is a concern (could be
> "fast" meaning the opposite, doesn't matter)
>
> The solution simplifies a lot the loop vectoriser, as well as the SLP
> and, as Amara said, could even be used by front-ends or other passes
> when the semantics is guaranteed (Haskell?).
>
> However, it also incur in the same issue we had with the other vector
> intrinsics: they stiff the IR, making other optimisations discard
> information, avoid valid transformations or even plain bad code-gen.
>
> So far, on the specific topic for reductions, the only case I could
> think of that we'd miss optimisations is when we have two compatible
> reductions and we want to merge them. For example, NEON can
> horizontally reduce one or two vectors at the same time, or we could
> have three strided reductions and then combine them all into one after
> inlining.
My implementation of this for AArch64 NEON (I've not looked at AArch32
yet) converts the generic reduce ISD nodes to AArch64 specific ones
with an extract. It doesn't try to do anything else, so once it goes
past the original tree-shuffle matching code, the DAG should be
identical.
>
> What stops us from doing so with intrinsics is just the knowledge, so
> we trade complexity in the back-end to match long patterns for
> complexity in optimisation passes to know about the new intrinsics.
>
> The arguments were:
>
> Pro intrinsic:
>  * Simpler and shorter IR
>  * Easier to read, easier for the vectorisers to generate
>  * Easier for the back-end to match
>  * Easier transition towards scalable vectors in the future
Also want to re-iterate that reductions are useful for testing bits of
predicate vectors, which can be applicable to other targets than SVE
for things like early-exit vectorization. Simon mentioned this to me
off-list. Simon, could you comment here if this proposal would work
for your needs?
> Cons:
>  * Already existing patterns work and can expose further opportunities
>  * We may have to change more middle-end passes / back-ends than would
> be affected
>
> I really have no strong opinion either way. Both are challenging in
> their own rights, and this is worth doing anyway.
>
> Did I miss something?
>
> cheers,
> --renato