[llvm-dev] RFC: Generic IR reductions

Renato Golin via llvm-dev llvm-dev at lists.llvm.org
Wed Feb 1 14:02:50 PST 2017


On 1 February 2017 at 21:22, Saito, Hideki <hideki.saito at intel.com> wrote:
> I think we are converging enough at the detail level, but having a big
> difference in the opinions at the "vision" level.  :)

Vision is always harder than engineering. :)

So, I have copied a lot more people now, to make sure we get a wider
audience. I want to give them a chance to chime in early enough.
Folks, feel free to copy more people if you think they need to have an
opinion.

Here's a summary:

Problem #1: Reductions are common, but:
 * the code generated is large (log(lanes) operations + extracts; see
the sketch after this list) and will get "too large" with
1024/2048-bit vectors
 * this may cause compile-time slowdowns (we're not sure, need data),
and increased complexity in the pattern-matching descriptions for the
back-end
 * the semantics can be target-specific (ordered / non-ordered /
non-binary reduction), but I don't know of any target that is not
ordered/binary.
 * constant propagation can hide the semantics half-way through, and
the back-end loses the ability to match
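
For reference, here's a minimal sketch of that log(lanes) pattern as
it looks today, for a <4 x i32> add reduction (the value names are
illustrative):

  %h1  = shufflevector <4 x i32> %vec, <4 x i32> undef,
                       <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %s1  = add <4 x i32> %vec, %h1
  %h2  = shufflevector <4 x i32> %s1, <4 x i32> undef,
                       <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %s2  = add <4 x i32> %s1, %h2
  %red = extractelement <4 x i32> %s2, i32 0

That's two shuffle+add steps for 4 lanes; a 2048-bit <64 x i32>
vector would need six, plus the extract, and the back-end has to
pattern-match the whole chain.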

Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
 * can't generate the binary extracts + ops (the unordered, aka
"fast", form), as we don't know the number of lanes at compile time
 * we'll have to represent an ordered reduction as a loop over the
(unknown at compile time) "compile-time constant" that represents the
vector length; see the sketch after this list
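
To make that last bullet concrete, here's a hedged sketch of such a
loop-based ordered fadd reduction. The <n x 4 x float> syntax is from
the SVE proposal, and %numelts (the runtime element count) is an
assumed placeholder, not settled IR:

  entry:
    br label %loop
  loop:
    %i   = phi i64   [ 0, %entry ],      [ %i.next, %loop ]
    %acc = phi float [ %start, %entry ], [ %acc.next, %loop ]
    ; strict left-to-right accumulation, to preserve FP semantics
    %lane     = extractelement <n x 4 x float> %vec, i64 %i
    %acc.next = fadd float %acc, %lane
    %i.next   = add i64 %i, 1
    %done     = icmp eq i64 %i.next, %numelts  ; unknown until run time
    br i1 %done, label %exit, label %loop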

So, Amara's proposal converged on solving all those problems with a
new intrinsic similar to:

  %red = call type-out @llvm.reduce.op.vec-ty.type-out.ordered(<vec-ty> %vector)

With:
 * "op" one of { add, sub, mul, max, min, etc } x { '', 'f' } for
integer / float
 * vec-ty the input type, which can be a scalable type in the future
(<n x N x type>).
 * type-out not necessarily equal to the lane type in vec-ty, to allow
for widening/shortening or FP-to-int conversions, if supported.
 * "ordered" some flag meaning FP precision is a concern (could be
"fast" meaning the opposite, doesn't matter)

The solution simplifies the loop vectoriser a lot, as well as the
SLP vectoriser, and, as Amara said, could even be used by front-ends
or other passes when the semantics is guaranteed (Haskell?).

However, it also incurs the same issue we had with the other vector
intrinsics: they stiffen the IR, making other optimisation passes
discard information, avoid valid transformations, or even generate
plain bad code.

So far, on the specific topic of reductions, the only case I could
think of where we'd miss optimisations is when we have two compatible
reductions and we want to merge them. For example, NEON can
horizontally reduce one or two vectors at the same time, or we could
have three strided reductions and then combine them all into one after
inlining.

What stops us from doing so with intrinsics is just knowledge of
them, so we trade complexity in the back-end (matching long patterns)
for complexity in the optimisation passes (knowing about the new
intrinsics).

The arguments were:

Pro intrinsic:
 * Simpler and shorter IR
 * Easier to read, easier for the vectorisers to generate
 * Easier for the back-end to match
 * Easier transition towards scalable vectors in the future

Cons:
 * Already existing patterns work and can expose further opportunities
 * We may have to change more middle-end passes / back-ends than
would otherwise be affected

I really have no strong opinion either way. Both are challenging in
their own right, and this is worth doing anyway.

Did I miss something?

cheers,
--renato

