[llvm-dev] RFC: Generic IR reductions

Wed Feb 1 12:52:17 PST 2017

On 1 February 2017 at 20:27, Saito, Hideki <hideki.saito at intel.com> wrote:
> Main problem for SVE: We can't write straight-line IR instruction sequence for reduction last value compute, without
> knowing #elements in vector to start from.

Hi Saito,

Indeed, I think there's at least consensus that, for scalable vectors,
there's no other way around.

> For non-scalable vector, series of "reduce to half" (or any other straight-line IR instruction sequence) is functionally
> working, but certainly not ideal (ugly, optimal sequence is target dependent, artificially inflating the outgoing IL
> for any optimizers not interested in optimizing reduction last value compute, that translates to longer compile time,
> etc.)

I agree it's not pretty, but IR doesn't have to be pretty. IR is a
compiler intermediate language, not a user-facing language. So, if the
representation is accurate and the compile time costs are within
noise, there are no reasons to change.

Now, you touched one subject that is worth considering: binary
reduction is target specific.

I don't know AVX, AltiVec and others too well, but I assumed there are
only two types of reduction strategies: ordered and binary reduction.
The first because of precision and the second because it's the most
logical generic step: you only need one instruction with multiple
register sizes to do any kind of reduction, and that reduces the
complexity of the ISA.

So if there are non-ordered / non-binary reductions, the
representation is, indeed, target specific, and that's a very negative
point for it. But, this only raises the case for non-ordered /
non-binary reductions to gain a builtin, not to move everything into a
builtin.

> So, it's not like we don't have reasons for asking to change the status quo.

I apologise if that's what got across. I'm not trying to shut anything
down, I'm just reiterating the reasons that I have received (and
given) in the past with regards to IR changes. We have done this for
NEON, stride access, etc. and it has worked well in general, with
exposed optimisations that we wouldn't have gotten with a builtin.

> Here's what I'd like to see at the end of this discussion.
>      Nice and concise representation of reducing a vector into a scalar value.
>      An IR instruction to do so is ideal, but I understand that the hurdle for that is very high.
>      I'm okay with an intrinsic function call, and I heard that's a reasonable step to get to instruction.

Yes. If we have an intrinsic that can be built in a generic way,
that's half-way towards an instruction. That's what I was trying to do
on my first reply, when I suggested the @llvm.reduce() builtin.

But it seems impractical to have a single one. It'll most likely be a
family: @llvm.reduce.op.typein.typeout.ordered?(%vector), with op
being add/sub/max/min/etc, type in the vector type, type out if
different for widen/shorten and ordered if FP precision is important.
It'll be harder to transform this into an instruction (mainly because
of the op), but it can be done, just need time and community support.

>      Let's say someone comes up with 1024bit vector working on char data. Nobody is really happy to see
>      a sequence of "reduce to half" for 128 elements. Today, with AVX512BW, we already have the problem
>      of half that size (only a few instructions less). I don't think anything that is proportional to "LOG(#elems)"
>      is "nice and concise".

I agree it would be ugly, but I want to make sure we're clear that
this is mostly irrelevant, unless it impacts compile time or the
complexity of pattern matching beyond reasonable.

I would expect larger vectors to have longer patterns, but I wouldn't
oppose to an intrinsic for 1024 bit vector reduction. :)

> Such a representation is also useful inside of vectorized loop if the programmer wants bitwise identical
> FP reduction value (obviously at the much slower speed), without losing the vectorization of rest of
> the loop body. So, this isn't just "outside the vectorized loop" stuff.

I'm not sure what you mean by this. Do you mean that
@llvm.reduce.add.ordered(...) would still be useful scalarised?

We can get the same ordered behaviour by introducing instruction
dependencies (%sum = %a + %b -> %sum = %um + %c, etc).

Also, we'd have to teach the middle end or all back-ends to lower that
intrinsic, even if the target has no concept of reduction in its ISA,
in the case it does appear from some other pass.

Wouldn't that be risky?

> Now, can we focus on the value of "nice and concise representation"? If the community consensus
> is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small
> number of IR instructions (I don't know how, but someone might have brilliant ideas), etc., and further dig
> deeper (e.g., IR instruction for each reduction operator?, whether the operator specification is part of
> operand or part of intrinsic name?, etc). If the community consensus is "No", we can stop talking about
> details right away.

As I said, I don't see why we need to have a concise representation at
all, unless it impacts compile time or pattern match complexity
(maintainability), or it's the only way (ex. SVE).

So I vote "it depends". :)

cheers,
--renato