[llvm-dev] RFC: Generic IR reductions

Wed Feb 1 12:27:39 PST 2017

Hi All,

Renato wrote:
>As they say: if it's not broken, don't fix it.
>Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

The main problem for SVE: we can't write a straight-line IR instruction sequence that computes the final reduction value without
knowing the number of elements in the vector up front.

For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction sequence) is functionally
correct but certainly not ideal: it is ugly, the optimal sequence is target dependent, and it artificially inflates the outgoing IL
for any optimizer not interested in optimizing the final reduction value computation, which translates to longer compile time,
etc.
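
To make the "reduce to half" cost concrete, here is a minimal Python sketch that models the halving sequence on a plain list of lanes. It is only a model of the idea, not LLVM IR and not any target's actual lowering; the name reduce_to_half is hypothetical. The step count depends on the lane count, which is why the sequence cannot be written out when the lane count is unknown at compile time.

```python
import operator


def reduce_to_half(lanes, op):
    """Reduce a power-of-two-length list by repeatedly combining its two halves.

    Returns (scalar_result, number_of_halving_steps).
    """
    n = len(lanes)
    if n == 0 or n & (n - 1):
        raise ValueError("lane count must be a nonzero power of two")
    steps = 0
    while len(lanes) > 1:
        half = len(lanes) // 2
        # One step models moving the high half down and applying one vector op.
        lanes = [op(lanes[i], lanes[i + half]) for i in range(half)]
        steps += 1
    return lanes[0], steps


# 128 lanes models a 1024-bit vector of char; 64 lanes models a 512-bit one.
total_128, steps_128 = reduce_to_half(list(range(128)), operator.add)
total_64, steps_64 = reduce_to_half(list(range(64)), operator.add)
print(steps_128, steps_64)
```

In this model, 128 lanes need 7 halving steps and 64 lanes need 6, so the emitted sequence grows with log2 of the lane count.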

So, it's not like we don't have reasons for asking to change the status quo.

Here's what I'd like to see at the end of this discussion:
     A nice and concise representation of reducing a vector into a scalar value.
     An IR instruction to do so is ideal, but I understand that the hurdle for that is very high.
     I'm okay with an intrinsic function call, and I've heard that's a reasonable step toward an instruction.
     Let's say someone comes up with a 1024-bit vector working on char data. Nobody would be happy to see
     a sequence of "reduce to half" steps for 128 elements. Today, with AVX512BW, we already have the problem
     at half that size (only a few instructions fewer). I don't think anything proportional to "LOG(#elems)"
     is "nice and concise".
Such a representation is also useful inside a vectorized loop if the programmer wants a bitwise-identical
FP reduction value (obviously at a much slower speed) without losing vectorization of the rest of
the loop body. So, this isn't just "outside the vectorized loop" stuff.
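
As a small illustration of why the order matters for FP, the Python sketch below compares a strict in-order sum with a halving (tree) sum over the same four lanes. The function names and the input values are made up for the example, and Python floats are IEEE doubles, so this only models the effect rather than any particular vector ISA.

```python
def ordered_sum(lanes):
    """Add lanes strictly left to right, as a scalar loop would."""
    acc = 0.0
    for x in lanes:
        acc += x
    return acc


def halving_sum(lanes):
    """Add lanes by repeatedly combining the low and high halves."""
    while len(lanes) > 1:
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
    return lanes[0]


# Hypothetical lane values chosen so that rounding differs between orders.
lanes = [1e16, 1.0, -1e16, 1.0]
in_order = ordered_sum(lanes)  # 1e16 + 1.0 rounds back to 1e16, so the first 1.0 is lost
tree = halving_sum(lanes)      # the two large values cancel first, so both 1.0s survive
print(in_order, tree)
```

Here the in-order sum is 1.0 and the halving sum is 2.0, so a reduction that must match the scalar loop bit for bit has to preserve the original order.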

Now, can we focus on the value of a "nice and concise representation"? If the community consensus
is "Yes", we can then talk about how to deliver that vision: one IR instruction, one intrinsic call, a small
number of IR instructions (I don't know how, but someone might have brilliant ideas), etc. We can then dig
deeper (e.g., should there be an IR instruction for each reduction operator? Is the operator specification part of
an operand or part of the intrinsic name?). If the community consensus is "No", we can stop talking about
details right away.

Obviously, I'm voting for YES on "nice and concise representation".

Thanks,
Hideki Saito
Intel Compilers and Languages

------------------------------

Date: Wed, 1 Feb 2017 15:10:36 +0000
From: Renato Golin via llvm-dev <llvm-dev at lists.llvm.org>
To: Amara Emerson <amara.emerson at gmail.com>
Cc: "llvm-dev at lists.llvm.org" <llvm-dev at lists.llvm.org>, nd
	<nd at arm.com>
Subject: Re: [llvm-dev] RFC: Generic IR reductions
Message-ID:
	<CAMSE1keZFe+EGTZ6zu69J+953Y87jnMKKZuHbR-P63sEjE0Ymw at mail.gmail.com>
Content-Type: text/plain; charset=UTF-8

On 1 February 2017 at 14:44, Amara Emerson <amara.emerson at gmail.com> wrote:
> Her point is that the %sum value will no longer be an add node, but 
> simply a constant vector. There is no way to resolve the semantics and 
> have meaningful IR after a simple constant prop. This means that other 
> simple transformations will all have to have special case logic just 
> to handle reductions, for example LCSSA.

Right.

> Can you give a specific example? Reductions are actually very robust 
> to optimizations breaking the idiom, which is why I was able to 
> replace the reductions with intrinsics in my patches and with simple 
> matching generate identical code to before. No other changes were 
> required in the mid-end.

First, this is a demonstration that keeping them as IR is a good thing, not a bad one. We keep the semantics as well as allow for further introspection in the code block.

Examples of introspection are vector widening or instruction fusion after inlining. Say you have a loop with a function call and a reduction, but that call has a reduction of its own. If the function gets inlined after you reduced your loop into a builtin, the optimiser will have no visibility into whether the function's reduction pattern can be merged, either by widening the vectors (saving on loads/stores) or by fusing (e.g. MLA).

We have seen both cases with NEON after using IR instructions for everything but the impossible.

In 2010, I went through a similar discussion with Bob Wilson, who defended the position I'm defending now. And I defend this position today because I have been categorically proven wrong by the results I describe above.

Chandler's arguments are perfectly to the point. Intrinsics are not only necessary when we can't represent things in IR, they're a *good* way of representing odd things.

But if things are not odd (i.e. many targets have them) or if we can already represent them in IR, then it stands to reason that adding duplicated stuff only adds complexity. It increases the maintenance cost (more node types to consider), it increases the chance of missing some of them (and either not optimising or generating bad code), it stops the optimisers that know nothing about it (because it's too new) from doing any inference, and it can actually generate worse code than before (shuffle explosion).

As they say: if it's not broken, don't fix it.

Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

cheers,
--renato