RFC: min/max/abs IR intrinsics

Tue Apr 28 10:22:26 PDT 2015

Hi Renato,

I actually think the discussion about indexed/strided/masked loads is completely different from this one: indexed/strided/masked loads are fundamentally a low-level, hardware-influenced representation, so any discussion about them is fundamentally a lowering discussion.  What we are discussing here WRT min/max/neg/abs is constructs that are present in the user’s source in many programming languages; LLVM today discards that information in a way that inhibits matching those user constructs directly to their hardware implementations.
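
As a rough illustration of the source constructs in question, here is a minimal sketch in C. The function names are hypothetical and do not come from this thread. Each one spells min, max, or abs as a compare plus a conditional, which is the shape that ends up as a compare followed by a "select" in the IR, rather than as a single min/max/abs operation:

```c
/* Hypothetical helper names, for illustration only. */

/* max written as a compare plus a conditional */
static int max_select(int a, int b) { return a > b ? a : b; }

/* min written as a compare plus a conditional */
static int min_select(int a, int b) { return a < b ? a : b; }

/* abs written as a compare, a negate, and a conditional.
   Note: undefined for INT_MIN, as with the C library abs(). */
static int abs_select(int x) { return x < 0 ? -x : x; }
```

Whether the optimizer keeps this compare-and-select shape until late in the pipeline, or rewrites it early into a dedicated min/max/abs form, is exactly the early-versus-late matching question debated here.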

This seems like a completely straightforward canonicalization situation to me, and it fits directly into the “ascending/descending” model of canonicalization and lowering: during the early stages of compilation we unify constructs in the IR towards an abstracted, minimally redundant form.  Then, critically, at some point (traditionally when we enter SelectionDAG, though it has been moving earlier for the last couple of years) we have reached “peak” canonicalization and begin breaking the canonical form in favor of target-optimized or target-specific constructs.

—Owen

> On Apr 28, 2015, at 9:31 AM, Renato Golin <renato.golin at linaro.org> wrote:
> 
> On 28 April 2015 at 16:53, James Molloy <james at jamesmolloy.co.uk> wrote:
>>  * Philip Reames favours late matching, where we create intrinsics late in
>> the optimization pipeline (CodeGenPrepare) and use "select" as the canonical
>> form up till that point.
>>  * Owen Anderson favours early matching, using min/max intrinsics as the
>> canonical form through most of the compiler.
>> 
>> Consensus hasn't yet been reached. Thoughts?
> 
> Hi James,
> 
> A similar discussion spawned regarding indexed / strided / masked
> memory access and the risks are the same:
> 
> * Early matching hardens the IR, stopping a lot of optimisations working
> * Late matching allows for scrambled IR (due to unaware
> optimisations), and destroys patterns
> 
> Each one is horrible in its own right, but I'll side with Philip in
> this one, in the same way I think Chandler was right about doing more
> to match complex memory accesses in pure IR, even if the patterns do
> get more complex. My reasons are twofold:
> 
> 1. I'll repeat Philip's words: Where do we stop? How many intrinsics
> are we going to add to the IR until every optimisation pass becomes a
> huge switch with all possible variations? This was the original design
> decision behind not implementing every NEON intrinsic as a builtin
> node, and I still believe Bob Wilson was right back then. It did
> generate better code.
> 
> 2. It's easier to fix the passes that destroy data, even if there are
> many of them, than to add all builtins to all passes in order to
> understand IR. I agree, doing so doesn't scale well, especially if you
> move to a dynamic execution of passes (if the pass manager ever
> supports that), but the alternative doesn't scale at all. It's
> polynomial vs. exponential. Both are bad, but exponential is worse.
> 
> In the end, for the strided loads, Hao decided to try out plain IR,
> shuffles and loads/stores. Elena will try too, for masked and indexed
> loads, and only as a last resort, we'll add those intrinsics. There
> were some added, and if possible, we should remove them if we succeed
> in matching enough patterns with just IR.
> 
> I think we should do the same in this case.
> 
> cheers,
> --renato