RFC: min/max/abs IR intrinsics

Mon Apr 27 08:23:01 PDT 2015

Hi all,

Thanks for all the comments! David, I think that was the main hurdle. With
that instcombine fixed, the matching code becomes much simpler. I think
there's only a couple of things to do:

  * Add new intrinsics and SDNodes for [su]{min,max} -
http://reviews.llvm.org/D9293
  * Pull min/max matching out of InstCombine and into somewhere more
accessible - http://reviews.llvm.org/D9294
  * Match min/max patterns in CodeGenPrepare into intrinsics, if the
intrinsic version is not marked "expand". - WIP, still writing tests.
  * Enable min/max generation for AArch64 and other targets - WIP, still
writing tests.

Gerolf, I'm hoping the above will really not take long. If this gets
approved in concept, the actual mechanics are very easy - I have two out of
four patches ready to go at the moment and the rest are implemented, just
needing more testing. Does it make sense to hold off your target-specific
version for a few days and let this one go in instead, barring major
objections?

Cheers,

James

On Mon, 27 Apr 2015 at 02:40 Gerolf Hoflehner <ghoflehner at apple.com> wrote:

> Hi James,
>
> sounds like you have a number of good reasons for the intrinsics. Having
> them would allow a) to catch more complex pattern and b) gain more insight
> into the impact that design on the rest of the compiler. So initially there
> would be a mix of higher level and low level min/max recognition while the
> design evolves and potentially addresses  eg. Phillips concerns. While all
> this is in progress, do you think it is OK to commit the target speciifc
> vmin/vmax patch I sent earlier?
>
> Cheers
> Gerolf
>
> On Apr 26, 2015, at 2:15 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>
> Hi Gerolf,
>
> On Sun, 26 Apr 2015 at 05:10 Gerolf Hoflehner <ghoflehner at apple.com>
> wrote:
>
>> Please see below.
>>
>> On Apr 25, 2015, at 1:44 AM, James Molloy <james at jamesmolloy.co.uk>
>> wrote:
>>
>> Hi,
>>
>> I've also previously gone down the lines of a target-isel approach and
>> had a patch that pretty much worked. I'll explain why I now think this is
>> better done early in the IR. I should say though that I could be convinced
>> to go back to the target-isel approach if there are better reasons :)
>>
>> I’m more curious to hear about your arguments and in particular what
>> performance headroom you foresee. One of our CPU architects had pointed out
>> that we could generate smax/umax etc., which took me longer than it should
>> have. So I thought about doing it at a higher level also, looked around,
>> looked again, scratched my head and relatively quickly talked myself into
>> something  like 'lot of work, little headroom, target level support good
>> enough' :-)
>>
>
> The performance improvements I see and the rationale for this is where you
> have multiple mins/maxs in series. Several benchmarks I have such as image
> filters use min/max to implement clamp(); obviously image filters tend to
> be very small so every extra instruction counts. Similarly another image
> filter uses a ternary min (min(a,b,c)), which ends up as min/min. That
> currently doesn't get matched either.
>
> I can see min/max reductions as being easier to identify (and either
> vectorize or reassociate, both having big speedup potential). Finally and
> perhaps the biggest headroom is in saturation. Matching the idiom clamp(x,
> 0, 255), you can generate a VQMOV. Even better, if the datatype is then
> shrunk as it often is after such a clamp, you end up with one VQMOVN.
> That's 5 instructions -> 1 instruction.
>
> Obviously the effect this would have depends on the domain - image filters
> (such as sharpen from Geekbench) are where this is really going to give
> dividends. But even for non-image-filters, there's still a hygiene factor
> where we really should be generating the right thing even if we don't get a
> massive performance boost.
>
>
>>
>> 1. Matching min/max patterns can be awkward in the presence of constants.
>> Consider the idiomatic max(x, 0) with a conversion after it:
>>
>>   %1 = fcmp ogt float 0.0, float %x
>>   %2 = select i1 %1, float %x, float 0.0
>>   %3 = fptoui float %2 to i32
>>
>> InstCombine will push the fptoui up:
>>
>>   %1 = fcmp ogt float 0.0, float %x
>>   %2 = fptoui float %x to i32
>>   %3 = select i1 %1, i32 %2, i32 0
>>
>> Now, you have a much more complex pattern to match. You have to identify
>> that "i32 0" is the same as "float 0.0" with "fptoui" applied to it. This
>> gets even more complex when you have nested mins and maxes, for example in
>> "clamp(x, 0, 255)". It was this that I was having difficulty efficiently
>> and reliably matching at the ISel level, and what convinced me to move
>> higher up the pipe where at least testing was easier.
>>
>> Now actually, we can do even better in IR and move further still up the
>> pipe, and run before InstCombine. Then, we don't need any complex matching
>> at all and everything is simpler.
>>
>> I think of this as short-coming of inst combine. It could convert the
>> float to int before the cmp resulting in int cmp/select. Since this an
>> example for fp -> int conversion, here is another thought:  It seems that
>> this fmin/max -> imin/max conversion is always the right thing to do, and
>> the IR level is appropriate, but in some cases that depends on the HW. For
>> example, if there are two instances of that pattern one might want to
>> convert one to fp and one to int so both can execute in parallel (assuming
>> the HW provides the proper resources).
>>
>
> Yes, I think that's right, and I was going to control this with a target
> hook. I'm seriously considering whether my pass might not be better written
> as an instcombine rather than a standalone pass... haven't come to any
> conclusions yet.
>
>
>>
>> The issue with complex pattern recognition at the ISEL level: does this
>> point to a general weakness that also effect over pattern, or is this an
>> isolated instance min/max at IR level will take care of?
>>
>> I think it's generally that while there are one or two patterns you'd
> need to match at IR level, because there are more, specialized SDAG nodes
> the problem becomes larger at that point. Also, it just stands to reason
> that if you have a hard-to-match sequence, the longer you keep it around
> unmatched the higher the likelihood of it being destroyed when you want to
> match it.
>
>>
>> 2. As I've mentioned, this approach allows anything using the cost model
>> to be more accurate. This currently is just the vectorizers but I see other
>> users coming along soon like loop distribution and loop fusion.
>>
>> Would be good if you could share examples where the difference matters. I
>> expect we’ll have to look into the cost model much more closely soon.
>>
>> I don't have explicit examples right now. I'm more thinking of hygiene
> factor - making the model more accurate is a good thing regardless of
> immediate benefit. But it should also help the register pressure estimation
> code get a better result.
>
>>
>> 3. This simplifies pattern matching in backends with not just min/max
>> support, but also with explicit "clamp" support such as GPUs. GPUs will
>> often have (almost free) clamp_0_255 instructions or modifiers as this is
>> used extremely often in image filters and shaders.
>>
>> Ok, I take your word for it.
>>
>>
>> 4. Doing this at the IR level is immediately useful for all backends, not
>> just (one of) AArch64 or AArch32.
>>
>> Ok, although I expect all targets for which it matters have that support.
>>
>> Right, although demonstrably it doesn't work at the moment, at least not
> for AArch64 and AArch32. My experiments have shown that X86 isn't fantastic
> either. Isolated mins/maxes, yes. Coupled mins/maxes? no.
>
>>
>> 5. To make sure the clamp() idiom got matched, I upped the amount that
>> SimplifyCFG would speculate (in r229099). This caused some slight
>> controversy - it is still in-tree but there is still the potential of
>> pulling it out if it is found that it was too aggressive. The IR approach
>> means we don't actually need it and can pattern match across basic blocks,
>> which we can't do in ISel (and it'd be too late by then because all the
>> vectorizers will have bailed out).
>>
>>
>> That is the strongest argument I see so far. There is a dependence
>> between optimizations. And the example will come where there is just not
>> enough speculation in SimplifyCFG for the next pattern to get recognized.
>>
> Right. min+min+min (horizontal reduce) won't currently get speculated.
>
>
>>
>> As you've said Gerolf, we have patchy max/min/abs matching and handling
>> in many places in codegen; it would be nice if this approach could remove
>> them all. My only worry is that at the moment, if a target does not support
>> the new intrinsics they will be expanded, but will be expanded only after
>> the DAG is type-legalized and so may miss out on some scalar optimizations.
>> I'm not sure how well-founded this worry is though.
>>
>> Surprises will happen anyway with a (re-) design like that. Just the
>> scraps to deal with when they happen.
>>
>> Would the new design make most the HW specific int/fp/vector intrinsics
>> in ./inc
>>
> lude/llvm/IR redundant?
>>
>>
>> I certainly expect the new intrinsics to map directly to the smin/umin
> NEON intrinsics in the A32 and A64 backends. It'd be nice if that was true
> for other architectures too but I just don't know.
>
>> Cheers,
>>
>> James
>>
>>
>> On Sat, 25 Apr 2015 at 04:19 Owen Anderson <resistor at mac.com> wrote:
>>
>>>
>>> On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <ghoflehner at apple.com>
>>> wrote:
>>>
>>> I was wondering if there were a performance advantage to have a higher
>>> level representation of min/max. My hunch is that recognizing the idiom at
>>> the low-level is ‘good enough’.
>>>
>>>
>>> I suspect (aside from vectorization) the impact to be minimal for
>>> integer min/max.  For floating point, I expect there to be greater upside
>>> because of ISA level support, and because the SW emulation expansion is
>>> typically complex.
>>>
>>> —Owen
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20150427/2df55a6c/attachment.html>