RFC: min/max/abs IR intrinsics

James Molloy james at jamesmolloy.co.uk
Mon Apr 27 11:46:12 PDT 2015


Hi Owen,

I know (hope! ;)) you meant that tongue-in-cheek, but just to reiterate for
more literal-minded people - I'm looking to do the right thing here, not
the quick hack. I thought we'd got consensus, but it appears not!

James

On Mon, 27 Apr 2015 at 19:15 Gerolf Hoflehner <ghoflehner at apple.com> wrote:

> Sure, that sounds very reasonable. No rush from my side.
>
> Cheers
> Gerolf
>
> On Apr 27, 2015, at 8:23 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>
> Hi all,
>
> Thanks for all the comments! David, I think that was the main hurdle. With
> that instcombine fixed, the matching code becomes much simpler. I think
> there are only a couple of things left to do:
>
>   * Add new intrinsics and SDNodes for [su]{min,max} -
> http://reviews.llvm.org/D9293
>   * Pull min/max matching out of InstCombine and into somewhere more
> accessible - http://reviews.llvm.org/D9294
>   * Match min/max patterns in CodeGenPrepare into intrinsics, if the
> intrinsic version is not marked "expand". - WIP, still writing tests.
>   * Enable min/max generation for AArch64 and other targets - WIP, still
> writing tests.
>
> Gerolf, I'm hoping the above will really not take long. If this gets
> approved in concept, the actual mechanics are very easy - I have two out of
> four patches ready to go at the moment and the rest are implemented, just
> needing more testing. Does it make sense to hold off your target-specific
> version for a few days and let this one go in instead, barring major
> objections?
>
> Cheers,
>
> James
>
> On Mon, 27 Apr 2015 at 02:40 Gerolf Hoflehner <ghoflehner at apple.com>
> wrote:
>
>> Hi James,
>>
>> sounds like you have a number of good reasons for the intrinsics. Having
>> them would allow us a) to catch more complex patterns and b) to gain more
>> insight into the impact of that design on the rest of the compiler. So
>> initially there would be a mix of higher-level and low-level min/max
>> recognition while the design evolves and potentially addresses, e.g.,
>> Philip's concerns. While all this is in progress, do you think it is OK to
>> commit the target-specific vmin/vmax patch I sent earlier?
>>
>> Cheers
>> Gerolf
>>
>> On Apr 26, 2015, at 2:15 AM, James Molloy <james at jamesmolloy.co.uk>
>> wrote:
>>
>> Hi Gerolf,
>>
>> On Sun, 26 Apr 2015 at 05:10 Gerolf Hoflehner <ghoflehner at apple.com>
>> wrote:
>>
>>> Please see below.
>>>
>>> On Apr 25, 2015, at 1:44 AM, James Molloy <james at jamesmolloy.co.uk>
>>> wrote:
>>>
>>> Hi,
>>>
>>> I've also previously gone down the route of a target-isel approach and
>>> had a patch that pretty much worked. I'll explain why I now think this is
>>> better done early in the IR. I should say, though, that I could be
>>> convinced to go back to the target-isel approach if there are better
>>> reasons :)
>>>
>>> I’m more curious to hear about your arguments, and in particular what
>>> performance headroom you foresee. One of our CPU architects had pointed out
>>> that we could generate smax/umax etc., which took me longer than it should
>>> have. So I thought about doing it at a higher level also, looked around,
>>> looked again, scratched my head, and relatively quickly talked myself into
>>> something like 'lot of work, little headroom, target-level support good
>>> enough' :-)
>>>
>>
>> The performance improvements I see, and the rationale for this, are in
>> cases where you have multiple mins/maxes in series. Several benchmarks I
>> have, such as image filters, use min/max to implement clamp(); obviously
>> image filters tend to be very small, so every extra instruction counts.
>> Similarly, another image filter uses a ternary min (min(a,b,c)), which
>> ends up as min/min. That currently doesn't get matched either.
>>
>> I can see min/max reductions as being easier to identify (and either
>> vectorize or reassociate, both having big speedup potential). Finally, and
>> perhaps the biggest headroom, is saturation. By matching the idiom
>> clamp(x, 0, 255), you can generate a VQMOV. Even better, if the datatype
>> is then shrunk, as it often is after such a clamp, you end up with one
>> VQMOVN. That's 5 instructions -> 1 instruction.
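To make the saturation case concrete, here is a hedged C sketch of the clamp-then-narrow idiom being described (function and variable names are illustrative, not taken from any of the patches):

```c
/* clamp(x, 0, 255) followed by a narrowing conversion: the nested
 * max/min pair plus the truncating store that, matched as a single
 * idiom, should collapse to one saturating-narrow (VQMOVN) per vector
 * on ARM/AArch64 instead of separate max, min, and narrow steps. */
void clamp_to_u8(const int *in, unsigned char *out, int n) {
    for (int i = 0; i < n; i++) {
        int v = in[i];
        if (v < 0)   v = 0;        /* max(v, 0)   */
        if (v > 255) v = 255;      /* min(v, 255) */
        out[i] = (unsigned char)v; /* narrowing conversion */
    }
}
```

Written with branches like this, the pattern also shows why SimplifyCFG speculation matters: the compares must first be flattened into selects before any min/max matcher can see them.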
>>
>> Obviously the effect this would have depends on the domain - image
>> filters (such as sharpen from Geekbench) are where this is really going to
>> give dividends. But even for non-image-filters, there's still a hygiene
>> factor where we really should be generating the right thing even if we
>> don't get a massive performance boost.
>>
>>
>>>
>>> 1. Matching min/max patterns can be awkward in the presence of
>>> constants. Consider the idiomatic max(x, 0) with a conversion after it:
>>>
>>>   %1 = fcmp ogt float %x, 0.0
>>>   %2 = select i1 %1, float %x, float 0.0
>>>   %3 = fptoui float %2 to i32
>>>
>>> InstCombine will push the fptoui up:
>>>
>>>   %1 = fcmp ogt float %x, 0.0
>>>   %2 = fptoui float %x to i32
>>>   %3 = select i1 %1, i32 %2, i32 0
>>>
>>> Now, you have a much more complex pattern to match. You have to identify
>>> that "i32 0" is the same as "float 0.0" with "fptoui" applied to it. This
>>> gets even more complex when you have nested mins and maxes, for example in
>>> "clamp(x, 0, 255)". It was this that I was having difficulty efficiently
>>> and reliably matching at the ISel level, and what convinced me to move
>>> higher up the pipe where at least testing was easier.
>>>
>>> Now actually, we can do even better in IR and move further still up the
>>> pipe, and run before InstCombine. Then, we don't need any complex matching
>>> at all and everything is simpler.
>>>
>>> I think of this as a shortcoming of InstCombine. It could convert the
>>> float to int before the cmp, resulting in an int cmp/select. Since this
>>> is an example of fp -> int conversion, here is another thought: it seems
>>> that this fmin/max -> imin/max conversion is always the right thing to
>>> do, and the IR level is appropriate, but in some cases that depends on
>>> the HW. For example, if there are two instances of that pattern, one
>>> might want to convert one to fp and one to int so both can execute in
>>> parallel (assuming the HW provides the proper resources).
>>>
>>
>> Yes, I think that's right, and I was going to control this with a target
>> hook. I'm seriously considering whether my pass might not be better written
>> as an instcombine rather than a standalone pass... haven't come to any
>> conclusions yet.
>>
>>
>>>
>>> The issue with complex pattern recognition at the ISel level: does this
>>> point to a general weakness that also affects other patterns, or is this
>>> an isolated instance that min/max at the IR level will take care of?
>>>
>> I think it's generally that while there are only one or two patterns
>> you'd need to match at the IR level, there are more, specialized SDAG
>> nodes, so the problem becomes larger at that point. Also, it just stands
>> to reason that the longer you keep a hard-to-match sequence around
>> unmatched, the higher the likelihood of it being destroyed before you get
>> the chance to match it.
>>
>>>
>>> 2. As I've mentioned, this approach allows anything using the cost model
>>> to be more accurate. Currently that is just the vectorizers, but I see
>>> other users coming along soon, like loop distribution and loop fusion.
>>>
>>> Would be good if you could share examples where the difference matters.
>>> I expect we’ll have to look into the cost model much more closely soon.
>>>
>> I don't have explicit examples right now. I'm thinking more of the
>> hygiene factor - making the model more accurate is a good thing regardless
>> of immediate benefit. But it should also help the register pressure
>> estimation code get a better result.
>>
>>>
>>> 3. This simplifies pattern matching in backends with not just min/max
>>> support, but also with explicit "clamp" support such as GPUs. GPUs will
>>> often have (almost free) clamp_0_255 instructions or modifiers as this is
>>> used extremely often in image filters and shaders.
>>>
>>> Ok, I take your word for it.
>>>
>>>
>>> 4. Doing this at the IR level is immediately useful for all backends,
>>> not just (one of) AArch64 or AArch32.
>>>
>>> Ok, although I expect all targets for which it matters have that support.
>>>
>> Right, although demonstrably it doesn't work at the moment, at least not
>> for AArch64 and AArch32. My experiments have shown that X86 isn't
>> fantastic either. Isolated mins/maxes, yes. Coupled mins/maxes? No.
>>
>>>
>>> 5. To make sure the clamp() idiom got matched, I upped the amount that
>>> SimplifyCFG would speculate (in r229099). This caused some slight
>>> controversy - it is still in-tree but there is still the potential of
>>> pulling it out if it is found that it was too aggressive. The IR approach
>>> means we don't actually need it and can pattern match across basic blocks,
>>> which we can't do in ISel (and it'd be too late by then because all the
>>> vectorizers will have bailed out).
>>>
>>>
>>> That is the strongest argument I see so far. There is a dependence
>>> between optimizations. And the example will come where there is just not
>>> enough speculation in SimplifyCFG for the next pattern to get recognized.
>>>
>> Right. min+min+min (horizontal reduce) won't currently get speculated.
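As a concrete (hypothetical) sketch of such a horizontal reduce in C, with illustrative names:

```c
/* A ternary/horizontal min: min(a, b, c) lowers to two dependent min
 * operations, the chained-select shape that SimplifyCFG will not
 * currently speculate when it starts out as branchy control flow. */
static int min2(int a, int b) {
    return a < b ? a : b;
}

int min3(int a, int b, int c) {
    return min2(min2(a, b), c);
}
```

With dedicated [su]{min,max} intrinsics, the inner and outer min become two intrinsic calls that later passes can reassociate or vectorize without re-deriving the idiom from compare/select pairs.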
>>
>>
>>>
>>> As you've said, Gerolf, we have patchy max/min/abs matching and handling
>>> in many places in codegen; it would be nice if this approach could remove
>>> them all. My only worry is that at the moment, if a target does not
>>> support the new intrinsics, they will be expanded, but only after the DAG
>>> is type-legalized, and so may miss out on some scalar optimizations. I'm
>>> not sure how well-founded this worry is, though.
>>>
>>> Surprises will happen anyway with a (re-)design like that. Those are
>>> just the scraps to deal with when they happen.
>>>
>>> Would the new design make most of the HW-specific int/fp/vector
>>> intrinsics in ./include/llvm/IR redundant?
>>>
>>>
>> I certainly expect the new intrinsics to map directly to the smin/umin
>> NEON intrinsics in the A32 and A64 backends. It'd be nice if that were
>> true for other architectures too, but I just don't know.
>>
>>> Cheers,
>>>
>>> James
>>>
>>>
>>> On Sat, 25 Apr 2015 at 04:19 Owen Anderson <resistor at mac.com> wrote:
>>>
>>>>
>>>> On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <ghoflehner at apple.com>
>>>> wrote:
>>>>
>>>> I was wondering if there were a performance advantage to having a
>>>> higher-level representation of min/max. My hunch is that recognizing the
>>>> idiom at the low level is ‘good enough’.
>>>>
>>>>
>>>> I suspect (aside from vectorization) the impact to be minimal for
>>>> integer min/max.  For floating point, I expect there to be greater upside
>>>> because of ISA level support, and because the SW emulation expansion is
>>>> typically complex.
>>>>
>>>> —Owen
>>>>
>>>
>>
>