RFC: min/max/abs IR intrinsics

Philip Reames listmail at philipreames.com
Mon Apr 27 14:04:59 PDT 2015


Can we agree that late matching is better than what we have now?  If so, 
I would suggest we implement that and then reassess.  (I'm assuming 
that the work for late matching isn't substantial enough that "wasting" 
the effort, should we later decide to move towards early matching, 
would be a problem.)

Owen, I disagree with you.  Beyond the engineering complexity part - 
which is a very real concern - I do not believe that min/max is the 
*right* canonicalization.  If a comparison is loop invariant, I really 
really want the select to be a standard unswitch idiom rather than 
having to teach the unswitcher to reason about an entirely new class of 
control flow (that is, control flow embedded within an intrinsic). 
(LoopUnswitch is used only as an example here; the point applies to any 
control-flow-aware transform.)
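To make this concrete, here is a minimal sketch of the idiom in question 
(hypothetical IR, not taken from the patch; %t, %limit, %a, %b, and %n 
are made-up values for illustration):

```llvm
; hypothetical loop; %t and %limit are loop-invariant
loop:
  %i = phi i32 [ 0, %entry ], [ %i.next, %loop ]
  %cmp = icmp slt i32 %t, %limit        ; invariant condition
  %v = select i1 %cmp, i32 %a, i32 %b   ; ordinary select: a standard unswitch idiom
  %i.next = add i32 %i, 1
  %exit.cond = icmp eq i32 %i.next, %n
  br i1 %exit.cond, label %exit, label %loop
```

Folded into a min/max-style intrinsic, that invariant condition would be 
buried inside a call operand, invisible to any control-flow-aware 
transform.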

If we support min/max as canonical form, where do we stop?  ABS? NABS?  
SAD?  Saturating arithmetic?  Overflow intrinsics?  I don't see how 
min/max are any different than the others.

Again, we need to move this discussion to llvm-dev.  This has grown 
beyond the scope of a patch review.

Philip

On 04/27/2015 10:32 AM, Owen Anderson wrote:
> I'm going to say that I strongly *disagree* with the late matching 
> concept.  This seems like a clear-cut canonicalization vs. lowering 
> situation.  Matching min/max operations seems like a straightforward 
> canonicalization, in that many programs that are semantically 
> equivalent but represented by slightly different IR patterns would be 
> unified after min/max matching.  The issue of, for example, ICMP 
> reuse, seems like a lowering detail, since the relevance of that 
> transformation is dependent on whether the target does or does not 
> have fused min/max instructions.
>
> —Owen
>
>
>> On Apr 27, 2015, at 8:23 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>>
>> Hi all,
>>
>> Thanks for all the comments! David, I think that was the main hurdle. 
>> With that instcombine fixed, the matching code becomes much simpler. 
>> I think there are only a few things left to do:
>>
>>   * Add new intrinsics and SDNodes for [su]{min,max} - 
>> http://reviews.llvm.org/D9293
>>   * Pull min/max matching out of InstCombine and into somewhere more 
>> accessible - http://reviews.llvm.org/D9294
>>   * Match min/max patterns in CodeGenPrepare into intrinsics, if the 
>> intrinsic version is not marked "expand". - WIP, still writing tests.
>>   * Enable min/max generation for AArch64 and other targets - WIP, 
>> still writing tests.
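For reference, the proposed intrinsic declarations would look roughly 
like this (a sketch only; the exact names and overloads are whatever the 
D9293 review settles on):

```llvm
; scalar and vector forms, overloaded on the operand type
declare i32 @llvm.smax.i32(i32, i32)
declare i32 @llvm.umin.i32(i32, i32)
declare <4 x i32> @llvm.smin.v4i32(<4 x i32>, <4 x i32>)
```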
>>
>> Gerolf, I'm hoping the above will really not take long. If this gets 
>> approved in concept, the actual mechanics are very easy - I have two 
>> out of four patches ready to go at the moment and the rest are 
>> implemented, just needing more testing. Does it make sense to hold 
>> off your target-specific version for a few days and let this one go 
>> in instead, barring major objections?
>>
>> Cheers,
>>
>> James
>>
>> On Mon, 27 Apr 2015 at 02:40 Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>>
>>     Hi James,
>>
>>     sounds like you have a number of good reasons for the intrinsics.
>>     Having them would allow us a) to catch more complex patterns and
>>     b) to gain more insight into the impact of that design on the
>>     rest of the compiler. So initially there would be a mix of
>>     higher-level and low-level min/max recognition while the design
>>     evolves and potentially addresses, e.g., Philip's concerns. While
>>     all this is in progress, do you think it is OK to commit the
>>     target-specific vmin/vmax patch I sent earlier?
>>
>>     Cheers
>>     Gerolf
>>
>>>     On Apr 26, 2015, at 2:15 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>>>
>>>     Hi Gerolf,
>>>
>>>     On Sun, 26 Apr 2015 at 05:10 Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>>>
>>>         Please see below.
>>>
>>>>         On Apr 25, 2015, at 1:44 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
>>>>
>>>>         Hi,
>>>>
>>>>         I've also previously gone down the path of a target-isel
>>>>         approach and had a patch that pretty much worked. I'll
>>>>         explain why I now think this is better done early in the
>>>>         IR. I should say though that I could be convinced to go
>>>>         back to the target-isel approach if there are better reasons :)
>>>         I’m more curious to hear about your arguments and in
>>>         particular what performance headroom you foresee. One of our
>>>         CPU architects had pointed out that we could generate
>>>         smax/umax etc., which took me longer than it should have. So
>>>         I thought about doing it at a higher level also, looked
>>>         around, looked again, scratched my head and relatively
>>>         quickly talked myself into something like 'lots of work,
>>>         little headroom, target-level support good enough' :-)
>>>
>>>
>>>     The performance improvements I see, and the rationale for this,
>>>     come where you have multiple mins/maxes in series. Several
>>>     benchmarks I have, such as image filters, use min/max to
>>>     implement clamp(); image filters tend to be very small, so every
>>>     extra instruction counts. Similarly, another image filter uses a
>>>     ternary min (min(a,b,c)), which ends up as min/min. That
>>>     currently doesn't get matched either.
>>>
>>>     I can see min/max reductions as being easier to identify (and
>>>     either vectorize or reassociate, both having big speedup
>>>     potential). Finally, perhaps the biggest headroom is in
>>>     saturation. By matching the idiom clamp(x, 0, 255), you can
>>>     generate a VQMOV. Even better, if the datatype is then shrunk as
>>>     it often is after such a clamp, you end up with one VQMOVN.
>>>     That's 5 instructions -> 1 instruction.
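As a concrete illustration of the clamp idiom (plain C; the helper 
names are made up for this sketch and not taken from any of the 
patches):

```c
/* Illustrative helpers; names are invented for this sketch. */
static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* clamp(x, 0, 255) plus a narrowing conversion: the nested min/max
   pattern described above.  Today this lowers to a compare/select
   chain; with min/max intrinsics it could become a single saturating
   narrow (e.g. VQMOVN on ARM). */
unsigned char clamp_u8(int x) {
    return (unsigned char)imax(imin(x, 255), 0);
}
```

A ternary min(a,b,c) is then just imin(imin(a,b),c), the min/min chain 
mentioned above.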
>>>
>>>     Obviously the effect this would have depends on the domain -
>>>     image filters (such as sharpen from Geekbench) are where this is
>>>     really going to pay dividends. But even for non-image-filters,
>>>     there's still a hygiene factor where we really should be
>>>     generating the right thing even if we don't get a massive
>>>     performance boost.
>>>
>>>>
>>>>         1. Matching min/max patterns can be awkward in the presence
>>>>         of constants. Consider the idiomatic max(x, 0) with a
>>>>         conversion after it:
>>>>         %1 = fcmp ogt float %x, 0.0
>>>>         %2 = select i1 %1, float %x, float 0.0
>>>>         %3 = fptoui float %2 to i32
>>>>
>>>>         InstCombine will push the fptoui up:
>>>>
>>>>         %1 = fcmp ogt float %x, 0.0
>>>>         %2 = fptoui float %x to i32
>>>>         %3 = select i1 %1, i32 %2, i32 0
>>>>         Now, you have a much more complex pattern to match. You
>>>>         have to identify that "i32 0" is the same as "float 0.0"
>>>>         with "fptoui" applied to it. This gets even more complex
>>>>         when you have nested mins and maxes, for example in
>>>>         "clamp(x, 0, 255)". It was this that I was having
>>>>         difficulty efficiently and reliably matching at the ISel
>>>>         level, and what convinced me to move higher up the pipe
>>>>         where at least testing was easier.
>>>>
>>>>         Now actually, we can do even better in IR and move further
>>>>         still up the pipe, and run before InstCombine. Then, we
>>>>         don't need any complex matching at all and everything is
>>>>         simpler.
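With early matching, the sequence above could instead collapse to 
something like the following before InstCombine ever sinks the 
conversion (hypothetical; @llvm.maxnum.f32 stands in for whatever fp 
max form the proposal settles on, and floating-point corner cases such 
as NaN and signed zero would still need checking against the 
select-based semantics):

```llvm
%1 = call float @llvm.maxnum.f32(float %x, float 0.000000e+00)
%2 = fptoui float %1 to i32
```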
>>>         I think of this as a shortcoming of InstCombine. It could
>>>         convert the float to int before the cmp, resulting in an int
>>>         cmp/select. Since this is an example of fp -> int conversion,
>>>         here is another thought: it seems that this fmin/max ->
>>>         imin/max conversion is always the right thing to do, and the
>>>         IR level is appropriate, but in some cases that depends on
>>>         the HW. For example, if there are two instances of that
>>>         pattern one might want to convert one to fp and one to int
>>>         so both can execute in parallel (assuming the HW provides
>>>         the proper resources).
>>>
>>>
>>>     Yes, I think that's right, and I was going to control this with
>>>     a target hook. I'm seriously considering whether my pass might
>>>     not be better written as an instcombine rather than a standalone
>>>     pass... haven't come to any conclusions yet.
>>>
>>>
>>>         The issue with complex pattern recognition at the ISel
>>>         level: does this point to a general weakness that also
>>>         affects other patterns, or is this an isolated instance that
>>>         min/max at the IR level will take care of?
>>>
>>>     I think it's general: while there are only one or two patterns
>>>     you'd need to match at the IR level, there are more, specialized
>>>     SDAG nodes, so the problem becomes larger at that point.
>>>     Also, it just stands to reason that if you have a hard-to-match
>>>     sequence, the longer you keep it around unmatched, the higher the
>>>     likelihood of it being destroyed before you want to match it.
>>>
>>>>
>>>>         2. As I've mentioned, this approach allows anything using
>>>>         the cost model to be more accurate. This currently is just
>>>>         the vectorizers but I see other users coming along soon
>>>>         like loop distribution and loop fusion.
>>>         It would be good if you could share examples where the
>>>         difference matters. I expect we'll have to look into the
>>>         cost model much more closely soon.
>>>
>>>     I don't have explicit examples right now. I'm thinking more of
>>>     the hygiene factor - making the model more accurate is a good thing
>>>     regardless of immediate benefit. But it should also help the
>>>     register pressure estimation code get a better result.
>>>
>>>>
>>>>         3. This simplifies pattern matching in backends with not
>>>>         just min/max support, but also with explicit "clamp"
>>>>         support such as GPUs. GPUs will often have (almost free)
>>>>         clamp_0_255 instructions or modifiers as this is used
>>>>         extremely often in image filters and shaders.
>>>         Ok, I take your word for it.
>>>
>>>>
>>>>         4. Doing this at the IR level is immediately useful for all
>>>>         backends, not just (one of) AArch64 or AArch32.
>>>         Ok, although I expect all targets for which it matters have
>>>         that support.
>>>
>>>     Right, although demonstrably it doesn't work at the moment, at
>>>     least not for AArch64 and AArch32. My experiments have shown
>>>     that X86 isn't fantastic either. Isolated mins/maxes, yes.
>>>     Coupled mins/maxes? No.
>>>
>>>>
>>>>         5. To make sure the clamp() idiom got matched, I upped the
>>>>         amount that SimplifyCFG would speculate (in r229099). This
>>>>         caused some slight controversy - it is still in-tree but
>>>>         there is still the potential of pulling it out if it is
>>>>         found that it was too aggressive. The IR approach means we
>>>>         don't actually need it and can pattern match across basic
>>>>         blocks, which we can't do in ISel (and it'd be too late by
>>>>         then because all the vectorizers will have bailed out).
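For illustration, before SimplifyCFG speculates it into a select, a max 
written with a branch spans multiple basic blocks (a sketch, not taken 
from the patches):

```llvm
entry:
  %cmp = icmp sgt i32 %x, %y
  br i1 %cmp, label %merge, label %y.side
y.side:
  br label %merge
merge:
  %max = phi i32 [ %x, %entry ], [ %y, %y.side ]
```

SelectionDAG ISel works one basic block at a time, so a matcher running 
on IR is the only place this form can be caught.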
>>>
>>>         That is the strongest argument I see so far. There is a
>>>         dependence between optimizations. And the example will come
>>>         where there is just not enough speculation in SimplifyCFG
>>>         for the next pattern to get recognized.
>>>
>>>     Right. min+min+min (horizontal reduce) won't currently get
>>>     speculated.
>>>
>>>
>>>>
>>>>         As you've said Gerolf, we have patchy max/min/abs matching
>>>>         and handling in many places in codegen; it would be nice if
>>>>         this approach could remove them all. My only worry is that
>>>>         at the moment, if a target does not support the new
>>>>         intrinsics they will be expanded, but will be expanded only
>>>>         after the DAG is type-legalized and so may miss out on some
>>>>         scalar optimizations. I'm not sure how well-founded this
>>>>         worry is though.
>>>         Surprises will happen anyway with a (re-)design like that.
>>>         Just scraps to deal with when they happen.
>>>
>>>         Would the new design make most of the HW-specific
>>>         int/fp/vector intrinsics in ./include/llvm/IR redundant?
>>>>
>>>     I certainly expect the new intrinsics to map directly to the
>>>     smin/umin NEON intrinsics in the A32 and A64 backends. It'd be
>>>     nice if that was true for other architectures too but I just
>>>     don't know.
>>>
>>>>         Cheers,
>>>>
>>>>         James
>>>>
>>>>
>>>>         On Sat, 25 Apr 2015 at 04:19 Owen Anderson <resistor at mac.com> wrote:
>>>>
>>>>
>>>>>             On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>>>>>
>>>>>             I was wondering if there were a performance advantage
>>>>>             to having a higher-level representation of min/max. My
>>>>>             hunch is that recognizing the idiom at the low level
>>>>>             is 'good enough'.
>>>>
>>>>             I suspect (aside from vectorization) the impact to be
>>>>             minimal for integer min/max. For floating point, I
>>>>             expect there to be greater upside because of ISA level
>>>>             support, and because the SW emulation expansion is
>>>>             typically complex.
>>>>
>>>>             —Owen
>>>>
>>
>
>
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
