RFC: min/max/abs IR intrinsics

Gerolf Hoflehner ghoflehner at apple.com
Sat Apr 25 21:10:33 PDT 2015


Please see below.

> On Apr 25, 2015, at 1:44 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> 
> Hi,
> 
> I've also previously gone down the lines of a target-isel approach and had a patch that pretty much worked. I'll explain why I now think this is better done early in the IR. I should say though that I could be convinced to go back to the target-isel approach if there are better reasons :)
I’m curious to hear more about your arguments, and in particular what performance headroom you foresee. One of our CPU architects had pointed out that we could be generating smax/umax etc., and acting on that took me longer than it should have. So I thought about doing it at a higher level too, looked around, looked again, scratched my head, and relatively quickly talked myself into something like 'lot of work, little headroom, target-level support is good enough' :-)
> 
> 1. Matching min/max patterns can be awkward in the presence of constants. Consider the idiomatic max(x, 0) with a conversion after it:
>   
>   %1 = fcmp ogt float %x, 0.0
>   %2 = select i1 %1, float %x, float 0.0
>   %3 = fptoui float %2 to i32
> 
> InstCombine will push the fptoui up:
> 
>   %1 = fcmp ogt float %x, 0.0
>   %2 = fptoui float %x to i32
>   %3 = select i1 %1, i32 %2, i32 0
>   
> Now, you have a much more complex pattern to match. You have to identify that "i32 0" is the same as "float 0.0" with "fptoui" applied to it. This gets even more complex when you have nested mins and maxes, for example in "clamp(x, 0, 255)". It was this that I was having difficulty efficiently and reliably matching at the ISel level, and what convinced me to move higher up the pipe where at least testing was easier.
> 
> Now actually, we can do even better in IR and move further still up the pipe, and run before InstCombine. Then, we don't need any complex matching at all and everything is simpler.
I think of this as a shortcoming of InstCombine. It could convert the float to int before the compare, resulting in an integer cmp/select. Since this is an example of an fp -> int conversion, here is another thought: it seems that this fmin/max -> imin/max conversion is always the right thing to do, and the IR level is the appropriate place for it, but in some cases that depends on the HW. For example, if there are two instances of that pattern, one might want to keep one in fp and convert one to int so both can execute in parallel (assuming the HW provides the proper resources).

On the issue of complex pattern recognition at the ISel level: does this point to a general weakness that also affects other patterns, or is this an isolated instance that min/max at the IR level will take care of?
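
For reference, here is roughly the shape James' example would take once it is expressed with an explicit intrinsic. I'm using @llvm.maxnum.f32, which already exists for floating point (the integer variants would be the new ones this RFC is about); NaN/signed-zero semantics aside, the max survives InstCombine as a single call that later passes can match trivially:

  %1 = call float @llvm.maxnum.f32(float %x, float 0.0)
  %2 = fptoui float %1 to i32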

> 
> 2. As I've mentioned, this approach allows anything using the cost model to be more accurate. This currently is just the vectorizers but I see other users coming along soon like loop distribution and loop fusion.
It would be good if you could share examples where the difference matters. I expect we’ll have to look into the cost model much more closely soon.
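
To make the cost-model point concrete, here is a sketch (the intrinsic name is a hypothetical spelling of what the RFC proposes): today a vectorizer has to cost a max as a vector compare plus a vector select, whereas a single intrinsic call would let the target hook report the real cost of one vector max instruction.

  ; what the cost model sees today: a vector compare plus a vector select
  %c = icmp sgt <4 x i32> %a, %b
  %m = select <4 x i1> %c, <4 x i32> %a, <4 x i32> %b

  ; hypothetical intrinsic form: one call a target can price as a single smax
  %m2 = call <4 x i32> @llvm.smax.v4i32(<4 x i32> %a, <4 x i32> %b)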
> 
> 3. This simplifies pattern matching in backends with not just min/max support, but also with explicit "clamp" support such as GPUs. GPUs will often have (almost free) clamp_0_255 instructions or modifiers as this is used extremely often in image filters and shaders.
Ok, I take your word for it.
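
For what it's worth, the clamp(x, 0, 255) idiom would presumably end up as nested min/max calls (intrinsic names hypothetical), which is a much easier shape for a GPU backend to map onto a clamp instruction or modifier:

  ; clamp(x, 0, 255) as nested max/min
  %lo = call i32 @llvm.smax.i32(i32 %x, i32 0)
  %c  = call i32 @llvm.smin.i32(i32 %lo, i32 255)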
> 
> 4. Doing this at the IR level is immediately useful for all backends, not just (one of) AArch64 or AArch32.
Ok, although I expect all targets for which it matters have that support.
> 
> 5. To make sure the clamp() idiom got matched, I upped the amount that SimplifyCFG would speculate (in r229099). This caused some slight controversy - it is still in-tree but there is still the potential of pulling it out if it is found that it was too aggressive. The IR approach means we don't actually need it and can pattern match across basic blocks, which we can't do in ISel (and it'd be too late by then because all the vectorizers will have bailed out).

That is the strongest argument I see so far. There is a dependence between optimizations, and sooner or later an example will come along where there is just not enough speculation in SimplifyCFG for the next pattern to get recognized.
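
As an illustration of why cross-block matching matters: before SimplifyCFG speculates the select into existence, a max is spread over several basic blocks, and ISel (which works one block at a time) never sees a pattern it could match. A minimal sketch:

  ; max(x, y) before if-conversion -- no select exists yet
  entry:
    %c = icmp sgt i32 %x, %y
    br i1 %c, label %then, label %merge
  then:
    br label %merge
  merge:
    %m = phi i32 [ %x, %then ], [ %y, %entry ]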
> 
> As you've said Gerolf, we have patchy max/min/abs matching and handling in many places in codegen; it would be nice if this approach could remove them all. My only worry is that at the moment, if a target does not support the new intrinsics they will be expanded, but will be expanded only after the DAG is type-legalized and so may miss out on some scalar optimizations. I'm not sure how well-founded this worry is though.
Surprises will happen anyway with a (re-)design like that; those are just scraps to deal with when they happen.

Would the new design make most of the HW-specific int/fp/vector intrinsics in ./include/llvm/IR redundant?
> 
> 
> Cheers,
> 
> James
> 
> 
> On Sat, 25 Apr 2015 at 04:19 Owen Anderson <resistor at mac.com> wrote:
> 
>> On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>> 
>> I was wondering if there is a performance advantage to having a higher-level representation of min/max. My hunch is that recognizing the idiom at the low level is ‘good enough’.
> 
> I suspect (aside from vectorization) the impact to be minimal for integer min/max.  For floating point, I expect there to be greater upside because of ISA level support, and because the SW emulation expansion is typically complex.
> 
> —Owen
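
To illustrate Owen's point about the expansion: the obvious fcmp/select lowering below is not actually a faithful fmax for NaN or signed-zero inputs, which is why targets without a native instruction end up with a noticeably more complex sequence, and why an explicit FP intrinsic has more upside:

  ; naive lowering -- if %b is NaN this returns NaN rather than %a, and
  ; -0.0 vs +0.0 is not ordered, so a correct software fmax needs extra
  ; fixup code around this pair
  %c = fcmp ogt float %a, %b
  %m = select i1 %c, float %a, float %b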
