RFC: min/max/abs IR intrinsics
james at jamesmolloy.co.uk
Sun Apr 26 02:15:36 PDT 2015
On Sun, 26 Apr 2015 at 05:10 Gerolf Hoflehner <ghoflehner at apple.com> wrote:
> Please see below.
> On Apr 25, 2015, at 1:44 AM, James Molloy <james at jamesmolloy.co.uk> wrote:
> I've also previously gone down the lines of a target-isel approach and had
> a patch that pretty much worked. I'll explain why I now think this is
> better done early in the IR. I should say though that I could be convinced
> to go back to the target-isel approach if there are better reasons :)
> I’m more curious to hear about your arguments and in particular what
> performance headroom you foresee. One of our CPU architects had pointed out
> that we could generate smax/umax etc., which took me longer than it should
> have. So I thought about doing it at a higher level also, looked around,
> looked again, scratched my head and relatively quickly talked myself into
> something like 'lot of work, little headroom, target level support good
> enough' :-)
The performance improvements I see and the rationale for this is where you
have multiple mins/maxs in series. Several benchmarks I have such as image
filters use min/max to implement clamp(); obviously image filters tend to
be very small so every extra instruction counts. Similarly another image
filter uses a ternary min (min(a,b,c)), which ends up as min/min. That
currently doesn't get matched either.
I can see min/max reductions as being easier to identify (and either
vectorize or reassociate, both having big speedup potential). Finally and
perhaps the biggest headroom is in saturation. Matching the idiom clamp(x,
0, 255), you can generate a VQMOV. Even better, if the datatype is then
shrunk as it often is after such a clamp, you end up with one VQMOVN.
That's 5 instructions -> 1 instruction.
Obviously the effect this would have depends on the domain - image filters
(such as sharpen from Geekbench) are where this is really going to give
dividends. But even for non-image-filters, there's still a hygiene factor
where we really should be generating the right thing even if we don't get a
massive performance boost.
> 1. Matching min/max patterns can be awkward in the presence of constants.
> Consider the idiomatic max(x, 0) with a conversion after it:
> %1 = fcmp ogt float 0.0, float %x
> %2 = select i1 %1, float %x, float 0.0
> %3 = fptoui float %2 to i32
> InstCombine will push the fptoui up:
> %1 = fcmp ogt float 0.0, float %x
> %2 = fptoui float %x to i32
> %3 = select i1 %1, i32 %2, i32 0
> Now, you have a much more complex pattern to match. You have to identify
> that "i32 0" is the same as "float 0.0" with "fptoui" applied to it. This
> gets even more complex when you have nested mins and maxes, for example in
> "clamp(x, 0, 255)". It was this that I was having difficulty efficiently
> and reliably matching at the ISel level, and what convinced me to move
> higher up the pipe where at least testing was easier.
> Now actually, we can do even better in IR and move further still up the
> pipe, and run before InstCombine. Then, we don't need any complex matching
> at all and everything is simpler.
> I think of this as short-coming of inst combine. It could convert the
> float to int before the cmp resulting in int cmp/select. Since this an
> example for fp -> int conversion, here is another thought: It seems that
> this fmin/max -> imin/max conversion is always the right thing to do, and
> the IR level is appropriate, but in some cases that depends on the HW. For
> example, if there are two instances of that pattern one might want to
> convert one to fp and one to int so both can execute in parallel (assuming
> the HW provides the proper resources).
Yes, I think that's right, and I was going to control this with a target
hook. I'm seriously considering whether my pass might not be better written
as an instcombine rather than a standalone pass... haven't come to any
> The issue with complex pattern recognition at the ISEL level: does this
> point to a general weakness that also effect over pattern, or is this an
> isolated instance min/max at IR level will take care of?
> I think it's generally that while there are one or two patterns you'd need
to match at IR level, because there are more, specialized SDAG nodes the
problem becomes larger at that point. Also, it just stands to reason that
if you have a hard-to-match sequence, the longer you keep it around
unmatched the higher the likelihood of it being destroyed when you want to
> 2. As I've mentioned, this approach allows anything using the cost model
> to be more accurate. This currently is just the vectorizers but I see other
> users coming along soon like loop distribution and loop fusion.
> Would be good if you could share examples where the difference matters. I
> expect we’ll have to look into the cost model much more closely soon.
> I don't have explicit examples right now. I'm more thinking of hygiene
factor - making the model more accurate is a good thing regardless of
immediate benefit. But it should also help the register pressure estimation
code get a better result.
> 3. This simplifies pattern matching in backends with not just min/max
> support, but also with explicit "clamp" support such as GPUs. GPUs will
> often have (almost free) clamp_0_255 instructions or modifiers as this is
> used extremely often in image filters and shaders.
> Ok, I take your word for it.
> 4. Doing this at the IR level is immediately useful for all backends, not
> just (one of) AArch64 or AArch32.
> Ok, although I expect all targets for which it matters have that support.
> Right, although demonstrably it doesn't work at the moment, at least not
for AArch64 and AArch32. My experiments have shown that X86 isn't fantastic
either. Isolated mins/maxes, yes. Coupled mins/maxes? no.
> 5. To make sure the clamp() idiom got matched, I upped the amount that
> SimplifyCFG would speculate (in r229099). This caused some slight
> controversy - it is still in-tree but there is still the potential of
> pulling it out if it is found that it was too aggressive. The IR approach
> means we don't actually need it and can pattern match across basic blocks,
> which we can't do in ISel (and it'd be too late by then because all the
> vectorizers will have bailed out).
> That is the strongest argument I see so far. There is a dependence between
> optimizations. And the example will come where there is just not enough
> speculation in SimplifyCFG for the next pattern to get recognized.
Right. min+min+min (horizontal reduce) won't currently get speculated.
> As you've said Gerolf, we have patchy max/min/abs matching and handling in
> many places in codegen; it would be nice if this approach could remove them
> all. My only worry is that at the moment, if a target does not support the
> new intrinsics they will be expanded, but will be expanded only after the
> DAG is type-legalized and so may miss out on some scalar optimizations. I'm
> not sure how well-founded this worry is though.
> Surprises will happen anyway with a (re-) design like that. Just the
> scraps to deal with when they happen.
> Would the new design make most the HW specific int/fp/vector intrinsics in
> I certainly expect the new intrinsics to map directly to the smin/umin
NEON intrinsics in the A32 and A64 backends. It'd be nice if that was true
for other architectures too but I just don't know.
> On Sat, 25 Apr 2015 at 04:19 Owen Anderson <resistor at mac.com> wrote:
>> On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <ghoflehner at apple.com>
>> I was wondering if there were a performance advantage to have a higher
>> level representation of min/max. My hunch is that recognizing the idiom at
>> the low-level is ‘good enough’.
>> I suspect (aside from vectorization) the impact to be minimal for integer
>> min/max. For floating point, I expect there to be greater upside because
>> of ISA level support, and because the SW emulation expansion is typically
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the llvm-commits