<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">I’m going to go say that I strongly *disagree* with the late matching concept.  This seems like a clear-cut canonicalization vs lowering situation.  Matching min/max operations seems like a straight-forward canonicalization, in that many programs that are semantically equivalent but represented by slightly different IR patterns would be unified after min/max matching.  The issue of, for example, ICMP reuse, seems like a lowering detail, since the relevance of that transformation is dependent on whether the target does or does not have fused min/max instructions.<div class=""><br class=""></div><div class="">—Owen<br class=""><div class=""><br class=""></div><div class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Apr 27, 2015, at 8:23 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class="">Hi all,<br class=""><br class="">Thanks for all the comments! David, I think that was the main hurdle. With that instcombine fixed, the matching code becomes much simpler. I think there's only a couple of things to do:<div class=""><br class=""></div><div class="">  * Add new intrinsics and SDNodes for [su]{min,max} - <a href="http://reviews.llvm.org/D9293" class="">http://reviews.llvm.org/D9293</a></div><div class="">  * Pull min/max matching out of InstCombine and into somewhere more accessible - <a href="http://reviews.llvm.org/D9294" class="">http://reviews.llvm.org/D9294</a></div><div class="">  * Match min/max patterns in CodeGenPrepare into intrinsics, if the intrinsic version is not marked "expand". - WIP, still writing tests.</div><div class="">  * Enable min/max generation for AArch64 and other targets - WIP, still writing tests.</div><div class=""><br class=""></div><div class="">Gerolf, I'm hoping the above will really not take long. If this gets approved in concept, the actual mechanics are very easy - I have two out of four patches ready to go at the moment and the rest are implemented, just needing more testing. Does it make sense to hold off your target-specific version for a few days and let this one go in instead, barring major objections?</div><div class=""><br class=""></div><div class="">Cheers,</div><div class=""><br class=""></div><div class="">James</div></div><br class=""><div class="gmail_quote">On Mon, 27 Apr 2015 at 02:40 Gerolf Hoflehner <<a href="mailto:ghoflehner@apple.com" class="">ghoflehner@apple.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word" class="">Hi James,<div class=""><br class=""></div><div class="">sounds like you have a number of good reasons for the intrinsics. Having them would allow a) to catch more complex pattern and b) gain more insight into the impact that design on the rest of the compiler. So initially there would be a mix of higher level and low level min/max recognition while the design evolves and potentially addresses  eg. Phillips concerns. While all this is in progress, do you think it is OK to commit the target speciifc vmin/vmax patch I sent earlier?</div><div class=""><br class=""></div><div class="">Cheers</div></div><div style="word-wrap:break-word" class=""><div class="">Gerolf</div></div><div style="word-wrap:break-word" class=""><div class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">On Apr 26, 2015, at 2:15 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" target="_blank" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class=""><div class=""><div dir="ltr" style="font-family:Helvetica;font-size:18px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class="">Hi Gerolf,<br class=""><br class=""><div class="gmail_quote">On Sun, 26 Apr 2015 at 05:10 Gerolf Hoflehner <<a href="mailto:ghoflehner@apple.com" target="_blank" class="">ghoflehner@apple.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class="">Please see below.<div class=""><br class=""><div class=""></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class="">On Apr 25, 2015, at 1:44 AM, James Molloy <<a href="mailto:james@jamesmolloy.co.uk" target="_blank" class="">james@jamesmolloy.co.uk</a>> wrote:</div><br class=""><div class=""><div dir="ltr" class="">Hi,<br class=""><br class="">I've also previously gone down the lines of a target-isel approach and had a patch that pretty much worked. I'll explain why I now think this is better done early in the IR. I should say though that I could be convinced to go back to the target-isel approach if there are better reasons :)</div></div></blockquote></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><div class="">I’m more curious to hear about your arguments and in particular what performance headroom you foresee. One of our CPU architects had pointed out that we could generate smax/umax etc., which took me longer than it should have. So I thought about doing it at a higher level also, looked around, looked again, scratched my head and relatively quickly talked myself into something  like 'lot of work, little headroom, target level support good enough' :-)</div></div></div></div></blockquote><div class=""><br class=""></div><div class="">The performance improvements I see and the rationale for this is where you have multiple mins/maxs in series. Several benchmarks I have such as image filters use min/max to implement clamp(); obviously image filters tend to be very small so every extra instruction counts. Similarly another image filter uses a ternary min (min(a,b,c)), which ends up as min/min. That currently doesn't get matched either.</div><div class=""><br class=""></div><div class="">I can see min/max reductions as being easier to identify (and either vectorize or reassociate, both having big speedup potential). Finally and perhaps the biggest headroom is in saturation. Matching the idiom clamp(x, 0, 255), you can generate a VQMOV. Even better, if the datatype is then shrunk as it often is after such a clamp, you end up with one VQMOVN. That's 5 instructions -> 1 instruction.</div><div class=""><br class=""></div><div class="">Obviously the effect this would have depends on the domain - image filters (such as sharpen from Geekbench) are where this is really going to give dividends. But even for non-image-filters, there's still a hygiene factor where we really should be generating the right thing even if we don't get a massive performance boost.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">1. Matching min/max patterns can be awkward in the presence of constants. Consider the idiomatic max(x, 0) with a conversion after it:</div><div class="">  </div><div class=""> <span class=""> </span>%1 = fcmp ogt float 0.0, float %x</div><div class=""> <span class=""> </span>%2 = select i1 %1, float %x, float 0.0</div><div class=""> <span class=""> </span>%3 = fptoui float %2 to i32</div><div class=""><br class=""></div><div class="">InstCombine will push the fptoui up:</div><div class=""><br class=""></div><div class="">  <span style="font-size:13.1999998092651px;line-height:19.7999992370605px" class="">%1 = fcmp ogt float 0.0, float %x</span></div><div class=""><span style="font-size:13.1999998092651px;line-height:19.7999992370605px" class=""> <span class=""> </span>%2 = fptoui float %x to i32</span><span style="font-size:13.1999998092651px;line-height:19.7999992370605px" class=""><br class=""></span></div><div style="font-size:13.1999998092651px;line-height:19.7999992370605px" class=""> <span class=""> </span>%3 = select i1 %1, i32 %2, i32 0</div><div style="font-size:13.1999998092651px;line-height:19.7999992370605px" class="">  </div><div class="">Now, you have a much more complex pattern to match. You have to identify that "i32 0" is the same as "float 0.0" with "fptoui" applied to it. This gets even more complex when you have nested mins and maxes, for example in "clamp(x, 0, 255)". It was this that I was having difficulty efficiently and reliably matching at the ISel level, and what convinced me to move higher up the pipe where at least testing was easier.</div><div class=""><br class=""></div><div class="">Now actually, we can do even better in IR and move further still up the pipe, and run before InstCombine. Then, we don't need any complex matching at all and everything is simpler.</div></div></div></blockquote></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><div class="">I think of this as short-coming of inst combine. It could convert the float to int before the cmp resulting in int cmp/select. Since this an example for fp -> int conversion, here is another thought:  It seems that this fmin/max -> imin/max conversion is always the right thing to do, and the IR level is appropriate, but in some cases that depends on the HW. For example, if there are two instances of that pattern one might want to convert one to fp and one to int so both can execute in parallel (assuming the HW provides the proper resources). </div></div></div></div></blockquote><div class=""><br class=""></div><div class="">Yes, I think that's right, and I was going to control this with a target hook. I'm seriously considering whether my pass might not be better written as an instcombine rather than a standalone pass... haven't come to any conclusions yet.</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><div class=""><br class=""></div><div class="">The issue with complex pattern recognition at the ISEL level: does this point to a general weakness that also effect over pattern, or is this an isolated instance min/max at IR level will take care of?</div></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><br class=""></div></div></div></blockquote><div class="">I think it's generally that while there are one or two patterns you'd need to match at IR level, because there are more, specialized SDAG nodes the problem becomes larger at that point. Also, it just stands to reason that if you have a hard-to-match sequence, the longer you keep it around unmatched the higher the likelihood of it being destroyed when you want to match it.</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">2. As I've mentioned, this approach allows anything using the cost model to be more accurate. This currently is just the vectorizers but I see other users coming along soon like loop distribution and loop fusion.</div></div></div></blockquote></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class="">Would be good if you could share examples where the difference matters. I expect we’ll have to look into the cost model much more closely soon.</div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><br class=""></div></div></div></blockquote><div class="">I don't have explicit examples right now. I'm more thinking of hygiene factor - making the model more accurate is a good thing regardless of immediate benefit. But it should also help the register pressure estimation code get a better result. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">3. This simplifies pattern matching in backends with not just min/max support, but also with explicit "clamp" support such as GPUs. GPUs will often have (almost free) clamp_0_255 instructions or modifiers as this is used extremely often in image filters and shaders.</div></div></div></blockquote></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class="">Ok, I take your word for it.</div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">4. Doing this at the IR level is immediately useful for all backends, not just (one of) AArch64 or AArch32.</div></div></div></blockquote></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class="">Ok, although I expect all targets for which it matters have that support.</div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><br class=""></div></div></div></blockquote><div class="">Right, although demonstrably it doesn't work at the moment, at least not for AArch64 and AArch32. My experiments have shown that X86 isn't fantastic either. Isolated mins/maxes, yes. Coupled mins/maxes? no. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">5. To make sure the clamp() idiom got matched, I upped the amount that SimplifyCFG would speculate (in r229099). This caused some slight controversy - it is still in-tree but there is still the potential of pulling it out if it is found that it was too aggressive. The IR approach means we don't actually need it and can pattern match across basic blocks, which we can't do in ISel (and it'd be too late by then because all the vectorizers will have bailed out).</div></div></div></blockquote><div class=""><br class=""></div></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class="">That is the strongest argument I see so far. There is a dependence between optimizations. And the example will come where there is just not enough speculation in SimplifyCFG for the next pattern to get recognized.</div></div></div></blockquote><div class="">Right. min+min+min (horizontal reduce) won't currently get speculated. </div><div class=""><br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><br class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div><div class="">As you've said Gerolf, we have patchy max/min/abs matching and handling in many places in codegen; it would be nice if this approach could remove them all. My only worry is that at the moment, if a target does not support the new intrinsics they will be expanded, but will be expanded only after the DAG is type-legalized and so may miss out on some scalar optimizations. I'm not sure how well-founded this worry is though.</div></div></div></blockquote></div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class="">Surprises will happen anyway with a (re-) design like that. Just the scraps to deal with when they happen.</div><div class=""><br class=""></div><div class="">Would the new design make most the HW specific int/fp/vector intrinsics in ./inc </div></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class="">lude/llvm/IR redundant?</div></div></div><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""><br class=""></div></div></div></blockquote></div></div></div></blockquote><div class="">I certainly expect the new intrinsics to map directly to the smin/umin NEON intrinsics in the A32 and A64 backends. It'd be nice if that was true for other architectures too but I just don't know. </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><div class=""><div class=""><blockquote type="cite" class=""><div class=""><div dir="ltr" class=""><div class=""></div><div class="">Cheers,</div><div class=""><br class=""></div><div class="">James</div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote">On Sat, 25 Apr 2015 at 04:19 Owen Anderson <<a href="mailto:resistor@mac.com" target="_blank" class="">resistor@mac.com</a>> wrote:<br class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word" class=""><br class=""><div class=""><blockquote type="cite" class=""><div class="">On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <<a href="mailto:ghoflehner@apple.com" target="_blank" class="">ghoflehner@apple.com</a>> wrote:</div><br class=""><div class=""><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important" class="">I was wondering if there were a performance advantage to have a higher level representation of min/max. My hunch is that recognizing the idiom at the low-level is ‘good enough’.</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px" class=""></div></blockquote></div><br class=""></div><div style="word-wrap:break-word" class=""><div class="">I suspect (aside from vectorization) the impact to be minimal for integer min/max.  For floating point, I expect there to be greater upside because of ISA level support, and because the SW emulation expansion is typically complex.</div></div><div style="word-wrap:break-word" class=""><div class=""><br class=""></div><div class="">—Owen</div></div></blockquote></div></div></blockquote></div></div></div></blockquote></div></div></div></blockquote></div><br class=""></div></div></blockquote></div>

</div></blockquote></div><br class=""></div></div></body></html>