<div dir="ltr">Hi,<br><br>I've also previously gone down the lines of a target-isel approach and had a patch that pretty much worked. I'll explain why I now think this is better done early in the IR. I should say though that I could be convinced to go back to the target-isel approach if there are better reasons :)<div><br></div><div>1. Matching min/max patterns can be awkward in the presence of constants. Consider the idiomatic max(x, 0) with a conversion after it:</div><div>  </div><div>  %1 = fcmp ogt float 0.0, float %x</div><div>  %2 = select i1 %1, float %x, float 0.0</div><div>  %3 = fptoui float %2 to i32</div><div><br></div><div>InstCombine will push the fptoui up:</div><div><br></div><div>  <span style="font-size:13.1999998092651px;line-height:19.7999992370605px">%1 = fcmp ogt float 0.0, float %x</span></div><div><span style="font-size:13.1999998092651px;line-height:19.7999992370605px">  %2 = fptoui float %x to i32</span><span style="font-size:13.1999998092651px;line-height:19.7999992370605px"><br></span></div><div style="font-size:13.1999998092651px;line-height:19.7999992370605px">  %3 = select i1 %1, i32 %2, i32 0</div><div style="font-size:13.1999998092651px;line-height:19.7999992370605px">  </div><div>Now, you have a much more complex pattern to match. You have to identify that "i32 0" is the same as "float 0.0" with "fptoui" applied to it. This gets even more complex when you have nested mins and maxes, for example in "clamp(x, 0, 255)". It was this that I was having difficulty efficiently and reliably matching at the ISel level, and what convinced me to move higher up the pipe where at least testing was easier.</div><div><br></div><div>Now actually, we can do even better in IR and move further still up the pipe, and run before InstCombine. Then, we don't need any complex matching at all and everything is simpler.</div><div><br></div><div>2. As I've mentioned, this approach allows anything using the cost model to be more accurate. This currently is just the vectorizers but I see other users coming along soon like loop distribution and loop fusion.</div><div><br></div><div>3. This simplifies pattern matching in backends with not just min/max support, but also with explicit "clamp" support such as GPUs. GPUs will often have (almost free) clamp_0_255 instructions or modifiers as this is used extremely often in image filters and shaders.</div><div><br></div><div>4. Doing this at the IR level is immediately useful for all backends, not just (one of) AArch64 or AArch32.</div><div><br></div><div>5. To make sure the clamp() idiom got matched, I upped the amount that SimplifyCFG would speculate (in r229099). This caused some slight controversy - it is still in-tree but there is still the potential of pulling it out if it is found that it was too aggressive. The IR approach means we don't actually need it and can pattern match across basic blocks, which we can't do in ISel (and it'd be too late by then because all the vectorizers will have bailed out).</div><div><br></div><div>As you've said Gerolf, we have patchy max/min/abs matching and handling in many places in codegen; it would be nice if this approach could remove them all. My only worry is that at the moment, if a target does not support the new intrinsics they will be expanded, but will be expanded only after the DAG is type-legalized and so may miss out on some scalar optimizations. I'm not sure how well-founded this worry is though.</div><div><br></div><div>Cheers,</div><div><br></div><div>James</div><div><br></div></div><br><div class="gmail_quote">On Sat, 25 Apr 2015 at 04:19 Owen Anderson <<a href="mailto:resistor@mac.com">resistor@mac.com</a>> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div style="word-wrap:break-word"><br><div><blockquote type="cite"><div>On Apr 24, 2015, at 2:19 PM, Gerolf Hoflehner <<a href="mailto:ghoflehner@apple.com" target="_blank">ghoflehner@apple.com</a>> wrote:</div><br><div><span style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important">I was wondering if there were a performance advantage to have a higher level representation of min/max. My hunch is that recognizing the idiom at the low-level is ‘good enough’.</span><br style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"></div></blockquote></div><br></div><div style="word-wrap:break-word"><div>I suspect (aside from vectorization) the impact to be minimal for integer min/max.  For floating point, I expect there to be greater upside because of ISA level support, and because the SW emulation expansion is typically complex.</div></div><div style="word-wrap:break-word"><div><br></div><div>—Owen</div></div></blockquote></div>