<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Can we agree that late matching is better than what we have now? If
so, I would suggest we implement that and then reassess. (I'm
assuming that the work for late matching isn't material enough that
"wasting" the effort if we do decide to move towards early matching
is a problem.) <br>
<br>
Owen, I disagree with you. Beyond the engineering complexity part -
which is a very real concern - I do not believe that min/max is the
*right* canonicalization. If a comparison is loop invariant, I
really really want the select to be a standard unswitch idiom rather
than having to teach the unswitcher to reason about an entirely new
class of control flow (that embedded within another intrinsic.)
(LoopUnswitch is used only as an example here. The point applies to
any control flow aware transform.)<br>
<br>
If we support min/max as canonical form, where do we stop? ABS?
NABS? SAD? Saturating arithmetic? Overflow intrinsics? I don't
see how min/max are any different than the others. <br>
<br>
Again, we need to move this discussion to llvm-dev. This has grown
beyond the scope of a patch review.<br>
<br>
Philip<br>
<br>
<div class="moz-cite-prefix">On 04/27/2015 10:32 AM, Owen Anderson
wrote:<br>
</div>
<blockquote cite="mid:42A8DA96-FA9E-40D0-AEDC-19E5D4E8D395@mac.com"
type="cite">
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
I’m going to go say that I strongly *disagree* with the late
matching concept. This seems like a clear-cut canonicalization vs
lowering situation. Matching min/max operations seems like a
straight-forward canonicalization, in that many programs that are
semantically equivalent but represented by slightly different IR
patterns would be unified after min/max matching. The issue of,
for example, ICMP reuse, seems like a lowering detail, since the
relevance of that transformation is dependent on whether the
target does or does not have fused min/max instructions.
<div class=""><br class="">
</div>
<div class="">—Owen<br class="">
<div class=""><br class="">
</div>
<div class=""><br class="">
<div>
<blockquote type="cite" class="">
<div class="">On Apr 27, 2015, at 8:23 AM, James Molloy
<<a moz-do-not-send="true"
href="mailto:james@jamesmolloy.co.uk" class="">james@jamesmolloy.co.uk</a>>
wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div dir="ltr" class="">Hi all,<br class="">
<br class="">
Thanks for all the comments! David, I think that was
the main hurdle. With that instcombine fixed, the
matching code becomes much simpler. I think there's
only a couple of things to do:
<div class=""><br class="">
</div>
<div class=""> * Add new intrinsics and SDNodes for
[su]{min,max} - <a moz-do-not-send="true"
href="http://reviews.llvm.org/D9293" class="">http://reviews.llvm.org/D9293</a></div>
<div class=""> * Pull min/max matching out of
InstCombine and into somewhere more accessible - <a
moz-do-not-send="true"
href="http://reviews.llvm.org/D9294" class="">http://reviews.llvm.org/D9294</a></div>
<div class=""> * Match min/max patterns in
CodeGenPrepare into intrinsics, if the intrinsic
version is not marked "expand". - WIP, still writing
tests.</div>
<div class=""> * Enable min/max generation for
AArch64 and other targets - WIP, still writing
tests.</div>
<div class=""><br class="">
</div>
<div class="">Gerolf, I'm hoping the above will really
not take long. If this gets approved in concept, the
actual mechanics are very easy - I have two out of
four patches ready to go at the moment and the rest
are implemented, just needing more testing. Does it
make sense to hold off your target-specific version
for a few days and let this one go in instead,
barring major objections?</div>
<div class=""><br class="">
</div>
<div class="">Cheers,</div>
<div class=""><br class="">
</div>
<div class="">James</div>
</div>
<br class="">
<div class="gmail_quote">On Mon, 27 Apr 2015 at 02:40
Gerolf Hoflehner <<a moz-do-not-send="true"
href="mailto:ghoflehner@apple.com" class="">ghoflehner@apple.com</a>>
wrote:<br class="">
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">
<div style="word-wrap:break-word" class="">Hi James,
<div class=""><br class="">
</div>
<div class="">sounds like you have a number of
good reasons for the intrinsics. Having them
would allow a) to catch more complex pattern and
b) gain more insight into the impact that design
on the rest of the compiler. So initially there
would be a mix of higher level and low level
min/max recognition while the design evolves and
potentially addresses eg. Phillips concerns.
While all this is in progress, do you think it
is OK to commit the target speciifc vmin/vmax
patch I sent earlier?</div>
<div class=""><br class="">
</div>
<div class="">Cheers</div>
</div>
<div style="word-wrap:break-word" class="">
<div class="">Gerolf</div>
</div>
<div style="word-wrap:break-word" class="">
<div class=""><br class="">
<div class="">
<blockquote type="cite" class="">
<div class="">On Apr 26, 2015, at 2:15 AM,
James Molloy <<a moz-do-not-send="true"
href="mailto:james@jamesmolloy.co.uk"
target="_blank" class="">james@jamesmolloy.co.uk</a>>
wrote:</div>
<br class="">
<div class="">
<div dir="ltr"
style="font-family:Helvetica;font-size:18px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"
class="">Hi Gerolf,<br class="">
<br class="">
<div class="gmail_quote">On Sun, 26 Apr
2015 at 05:10 Gerolf Hoflehner <<a
moz-do-not-send="true"
href="mailto:ghoflehner@apple.com"
target="_blank" class="">ghoflehner@apple.com</a>>
wrote:<br class="">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">Please see below.
<div class=""><br class="">
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">On Apr 25,
2015, at 1:44 AM, James
Molloy <<a
moz-do-not-send="true"
href="mailto:james@jamesmolloy.co.uk"
target="_blank" class="">james@jamesmolloy.co.uk</a>>
wrote:</div>
<br class="">
<div class="">
<div dir="ltr" class="">Hi,<br
class="">
<br class="">
I've also previously
gone down the lines of a
target-isel approach and
had a patch that pretty
much worked. I'll
explain why I now think
this is better done
early in the IR. I
should say though that I
could be convinced to go
back to the target-isel
approach if there are
better reasons :)</div>
</div>
</blockquote>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<div class="">I’m more curious
to hear about your arguments
and in particular what
performance headroom you
foresee. One of our CPU
architects had pointed out
that we could generate
smax/umax etc., which took
me longer than it should
have. So I thought about
doing it at a higher level
also, looked around, looked
again, scratched my head and
relatively quickly talked
myself into something like
'lot of work, little
headroom, target level
support good enough' :-)</div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">The performance
improvements I see and the rationale
for this is where you have multiple
mins/maxs in series. Several
benchmarks I have such as image
filters use min/max to implement
clamp(); obviously image filters
tend to be very small so every extra
instruction counts. Similarly
another image filter uses a ternary
min (min(a,b,c)), which ends up as
min/min. That currently doesn't get
matched either.</div>
<div class=""><br class="">
</div>
<div class="">I can see min/max
reductions as being easier to
identify (and either vectorize or
reassociate, both having big speedup
potential). Finally and perhaps the
biggest headroom is in saturation.
Matching the idiom clamp(x, 0, 255),
you can generate a VQMOV. Even
better, if the datatype is then
shrunk as it often is after such a
clamp, you end up with one VQMOVN.
That's 5 instructions -> 1
instruction.</div>
<div class=""><br class="">
</div>
<div class="">Obviously the effect
this would have depends on the
domain - image filters (such as
sharpen from Geekbench) are where
this is really going to give
dividends. But even for
non-image-filters, there's still a
hygiene factor where we really
should be generating the right thing
even if we don't get a massive
performance boost.</div>
<div class=""> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
<div class="">1.
Matching min/max
patterns can be
awkward in the
presence of constants.
Consider the idiomatic
max(x, 0) with a
conversion after it:</div>
<div class=""> </div>
<div class=""> <span
class=""> </span>%1
= fcmp ogt float 0.0,
float %x</div>
<div class=""> <span
class=""> </span>%2
= select i1 %1, float
%x, float 0.0</div>
<div class=""> <span
class=""> </span>%3
= fptoui float %2 to
i32</div>
<div class=""><br
class="">
</div>
<div class="">InstCombine
will push the fptoui
up:</div>
<div class=""><br
class="">
</div>
<div class=""> <span
style="font-size:13.1999998092651px;line-height:19.7999992370605px"
class="">%1 = fcmp
ogt float 0.0, float
%x</span></div>
<div class=""><span
style="font-size:13.1999998092651px;line-height:19.7999992370605px"
class=""> <span
class=""> </span>%2
= fptoui float %x to
i32</span><span
style="font-size:13.1999998092651px;line-height:19.7999992370605px"
class=""><br
class="">
</span></div>
<div
style="font-size:13.1999998092651px;line-height:19.7999992370605px"
class=""> <span
class=""> </span>%3
= select i1 %1, i32
%2, i32 0</div>
<div
style="font-size:13.1999998092651px;line-height:19.7999992370605px"
class=""> </div>
<div class="">Now, you
have a much more
complex pattern to
match. You have to
identify that "i32 0"
is the same as "float
0.0" with "fptoui"
applied to it. This
gets even more complex
when you have nested
mins and maxes, for
example in "clamp(x,
0, 255)". It was this
that I was having
difficulty efficiently
and reliably matching
at the ISel level, and
what convinced me to
move higher up the
pipe where at least
testing was easier.</div>
<div class=""><br
class="">
</div>
<div class="">Now
actually, we can do
even better in IR and
move further still up
the pipe, and run
before InstCombine.
Then, we don't need
any complex matching
at all and everything
is simpler.</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<div class="">I think of this
as short-coming of inst
combine. It could convert
the float to int before the
cmp resulting in int
cmp/select. Since this an
example for fp -> int
conversion, here is another
thought: It seems that this
fmin/max -> imin/max
conversion is always the
right thing to do, and the
IR level is appropriate, but
in some cases that depends
on the HW. For example, if
there are two instances of
that pattern one might want
to convert one to fp and one
to int so both can execute
in parallel (assuming the HW
provides the proper
resources). </div>
</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
<div class="">Yes, I think that's
right, and I was going to control
this with a target hook. I'm
seriously considering whether my
pass might not be better written as
an instcombine rather than a
standalone pass... haven't come to
any conclusions yet.</div>
<div class=""> </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<div class=""><br class="">
</div>
<div class="">The issue with
complex pattern recognition
at the ISEL level: does this
point to a general weakness
that also effect over
pattern, or is this an
isolated instance min/max at
IR level will take care of?</div>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class=""><br class="">
</div>
</div>
</div>
</blockquote>
<div class="">I think it's generally
that while there are one or two
patterns you'd need to match at IR
level, because there are more,
specialized SDAG nodes the problem
becomes larger at that point. Also,
it just stands to reason that if you
have a hard-to-match sequence, the
longer you keep it around unmatched
the higher the likelihood of it
being destroyed when you want to
match it.</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
<div class="">2. As I've
mentioned, this
approach allows
anything using the
cost model to be more
accurate. This
currently is just the
vectorizers but I see
other users coming
along soon like loop
distribution and loop
fusion.</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">Would be good if
you could share examples where
the difference matters. I
expect we’ll have to look into
the cost model much more
closely soon.</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class=""><br class="">
</div>
</div>
</div>
</blockquote>
<div class="">I don't have explicit
examples right now. I'm more
thinking of hygiene factor - making
the model more accurate is a good
thing regardless of immediate
benefit. But it should also help the
register pressure estimation code
get a better result. </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
<div class="">3. This
simplifies pattern
matching in backends
with not just min/max
support, but also with
explicit "clamp"
support such as GPUs.
GPUs will often have
(almost free)
clamp_0_255
instructions or
modifiers as this is
used extremely often
in image filters and
shaders.</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">Ok, I take your
word for it.</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class=""><br class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
<div class="">4. Doing
this at the IR level
is immediately useful
for all backends, not
just (one of) AArch64
or AArch32.</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">Ok, although I
expect all targets for which
it matters have that support.</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class=""><br class="">
</div>
</div>
</div>
</blockquote>
<div class="">Right, although
demonstrably it doesn't work at the
moment, at least not for AArch64 and
AArch32. My experiments have shown
that X86 isn't fantastic either.
Isolated mins/maxes, yes. Coupled
mins/maxes? no. </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
<div class="">5. To make
sure the clamp() idiom
got matched, I upped
the amount that
SimplifyCFG would
speculate (in
r229099). This caused
some slight
controversy - it is
still in-tree but
there is still the
potential of pulling
it out if it is found
that it was too
aggressive. The IR
approach means we
don't actually need it
and can pattern match
across basic blocks,
which we can't do in
ISel (and it'd be too
late by then because
all the vectorizers
will have bailed out).</div>
</div>
</div>
</blockquote>
<div class=""><br class="">
</div>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">That is the
strongest argument I see so
far. There is a dependence
between optimizations. And the
example will come where there
is just not enough speculation
in SimplifyCFG for the next
pattern to get recognized.</div>
</div>
</div>
</blockquote>
<div class="">Right. min+min+min
(horizontal reduce) won't currently
get speculated. </div>
<div class=""><br class="">
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class=""><br class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
<div class="">As you've
said Gerolf, we have
patchy max/min/abs
matching and handling
in many places in
codegen; it would be
nice if this approach
could remove them all.
My only worry is that
at the moment, if a
target does not
support the new
intrinsics they will
be expanded, but will
be expanded only after
the DAG is
type-legalized and so
may miss out on some
scalar optimizations.
I'm not sure how
well-founded this
worry is though.</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">Surprises will
happen anyway with a (re-)
design like that. Just the
scraps to deal with when they
happen.</div>
<div class=""><br class="">
</div>
<div class="">Would the new
design make most the HW
specific int/fp/vector
intrinsics in ./inc </div>
</div>
</div>
</blockquote>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">lude/llvm/IR
redundant?</div>
</div>
</div>
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class=""><br
class="">
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
<div class="">I certainly expect the
new intrinsics to map directly to
the smin/umin NEON intrinsics in the
A32 and A64 backends. It'd be nice
if that was true for other
architectures too but I just don't
know. </div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div style="word-wrap:break-word"
class="">
<div class="">
<div class="">
<blockquote type="cite"
class="">
<div class="">
<div dir="ltr" class="">
<div class="">Cheers,</div>
<div class=""><br
class="">
</div>
<div class="">James</div>
<div class=""><br
class="">
</div>
</div>
<br class="">
<div class="gmail_quote">On
Sat, 25 Apr 2015 at
04:19 Owen Anderson <<a
moz-do-not-send="true"
href="mailto:resistor@mac.com" target="_blank" class="">resistor@mac.com</a>>
wrote:<br class="">
<blockquote
class="gmail_quote"
style="margin:0px 0px
0px
0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">
<div
style="word-wrap:break-word"
class=""><br
class="">
<div class="">
<blockquote
type="cite"
class="">
<div class="">On
Apr 24, 2015,
at 2:19 PM,
Gerolf
Hoflehner <<a
moz-do-not-send="true" href="mailto:ghoflehner@apple.com"
target="_blank"
class="">ghoflehner@apple.com</a>>
wrote:</div>
<br class="">
<div class=""><span
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;float:none;display:inline!important"
class="">I was
wondering if
there were a
performance
advantage to
have a higher
level
representation
of min/max. My
hunch is that
recognizing
the idiom at
the low-level
is ‘good
enough’.</span><br
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;line-height:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"
class="">
</div>
</blockquote>
</div>
<br class="">
</div>
<div
style="word-wrap:break-word"
class="">
<div class="">I
suspect (aside
from
vectorization) the
impact to be
minimal for
integer min/max.
For floating
point, I expect
there to be
greater upside
because of ISA
level support, and
because the SW
emulation
expansion is
typically complex.</div>
</div>
<div
style="word-wrap:break-word"
class="">
<div class=""><br
class="">
</div>
<div class="">—Owen</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<br>
<pre wrap="">_______________________________________________
llvm-commits mailing list
<a class="moz-txt-link-abbreviated" href="mailto:llvm-commits@cs.uiuc.edu">llvm-commits@cs.uiuc.edu</a>
<a class="moz-txt-link-freetext" href="http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits">http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits</a>
</pre>
</blockquote>
<br>
</body>
</html>