<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div><br></div><div><br>On Apr 30, 2014, at 10:34 PM, Philip Reames &lt;<a href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>&gt; wrote:<br><br></div><blockquote type="cite"><div>
<meta content="text/html; charset=ISO-8859-1" http-equiv="Content-Type">
<div class="moz-cite-prefix">Andy - If you're not already following
this closely, please start. We've gotten into fairly fundamental
questions of what a patchpoint does. <br>
<br>
Filip, <br>
<br>
I think you've hit the nail on the head. What I'm thinking of as
being patchpoints are not what you think they are. Part of that
is that I've got a local change which adds a very similar
construction (called "statepoints" for the moment), but I was
trying to keep that separate. That also includes a lot of GC
semantics which are not under discussion currently. My apologies
if that experience bled over into this conversation and made
things more confusing. <br>
<br>
I will note that the documentation for patchpoint says explicitly
the following:<br>
"The ‘<tt class="docutils literal"><span class="pre">llvm.experimental.patchpoint.*</span></tt>‘
intrinsics creates a function
call to the specified <tt class="docutils literal"><span class="pre">&lt;target&gt;</span></tt> and records the
location of specified
values in the stack map."<br></div></div></blockquote><div><br></div><div>I'm not disputing that the patch point will *initially* call the thing you want. But the point of the patch point - and the reason why the word "patch" is part of the name - is that you're allowed to modify the machine code in the patch point from the client JIT. This presumes that the LLVM optimizer cannot infer semantics from the call target. </div><div><br></div><div>Once we started using the patch points in WebKit, we quickly realized that the only call target that makes sense is null: our inline cache state machine may return to "just call a function" so our client JIT might as well have full control of what such a call looks like. So we overwrite all of the machine code that LLVM generates for the patch point and passing null means that LLVM only emits nops. </div><br><blockquote type="cite"><div><div class="moz-cite-prefix">
<br>
My reading has always been that a patchpoint *that isn't patched*
is simply a call with a stackmap associated with it. To my
reading, this can (and did, and does) indicate my proposed usage
would be legal. <br></div></div></blockquote><div><br></div><div>Yes and I agree, but any optimizations in LLVM based on the call target would be illegal. </div><br><blockquote type="cite"><div><div class="moz-cite-prefix">
<br>
I will agree that I've confused the topic badly on the
optimization front. My "statepoint" isn't patchable, so a lot
more optimizations are legal. Sorry about that. To restate what
I think you've been saying all along, the optimizer can't make
assumptions about what function is called by a patchpoint because
that might change based on later patching. Is this the key point
you've been trying to make?<br></div></div></blockquote><div><br></div><div>Yup!</div><br><blockquote type="cite"><div><div class="moz-cite-prefix">
<br>
I'm not objecting to separating "my patchpoint" from "your
patchpoint". Let's just hammer out the semantics of each first.
:)<br>
<br>
Again, longer response to follow in a day or so. :)<br>
<br>
Philip<br>
<br>
On 04/30/2014 10:09 PM, Filip Pizlo wrote:<br>
</div>
<blockquote cite="mid:etPan.5361d71d.1d4ed43b.172db@dethklok.local" type="cite">
<style>body{font-family:Helvetica,Arial;font-size:13px}</style>
<div id="bloop_customfont" style="font-family:Helvetica,Arial;font-size:13px; color:
rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br>
</div>
<br>
<p style="color:#000;">On April 30, 2014 at 9:06:20 PM, Philip
Reames (<a moz-do-not-send="true" href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0,
0, 0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height: normal;
orphans: auto; text-align: start; text-indent: 0px;
text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>
<div class="moz-cite-prefix">On 04/29/2014 12:39 PM,
Filip Pizlo wrote:<br>
</div>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">On April 29, 2014 at 11:27:06 AM, Philip
Reames (<a moz-do-not-send="true" href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);">
<div bgcolor="#FFFFFF" text="#000000">
<div>
<div class="moz-cite-prefix"><span>On
04/29/2014 10:44 AM, Filip Pizlo wrote:<br>
</span></div>
<blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local" type="cite">
<div id="bloop_customfont" style="font-family: Helvetica, Arial;
font-size: 13px; color: rgb(0, 0, 0);
margin: 0px;"><span>TL;DR: Your desire
to use trapping on x86 only further
convinces me that Michael's proposed
intrinsics are the best way to go.</span></div>
</blockquote>
<span>I'm still not convinced, but am not
going to actively oppose it either. I'm
leery of designing a solution with major
assumptions we don't have data to back up. <br>
<br>
I worry your assumptions about
deoptimization are potentially unsound.
But I don't have data to actually show
this (yet).</span></div>
</div>
</blockquote>
</div>
<p>I *think* I may have been unclear about my
assumptions; in particular, my claims with respect
to deoptimization are probably more subtle than
they appeared. WebKit can use LLVM and it has
divisions and we do all possible
deoptimization/profiling/etc tricks for it, so
this is grounded in experience. Forgive me if the
rest of this e-mail contains a lecture on things
that are obvious - I'll try to err on the side of
clarity and completeness since this discussion is
sufficiently dense that we run the risk of talking at
cross-purposes unless some baseline assumptions
are established.</p>
</blockquote>
I think we're using the same terminology, but with
slightly different sets of assumptions. I'll point
this out below where relevant. <br>
<br>
Also, thanks for taking the time to expand. It helps
clarify the discussion quite a bit. </div>
</div>
</span></blockquote>
</div>
<p>I think we may be converging to an understanding of what you
want versus what I want, and I think that there are some
points - possibly unrelated to division - that are worth
clarifying. I think that the main difference is that when I
say "patchpoint", I am referring to a concrete intrinsic with
specific semantics that cannot change without breaking WebKit,
while you are using the term to refer to a broad concept, or
rather, a class of as-yet-unimplemented intrinsics that share
some of the same features with patchpoints but otherwise have
incompatible semantics.</p>
<p>Also, when I say that you wouldn't want to use the existing
patchpoint to do your trapping deopt, what I mean is that the
performance of doing this would suck for reasons that are not
related to deoptimization or trapping. I'm not claiming that
deoptimization performs poorly (trust me, I know better) or
that trapping to deoptimize is bad (I've done this many, many
times and I know better). I'm saying that with the current
patchpoint intrinsics in LLVM, as they are currently specified
and implemented, doing it would be a bad idea because you'd
have to compromise a bunch of other optimizations to achieve
it.</p>
<p>You have essentially described new intrinsics that would make
this less of a bad idea, and I am interested in your
intrinsics, so I'll try both to explain why patchpoints
don't currently give you what you want (and why simply
changing patchpoint semantics would be evil) and to comment
on what I think of the intrinsic that you're
effectively proposing. Long story short, I think you should
formally propose your intrinsic even if it's not completely
fleshed out. I think that it's an interesting capability and
in its most basic form, it is a simple variation of the
current patchpoint/stackmap intrinsics.</p>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0,
0, 0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height: normal;
orphans: auto; text-align: start; text-indent: 0px;
text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none;
white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width:
0px; background-color: rgb(255, 255, 255);">
<div bgcolor="#FFFFFF" text="#000000">
<div><span><br class="Apple-interchange-newline">
<br>
<br>
</span>
<blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local" type="cite"><span><br>
</span>
<p style="color: rgb(0, 0, 0);"><span>On
April 29, 2014 at 10:09:49 AM,
Philip Reames (<a moz-do-not-send="true" href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:</span></p>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant:
normal; font-weight: normal;
letter-spacing: normal; line-height:
normal; orphans: auto; text-align:
start; text-indent: 0px;
text-transform: none; white-space:
normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255,
255);">
<div bgcolor="#FFFFFF" text="#000000">
<div><span><span>As the discussion
has progressed and I've
spent more time thinking
about the topic, I find
myself less and less
enthused about the current
proposal. I'm in full
support of having idiomatic
ways to express safe
division, but I'm starting
to doubt that using an
intrinsic is the right way
at the moment.<br>
<br>
One case I find myself
thinking about is how one
would combine profiling
information and implicit
div-by-zero/overflow checks
with this proposal. I don't
really see a clean way.
Ideally, for a "safe div"
which never has the
exceptional paths taken,
you'd like to completely do
away with the control flow
entirely. (And rely on
hardware traps w/exceptions
instead.) I don't really
see a way to represent that
type of construct given the
current proposal. </span></span></div>
</div>
</blockquote>
</div>
<p>This is a deeper problem and to solve
it you'd need a solution to trapping
in general. Let's consider the case
of Java. A Java program may want to
catch the arithmetic exception due to
divide by zero. How would you do this
with a trap in LLVM IR? Spill all
state that is live at the catch? Use
a patchpoint for the entire division
instruction?</p>
</blockquote>
We'd likely use something similar to a
patchpoint. You'd need the "abstract vm
state" (which is not the compiled frame
necessarily) available at the div
instruction. You could then re-enter the
interpreter at the specified index (part
of the vm state). We have most of
these mechanisms in place. Ideally, you'd
trigger a recompile and otherwise ensure
re-entry into compiled code at the soonest
possible moment. <br>
<br>
This requires a lot of runtime support,
but we already have most of it implemented
for another compiler. From our
perspective, the runtime requirements are
not a major blocker. </div>
</div>
</blockquote>
</div>
<p>Right, you'll use a patchpoint. That's way
more expensive than using a safe division
intrinsic with branches, because it's opaque to
the optimizer.</p>
</div>
</blockquote>
This statement is true at the moment, but it shouldn't
be. I think this is our fundamental difference in
approach. <br>
<br>
You should be able to write something like:<br>
i32 %res = invoke patchpoint (... x86_trapping_divide,
a, b) normal_dest invoke_dest<br>
<br>
normal_dest:<br>
;; use %res<br>
invoke_dest:<br>
landingpad<br>
;; dispatch edge cases<br>
;; this could be unreachable code if you deopt this
frame in the trap handler and jump directly to an
interpreter or other bit of code</div>
</div>
</span></blockquote>
</div>
</div>
<p>I see. It sounds like you want a generalization of the
"div.with.stackmap" that I thought you wanted - you want to be
able to wrap anything in a stackmap.</p>
<p>The current patchpoint intrinsic does not do this, and you run
the risk of breaking existing semantics if you changed this.
You'd probably break WebKit, which treats the call target of
the patchpoint as nothing more than a quirk - we always pass
null. Also, the current patchpoint treats the callee as an i8*
if I remember right and it would be super weird if all LLVM
phases had to interpret this i8* by unwrapping a possible
bitcast to get to a declared function that may be an intrinsic.
Yuck! Basically, the call target of existing patchpoints is
meant to be a kind of convenience feature rather than the core
of the mechanism.</p>
<p>I agree in principle that the intrinsic that you want would be
a useful intrinsic. But let's not call it a patchpoint for the
purposes of this discussion, and let's not confuse the
discussion by claiming (incorrectly) that the existing
patchpoint facility gives you what you want. It doesn't:
patchpoints are designed to make the call target opaque (hence
the use of i8*) and there shouldn't be a correlation between
what the patchpoint does at run-time and what the called
function would have done. You could make the call target be
null (like WebKit does) and the patchpoint should still mean
"this code can do anything" because the expectation is that the
client JIT will patch over it anyway.</p>
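<p>As a rough sketch of the null-target usage described above (this is
not WebKit's actual code): the ID 7, the 20-byte size, and the function
and value names are made up for illustration, and the syntax is the
typed-pointer form in use around the time of this thread, so newer LLVM
releases will spell the pointer and callee types differently.</p>
<pre>
declare void @llvm.experimental.patchpoint.void(i64, i32, i8*, i32, ...)

define void @ic_site(i64 %base, i64 %value) {
entry:
  ; Hypothetical ID (7) and patchable size (20 bytes).
  ; The target is null, so LLVM reserves the bytes as nops and the
  ; client JIT later overwrites them with whatever code it wants.
  ; numArgs is 0, so %base and %value are only recorded in the stack map.
  call void (i64, i32, i8*, i32, ...)* @llvm.experimental.patchpoint.void(i64 7, i32 20, i8* null, i32 0, i64 %base, i64 %value)
  ret void
}
</pre>
<p>Nothing in this IR tells the optimizer what the patched bytes will
do at run time, which is why it has to treat the site as "this code can
do anything".</p>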
<p>Also, "patchpoint" would probably not be the right term for the
intrinsic that you want. I think that what you want is
"call.with.stackmap". Or maybe "stackmap.wrapper". Or just
"stackmap" - I'd be OK, in principle, with changing the name of
the current "stackmap" intrinsic to something that reflects the
fact that it's less of a stackmap than what you want.</p>
<p>To summarize. A patchpoint's main purpose is that you can
patch it with arbitrary code. The current "stackmap" means that
you can patch it with arbitrary code and that patching may be
destructive to a shadow of machine code bytes, so it's really
just like patchpoints - we could change its name to
"patchpoint.shadow" for example.</p>
<p>If you were to propose such a stackmap intrinsic, then I think
there could be some ways of doing it that wouldn't be too
terrible. Basically you want something that is like a
patchpoint in that it reports a stackmap via a side channel, but
unlike patchpoints, it doesn't allow arbitrary patching -
instead the optimizer should be allowed to assume that the code
within the patchpoint will always do the same thing that the
call target would have done. There are downsides to truly doing
this. For example, to make division efficient with such an
intrinsic, you'd have to do something that is somewhat worse
than just recognizing intrinsics in the optimizer - you'd have
to first recognize a call to your "stackmap wrapper" intrinsic
and then observe that its call target argument is in turn
another intrinsic. To me personally, that's kind of yucky, but
I won't deny that it could be useful.</p>
<p>As to the use of invoke: I don't believe that using invoke
and using my suggested "branch on a trap predicate" idea are
different in any truly meaningful way. I buy that either would
work.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0,
0, 0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height: normal;
orphans: auto; text-align: start; text-indent: 0px;
text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
A patchpoint should not require any excess spilling.
If values are live in registers, that should be
reflected in the stack map. (I do not know if this is
the case for patchpoint at the moment or not.)</div>
</div>
</span></blockquote>
</div>
<p>Patchpoints do not require spilling.</p>
<p>My point was that with existing patchpoints, you *either* use
a patchpoint for the entire division which makes the division
opaque to the optimizer - because a patchpoint means "this
code can do anything" - *or* you could spill stuff to the
stack prior to your trapping division intrinsic, since
spilling is something that you could do as a workaround if you
didn't have a patchpoint.</p>
<p>The reason why I brought up spilling at all is that I suspect
that spilling all state to the stack might be cheaper - for
some systems - than turning the division into a patchpoint.
Turning the division into a patchpoint is horrendously brutal
- the patchpoint looks like it clobbers the heap (which a
division doesn't do), has to execute (a division is an obvious
DCE candidate), cannot be hoisted (hoisting divisions is
awesome), etc. Perhaps most importantly, though, a patchpoint
doesn't tell LLVM that you're *doing a division* - so all
constant folding, strength reduction, and algebraic reasoning
flies out the window. On the other hand, spilling all state
to the stack is an arguably sound and performant solution to a
lot of VM problems. I've seen JVM implementations that ensure
that there is always a copy of state on the stack at some
critical points, just because it makes loads of stuff simpler
(debugging, profiling, GC, and of course deopt). I personally
prefer to stay away from such a strategy because it's not
free.</p>
<p>On the other hand, branches can be cheap. A branch on a
divide is cheaper than not being able to optimize the divide.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:
13px; font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start; text-indent:
0px; text-transform: none; white-space: normal; widows:
auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
The Value called by a patchpoint should participate
in optimization normally. <span class="Apple-converted-space"> </span></div>
</div>
</span></blockquote>
</div>
<p>I agree that you could have a different intrinsic that
behaves like this.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:
13px; font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing: normal;
line-height: normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none; white-space:
normal; widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color:
rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>We really want the patchpoint part of the call
to be supplemental. It should still be a call.
It should be constant propagated, transformed,
etc. This is not the case currently. I've got a
couple of off the wall ideas for improving the
current status, but I'll agree this is a hardish
problem. <br>
<br>
It should be legal to use a patchpoint in an
invoke. It's an ABI issue of how the invoke path
gets invoked. (i.e. side tables for the runtime
to look up, etc.) This is not possible today, and
probably requires a fair amount of work. Some of
it, I've already done and will be sharing
shortly. Other parts, I haven't even thought
about. </div>
</div>
</span></blockquote>
</div>
<p>Right, it's significantly more complex than either the
existing patchpoints or Michael's proposed safe.div.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family: Helvetica, Arial;
font-size: 13px; font-style: normal; font-variant:
normal; font-weight: normal; letter-spacing: normal;
line-height: normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none; white-space:
normal; widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color:
rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
If you didn't want to use the trapping
semantics, you'd insert dedicated control flow
_before_ the divide. This would allow normal
optimization of the control flow. <br>
<br>
Notes:<br>
1) This might require a new PATCHPOINT pseudo op
in the backend. Haven't thought much about that
yet.<br>
2) I *think* your current intrinsic could be
translated into something like this. (Leaving
aside the question of where the deopt state
comes from.) In fact, the more I look at this,
the less difference I actually see between the
approaches. <br>
<br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0);
font-family: Helvetica, Arial;
font-size: 13px; font-style: normal;
font-variant: normal; font-weight:
normal; letter-spacing: normal;
line-height: normal; orphans: auto;
text-align: start; text-indent: 0px;
text-transform: none; white-space:
normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);">
<div bgcolor="#FFFFFF" text="#000000">
<div>
<blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local" type="cite">
<p><span><br class="Apple-interchange-newline">
In a lot of languages, a
divide produces some result
even in the exceptional case
and this result requires
effectively deoptimizing
since the result won't be the
one you would have predicted
(double instead of int, or
BigInt instead of small
int), which sort of means
that if the CPU exception
occurs you have to be able
to reconstruct all state. A
patchpoint could do this,
and so could spilling all
state to the stack before
the divide - but both are
very heavy hammers that are
sure to be more expensive
than just doing a branch.</span></p>
</blockquote>
<span>This isn't necessarily as
expensive as you might believe.
I'd recommend reading the Graal
project papers on this topic.<br>
<br>
Whether deopt or branching is
more profitable *in this case*,
I can't easily say. I'm not yet
to the point of being able to
run that experiment. We can
argue about what "should" be
better all we want, but real
performance data is the only way
to truly know. </span></div>
</div>
</blockquote>
</div>
<p>My point may have been confusing. I
know that deoptimization is cheap and
WebKit uses it everywhere, including
division corner cases, if profiling
tells us that it's profitable to do so
(which it does, in the common case).
WebKit is a heavy user of
deoptimization in general, so you don't
need to convince me that it's worth it.</p>
</div>
</div>
</blockquote>
Acknowledged. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>Note that I want *both* deopt *and*
branching, because in this case, a
branch is the fastest overall way of
detecting when to deopt. In the future,
I will want to implement the deopt in
terms of branching, and when we do this,
I believe that the most sound and
performant approach would be using
Michael's intrinsics. This is subtle
and I'll try to explain why it's the
case.</p>
<p>The point is that you wouldn't want to
do deoptimization by spilling state on
the main path or by using a patchpoint
for the main path of the division.</p>
</div>
</div>
</blockquote>
This is the main point I disagree with. I don't
believe that having a patchpoint on the main
path should be any more expensive than the
original call. (see above)</div>
</div>
</span></blockquote>
</div>
<p>The reason why the patchpoint is expensive is that if
you use a patchpoint to implement a division then the
optimizer won't be allowed to assume that it's a
division, because the whole point of "patchpoint" is to
tell the optimizer to piss off.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family: Helvetica, Arial;
font-size: 13px; font-style: normal; font-variant:
normal; font-weight: normal; letter-spacing: normal;
line-height: normal; orphans: auto; text-align:
start; text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
Worth noting explicitly: I'm assuming that all
of your deopt state would already be available
for other purposes in nearby code. It's on
the stack or in registers. I'm assuming that
by adding the deopt point, you are not
radically changing the set of computations
which need to be done. If that's not the
case, you should avoid deopt and instead just
inline the slow paths with explicit checks. </div>
</div>
</span></blockquote>
</div>
<p>Yes, of course it is. That's not the issue.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight: normal;
letter-spacing: normal; line-height: normal;
orphans: auto; text-align: start; text-indent:
0px; text-transform: none; white-space: normal;
widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color:
rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
I'll note that given your assumptions about
the cost of a patchpoint, the rest of your
position makes a lot more sense. :) As I
spelled out above, I believe this cost is
not fundamental. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>You don't want the common path of
executing the division to involve a
patchpoint instruction, although
using a patchpoint or stackmap to
implement deoptimization on the
failing path is great:</p>
<p><b>Good:</b><span class="Apple-converted-space"> </span>if
(division would fail) { call
@patchpoint(all of my state) } else
{ result = a / b }</p>
</div>
</div>
</blockquote>
Given your cost assumptions, I'd agree. </div>
</div>
</span></blockquote>
</div>
<p>Not my cost assumptions. The reason why this is
better is that the division is expressed in LLVM IR
so that LLVM can do useful things to it - like
eliminate it, for example.</p>
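<p>A minimal sketch of the "Good" shape, assuming a 32-bit signed
divide: the function name, the stackmap ID 1, and the extra live value
%live are hypothetical, and the syntax is again the typed-pointer form
from around the time of this thread. It also assumes the runtime patches
the stackmap site so that control never falls through to the
unreachable.</p>
<pre>
declare void @llvm.experimental.stackmap(i64, i32, ...)

define i32 @checked_div(i32 %a, i32 %b, i32 %live) {
entry:
  ; The two cases in which a signed 32-bit divide would fail.
  %is_zero = icmp eq i32 %b, 0
  %a_is_min = icmp eq i32 %a, -2147483648
  %b_is_m1 = icmp eq i32 %b, -1
  %overflow = and i1 %a_is_min, %b_is_m1
  %fail = or i1 %is_zero, %overflow
  br i1 %fail, label %slow, label %fast

fast:
  ; A plain sdiv: LLVM can fold it, hoist it, or delete it if unused.
  %res = sdiv i32 %a, %b
  ret i32 %res

slow:
  ; Deopt state is reported only on the failing path.
  call void (i64, i32, ...)* @llvm.experimental.stackmap(i64 1, i32 0, i32 %a, i32 %b, i32 %live)
  unreachable
}
</pre>
<p>If %b is a known nonzero constant other than -1, the branch can fold
away and only the sdiv remains, which is the kind of win that is lost
when the divide is hidden inside a patchpoint.</p>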
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p><b><br class="Apple-interchange-newline">
Bad:</b><span class="Apple-converted-space"> </span>call
@patchpoint(all of my state) //
patch with a divide instruction -
bad because the optimizer has no
clue what you're doing and assumes
the very worst</p>
</div>
</div>
</blockquote>
Yuck. Agreed. </div>
</div>
</span></blockquote>
</div>
<p>To be clear, this is what you're proposing -
except that you're assuming that LLVM will know
that you've patched a division because you're
expecting the call target to have semantic
meaning. Or, rather, you're expecting that you
can make the contents of the patchpoint be a
division by having the call target be a division
intrinsic. In the current implementation and as
it is currently specified, the call target has no
meaning and so you get the yuck that I'm showing.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none;
white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width:
0px; background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p><b><br class="Apple-interchange-newline">
Worse:</b><span class="Apple-converted-space"> </span>spill
all state to the stack; call
@trapping.div(a, b) // the
spills will hurt you far more
than a branch, so this should be
avoided</p>
</div>
</div>
</blockquote>
I'm assuming this is an explicit spill
rather than simply recording a stack map
*at the div*. If so, agreed. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>I suppose we could imagine a
fourth option that involves a
patchpoint to pick up the state
and a trapping divide
intrinsic. But a trapping
divide intrinsic alone is not
enough. Consider this:</p>
<p>result = call @trapping.div(a,
b); call @stackmap(all of my
state)</p>
<p>As soon as these are separate
instructions, you have no
guarantees that the state that
the stackmap reports is sound
for the point at which the div
would trap. <br>
</p>
</div>
</div>
</blockquote>
This is the closest to what I'd propose,
except that the two calls would be
merged into a single patchpoint. Isn't
the entire point of a patchpoint to
record the stack map for a call? <span class="Apple-converted-space"> </span></div>
</div>
</span></blockquote>
</div>
<p>No. It would be bad if that's what the
documentation says. That's not at all how
WebKit uses it or probably any IC client would
use it.</p>
<p>Patchpoints are designed to be inline assembly
on steroids. They're there to allow the client
JIT to tell LLVM to piss off.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing: normal;
line-height: normal; orphans: auto;
text-align: start; text-indent: 0px;
text-transform: none; white-space: normal;
widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>(Well, ignoring the actual patching
part..) Why not write this as:<br>
patchpoint(..., trapping.div, a, b);<br>
<br>
Is there something I'm missing here?<br>
<br>
Just to note: I fully agree that the
two call proposal is unsound and
should be avoided. <br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>So, the division itself
shouldn't be a trapping
instruction and instead you
want to detect the bad case
with a branch.</p>
<p>To be clear:</p>
<p>- Whether you use
deoptimization for division or
anything else - like WebKit
has done since before any of
the Graal papers were written
- is mostly unrelated to how
you represent the division,
unless you wanted to add a new
intrinsic that is like a
trapping-division-with-stackmap:</p>
<p>result = call
@trapping.div.with.stackmap(a,
b, ... all of my state ...)</p>
<p>Now, maybe you do want such
an intrinsic, in which case
you should propose it! <br>
</p>
</div>
</div>
</blockquote>
Given what we already have with
patchpoints, I don't think a merged
intrinsic is necessary. (See above).
I believe we have all the parts to
build this solution, and that - in
theory - they should compose neatly.<br>
<br>
p.s. The bit I was referring to was
not deopt per se. It was specifically
which set of deopt state you use for
each deopt point. That's a bit of a
tangent for the rest of the discussion
now, though. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>The reason why I haven't
proposed it is that I think
that long-term, the currently
proposed intrinsics are a
better path to getting the
trapping optimizations. See
my previous mail, where I show
how we could tell LLVM what
the failing path is (which may
have deoptimization code that
uses a stackmap or whatever),
what the trapping predicate is
(it comes from the safe.div
intrinsic), and the fact that
trapping is wise (branch
weights).</p>
<p>- If you want to do the
deoptimization with a trap,
then your only choice
currently is to use a
patchpoint for the main path
of the division. This will be
slower than using a branch to
an OSR exit basic block,
because you're making the
division itself opaque to the
optimizer (bad) just to get
rid of a branch (which was
probably cheap to begin with).</p>
<p>Hence, what you want to do -
one way or another, regardless
of whether this proposed
intrinsic is added - is to
branch on the corner case
condition, and have the slow
case of the branch go to a
basic block that deoptimizes.
Unless of course you have
profiling that says that the
case does happen often, in
which case you can have that
basic block handle the corner
case inline without leaving
optimized code (FWIW, we do
have such paths in WebKit and
they are useful).</p>
<p>So the question for me is
whether the branching involves
explicit control flow or is
hidden inside an intrinsic. I
prefer for it to be within an
intrinsic because it:</p>
<p>- allows the optimizer to do
more interesting things in the
common cases, like hoisting
the entire division.</p>
<p>- will give us a clearer path
for implementing trapping
optimizations in the future.</p>
<p>- is an immediate win on ARM.</p>
<p>I'd be curious to hear what
specific idea you have about
how to implement trap-based
deoptimization with your
trapping division intrinsic
for x86 - maybe it's different
from the two "bad" idioms I
showed above.</p>
</div>
</div>
</blockquote>
I hope my explanation above helps. If
not, ask, and I'll try to explain more
clearly. </div>
</div>
</span></blockquote>
</div>
<p>I think I understand it. I think that the
only issue is that:</p>
<p>- Patchpoints currently don't do what you
want.</p>
<p>- If you made patchpoints do what you want
then you'd break WebKit - not to mention
anyone who wants to use them for inline
caches.</p>
<p>So it seems like you want a new intrinsic.
You should officially propose this new
intrinsic, particularly since the core
semantic differences are not so great from
what we have now. OTOH, if you truly believe
that patchpoints should just be changed to
your semantics in a way that does break
WebKit, then that's probably also something
you should get off your chest. ;-)</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing:
normal; line-height: normal; orphans:
auto; text-align: start; text-indent: 0px;
text-transform: none; white-space: normal;
widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
One point just for clarity; I don't
believe this affects the conclusions
of our discussion so far. I'm also
fairly sure that you (Filip) are
aware of this, but I want to spell it
out for other readers. <br>
<br>
You seem to be assuming that
compiled code needs to explicitly
branch to a point where deopt state
is known to exit a compiled frame.</div>
</div>
</span></blockquote>
</div>
<p>This is a slightly unclear characterization
of my assumptions. Our JIT does
deoptimization without explicit branches for
many, many things. You should look at it
some time, it's pretty fancy. ;-)</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant:
normal; font-weight: normal;
letter-spacing: normal; line-height:
normal; orphans: auto; text-align:
start; text-indent: 0px; text-transform:
none; white-space: normal; widows: auto;
word-spacing: 0px;
-webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div>Worth noting is that you can
also exit a compiled frame on a
trap (without an explicit
condition/branch!) if the deopt
state is known at the point you
take the trap. This "exit frame
on trap" behavior shows up with
null pointer exceptions as well.
I'll note that both compilers in
OpenJDK support some combination
of "exit-on-trap" conditions for
division and null dereferences.
The two differ on exactly what
strategies they use in each case
though. :)</div>
</div>
</span></blockquote>
</div>
<p>Yeah, and I've also implemented VMs that
do this - and I endorse the basic idea. I
know what you want, and my only point is
that the existing patchpoints only give
you this if you're willing to make a huge
compromise: namely, that you're willing to
make the division (or heap load for the
null case) completely opaque to the
compiler to the point that GVN, LICM,
SCCP, and all algebraic reasoning have to
give up on optimizing it. The point of
using LLVM is that it can optimize code.
It can optimize branches and divisions
pretty well. So, eliminating an explicit
branch by replacing it with a construct
that appears opaque to the optimizer is
not a smart trade-off.</p>
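<p>To make the branch-based alternative concrete, here is a rough sketch in Python, assuming 32-bit signed operands. The name checked_div and its (quotient, needs_slow_path) return shape are made up for illustration; a real JIT would branch to an OSR exit block rather than return a flag.</p>

```python
INT32_MIN = -(2 ** 31)


def checked_div(a, b):
    """Return (quotient, needs_slow_path) for 32-bit signed division.

    The two corner cases that would trap on x86 (divide by zero and
    INT32_MIN / -1) are detected with an explicit branch instead.
    """
    if b == 0 or (a == INT32_MIN and b == -1):
        # Slow path: stands in for the jump to a deoptimizing block.
        return None, True
    # Fast path: truncate toward zero, as a hardware divide does.
    quotient = abs(a) // abs(b)
    if a * b != abs(a * b):
        # Operands have opposite signs, so the result is negative.
        quotient = -quotient
    return quotient, False
```

<p>The division on the fast path stays an ordinary operation that an optimizer can reason about, which is the trade-off being argued for here.</p>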
<p>You could add a new intrinsic that, like
patchpoints, records the layout of state
in a side-table, but that is used as a
kind of wrapper for operations that LLVM
understands. This may or may not be hairy
- you seem to have sort of acknowledged
that it's got some complexity and I've
also pointed out some possible issues. If
this is something that you want, you
should propose it so that others know what
you're talking about. One danger of how
we're discussing this right now is that
you're overloading patchpoints to mean the
thing you want them to mean rather than
what they actually mean, which makes it
seem like we don't need Michael's
intrinsics on the grounds that patchpoints
already offer a solution. They don't
already offer a solution precisely because
patchpoints don't do what your intrinsics
would do.</p>
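<p>For readers unfamiliar with the side-table idea, a toy model follows. It assumes a table keyed by the code offset of the instruction that may trap. The names record_state and state_at_trap, and the ("reg" or "stack", index) location encoding, are hypothetical and are not LLVM's stack map format.</p>

```python
# Hypothetical side table: the code offset of an instruction that may trap
# maps to where each piece of bytecode state lives at that instruction.
side_table = {}


def record_state(code_offset, locations):
    """Remember where each named value lives at code_offset."""
    side_table[code_offset] = dict(locations)


def state_at_trap(trap_offset, registers, stack):
    """Rebuild the named state for a trap taken at trap_offset."""
    record = side_table.get(trap_offset)
    if record is None:
        raise LookupError("no deopt state recorded at offset %d" % trap_offset)
    state = {}
    for name, (kind, index) in record.items():
        state[name] = registers[index] if kind == "reg" else stack[index]
    return state


# Usage: the record is attached to the potentially trapping operation itself.
record_state(0x40, {"x": ("reg", 0), "y": ("stack", 2)})
recovered = state_at_trap(0x40, [10, 11], [0, 0, 99])
```

<p>In this toy model the lookup is keyed by the trapping instruction's own offset, so a record attached to a separate instruction is simply not found. That is a simplified picture of why the state has to be tied to the operation that traps.</p>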
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0, 0);
font-family: Helvetica, Arial;
font-size: 13px; font-style: normal;
font-variant: normal; font-weight:
normal; letter-spacing: normal;
line-height: normal; orphans: auto;
text-align: start; text-indent: 0px;
text-transform: none; white-space:
normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
I'm not really arguing that
either scheme is "better" in all
cases. I'm simply arguing that
we should support both and allow
optimization and tuning between
them. As far as I can tell, you
seem to be assuming that an
explicit branch to known exit
point is always superior.<br>
<br>
<br>
Ok, back to the topic at hand...<br>
<br>
With regard to the current
proposal, I'm going to take a
step back. You guys seem to
have already looked into this in a
fair amount of depth. I'm not
necessarily convinced you've
come to the best solution, but
at some point, we need to make
forward progress. What you have
is clearly better than nothing. <br>
<br>
Please go ahead and submit your
current approach. We can come
back and revise later if we
really need to. <br>
<br>
I do request the following
changes:<br>
- Mark it clearly as
experimental.</div>
</div>
</span></blockquote>
</div>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant:
normal; font-weight: normal;
letter-spacing: normal; line-height:
normal; orphans: auto; text-align:
start; text-indent: 0px;
text-transform: none; white-space:
normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255,
255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
- Either don't specify the
value computed in the edge
cases, or allow those values
to be specified as constant
arguments to the call. This
allows efficient lowering to
x86's div instruction if you
want to make use of the
trapping semantics. </div>
</div>
</span></blockquote>
</div>
<p>Once again: how would you use this to
get trapping semantics without
throwing all of LLVM's optimizations
out the window, in the absence of the
kind of patchpoint-like intrinsic that
you want? I ask just to make sure
that we're on the same page.</p>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant:
normal; font-weight: normal;
letter-spacing: normal; line-height:
normal; orphans: auto; text-align:
start; text-indent: 0px;
text-transform: none; white-space:
normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255,
255);"><span>
<div text="#000000" bgcolor="#FFFFFF">
<div><br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>Finally, as for
performance data,
which part of this do
you want performance
data for? I concede
that I don't have
performance data for
using Michael's new
intrinsic. Part of
what the intrinsic
accomplishes is it
gives a less ugly way
of doing something
that is already
possible with target
intrinsics on ARM. I
think it would be
great if you could get
those semantics -
along with a
known-good
implementation - on
other architectures as
well.</p>
</div>
</div>
</blockquote>
I would be very interested in
seeing data comparing two
schemes:<br>
- Explicit control flow emitted
by the frontend<br>
- The safe.div intrinsic
emitted by the frontend,
desugared in CodeGenPrep<br>
<br>
My strong suspicion is that
each would perform well in
some cases and not in others,
at least on x86. Since the
edge-checks are essentially
free on ARM, the second scheme
would probably be strictly
superior there. <br>
<br>
I am NOT asking that we block
submission on this data
however. <br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local" type="cite">
<div>
<div>
<p>But this discussion
has also involved
suggestions that we
should use trapping to
implement
deoptimization, and
the main point of my
message is to strongly
argue against anything
like this given the
current state of
trapping semantics and
how patchpoints work.
My point is that
using traps for
division corner cases
would either be
unsound (see the
stackmap after the
trap, above), or would
require you to do
things that are
obviously inefficient.
If you truly believe
that the branch to
detect division slow
paths is more
expensive than
spilling all bytecode
state to the stack or
using a patchpoint for
the division, then I
could probably hack
something up in WebKit
to show you the
performance
implications. (Or you
could do it yourself,
the code is open
source...)</p>
</div>
</div>
</blockquote>
In a couple of months, I'll
probably have the performance
data to discuss this for
real. When that happens,
let's pick this up and
continue the debate.
Alternatively, if you want to
chat this over more with a
beer in hand at the social
next week, let me know. In
the meantime, let's not stall
the current proposal any
more. <br>
<br>
Philip<br>
<br>
</div>
</div>
</span></blockquote>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
<br>
</div></blockquote></body></html>