<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<div class="moz-cite-prefix">On 04/29/2014 12:39 PM, Filip Pizlo
wrote:<br>
</div>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<style>body{font-family:Helvetica,Arial;font-size:13px}</style>On
April 29, 2014 at 11:27:06 AM, Philip Reames (<a
moz-do-not-send="true" href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0,
0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight: normal;
letter-spacing: normal; line-height: normal; orphans: auto;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color: rgb(255,
255, 255);"><span>
<div bgcolor="#FFFFFF" text="#000000">
<div>
<div class="moz-cite-prefix">On 04/29/2014 10:44 AM,
Filip Pizlo wrote:<br>
</div>
<blockquote
cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"
type="cite">
<div id="bloop_customfont" style="font-family:
Helvetica, Arial; font-size: 13px; color: rgb(0, 0,
0); margin: 0px;">LD;DR: Your desire to use trapping
on x86 only further convinces me that Michael's
proposed intrinsics are the best way to go.</div>
</blockquote>
I'm still not convinced, but am not going to actively
oppose it either. I'm leery of designing a solution
with major assumptions we don't have data to backup. <br>
<br>
I worry your assumptions about deoptimization are
potentially unsound. But I don't have data to actually
show this (yet).</div>
</div>
</span></blockquote>
</div>
<p>I *think* I may have been unclear about my assumptions; in
particular, my claims with respect to deoptimization are
probably more subtle than they appeared. WebKit can use LLVM
and it has divisions and we do all possible
deoptimization/profiling/etc tricks for it, so this is grounded
in experience. Forgive me if the rest of this e-mail contains a
lecture on things that are obvious - I'll try to err on the side
of clarity and completeness since this discussion is
sufficiently dense that we run the risk of talking
cross-purposes unless some baseline assumptions are established.</p>
</blockquote>
I think we're using the same terminology, but with slightly
different sets of assumptions. I'll point this out below where
relevant. <br>
<br>
Also, thanks for taking the time to expand. It help clarify the
discussion quite a bit. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0,
0, 0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height: normal;
orphans: auto; text-align: start; text-indent: 0px;
text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div bgcolor="#FFFFFF" text="#000000">
<div><br>
<br>
<blockquote
cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"
type="cite"><br>
<p style="color: rgb(0, 0, 0);">On April 29, 2014 at
10:09:49 AM, Philip Reames (<a
moz-do-not-send="true"
href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:</p>
<div>
<blockquote type="cite" class="clean_bq"
style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);">
<div bgcolor="#FFFFFF" text="#000000">
<div><span>As the discussion has progressed
and I've spent more time thinking about
the topic, I find myself less and less
enthused about the current proposal. I'm
in full support of having idiomatic ways
to express safe division, but I'm starting
to doubt that using an intrinsic is the
right way at the moment.<br>
<br>
One case I find myself thinking about is
how one would combine profiling
information and implicit
div-by-zero/overflow checks with this
proposal. I don't really see a clean
way. Ideally, for a "safe div" which
never has the exceptional paths taken,
you'd like to completely do away with the
control flow entirely. (And rely on
hardware traps w/exceptions instead.) I
don't really see a way to represent that
type of construct given the current
proposal. </span></div>
</div>
</blockquote>
</div>
<p>This is a deeper problem and to solve it you'd
need a solution to trapping in general. Let's
consider the case of Java. A Java program may
want to catch the arithmetic exception due to
divide by zero. How would you do this with a trap
in LLVM IR? Spill all state that is live at the
catch? Use a patchpoint for the entire division
instruction?</p>
</blockquote>
We'd likely use something similar to a patchpoint.
You'd need the "abstract vm state" (which is not the
compiled frame necessarily) available at the div
instruction. You could then re-enter the interpreter
at the specified index (part of the vm state). We
have all most of these mechanisms in place. Ideally,
you'd trigger a recompile and otherwise ensure
re-entry into compiled code at the soonest possible
moment. <br>
<br>
This requires a lot of runtime support, but we already
have most of it implemented for another compiler.
From our perspective, the runtime requirements are not
a major blocker. </div>
</div>
</span></blockquote>
</div>
<p>Right, you'll use a patchpoint. That's way more expensive
than using a safe division intrinsic with branches, because
it's opaque to the optimizer.</p>
</div>
</blockquote>
This statement is true at the moment, but it shouldn't be. I think
this is our fundamental difference in approach. <br>
<br>
You should be able to write something like:<br>
i32 %res = invoke patchpoint (... x86_trapping_divide, a, b)
normal_dest invoke_dest<br>
<br>
normal_dest:<br>
;; use %res<br>
invoke_dest:<br>
landingpad<br>
;; dispatch edge cases<br>
;; this could be unreachable code if you deopt this frame in the
trap handler and jump directly to an interpreter or other bit of
code<br>
<br>
A patchpoint should not require any excess spilling. If values are
live in registers, that should be reflected in the stack map. (I do
not know if this is the case for patchpoint at the moment or not.)<br>
<br>
The Value called by a patchpoint should participate in optimization
normally. We really want the patchpoint part of the call to be
supplemental. It should still be a call. It should be constant
propagated, transformed, etc.. This is not the case currently.
I've got a couple of off the wall ideas for improving the current
status, but I'll agree this is a hardish problem. <br>
<br>
It should be legal to use a patchpoint in an invoke. It's an ABI
issue of how the invoke path gets invoked. (i.e. side tables for
the runtime to lookup, etc..) This is not possible today, and
probably requires a fair amount of work. Some of it, I've already
done and will be sharing shortly. Other parts, I haven't even
thought about. <br>
<br>
If you didn't want to use the trapping semantics, you'd insert
dedicated control flow _before_ the divide. This would allow normal
optimization of the control flow. <br>
<br>
Notes:<br>
1) This might require a new PATCHPOINT pseudo op in the backend.
Haven't thought much about that yet.<br>
2) I *think* your current intrinsic could be translated into
something like this. (Leaving aside the question of where the deopt
state comes from.) In fact, the more I look at this, the less
difference I actually see between the approaches. <br>
<br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:
13px; font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start; text-indent:
0px; text-transform: none; white-space: normal; widows:
auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div bgcolor="#FFFFFF" text="#000000">
<div>
<blockquote
cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"
type="cite">
<p><br class="Apple-interchange-newline">
In a lot of languages, a divide produces some
result even in the exceptional case and this
result requires effectively deoptimizing since
the resut won't be the one you would have
predicted (double instead of int, or BigInt
instead of small int), which sort of means that
if the CPU exception occurs you have to be able
to reconstruct all state. A patchpoint could do
this, and so could spilling all state to the
stack before the divide - but both are very
heavy hammers that are sure to be more expensive
than just doing a branch.</p>
</blockquote>
This isn't necessarily as expensive as you might
believe. I'd recommend reading the Graal project
papers on this topic.<br>
<br>
Whether deopt or branching is more profitable *in
this case*, I can't easily say. I'm not yet to the
point of being able to run that experiment. We can
argue about what "should" be better all we want, but
real performance data is the only way to truly
know. </div>
</div>
</span></blockquote>
</div>
<p>My point may have been confusing. I know that
deoptimization is cheap and WebKit uses it everywhere,
including division corner cases, if profiling tells us that
it's profitable to do so (which it does, in the common
case). WebKit is a heavy user of deoptimization in general,
so you don't need to convince me that it's worth it.</p>
</div>
</div>
</blockquote>
Acknowledged. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>Note that I want *both* deopt *and* branching, because in
this case, a branch is the fastest overall way of detecting
when to deopt. In the future, I will want to implement the
deopt in terms of branching, and when we do this, I believe
that the most sound and performat approach would be using
Michael's intrinsics. This is subtle and I'll try to
explain why it's the case.</p>
<p>The point is that you wouldn't want to do deoptimization by
spilling state on the main path or by using a patchpoint for
the main path of the division.</p>
</div>
</div>
</blockquote>
This is the main point I disagree with. I don't believe that having
a patchpoint on the main path should be any more expensive then the
original call. (see above)<br>
<br>
Worth noting explicitly: I'm assuming that all of your deopt state
would already be available for other purposes in nearby code. It's
on the stack or in registers. I'm assuming that by adding the deopt
point, you are not radically changing the set of computations which
need to be done. If that's not the case, you should avoid deopt and
instead just inline the slow paths with explicit checks. <br>
<br>
I'll note that given your assumptions about the cost of a
patchpoint, the rest of your position makes a lot more sense. :)
As I spelled out above, I believe this cost is not fundamental. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>You don't want the common path of executing the division to
involve a patchpoint instruction, although using a
patchpoint or stackmap to implement deoptimization on the
failing path is great:</p>
<p><b>Good: </b>if (division would fail) { call
@patchpoint(all of my state) } else { result = a / b }</p>
</div>
</div>
</blockquote>
Given your cost assumptions, I'd agree. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p><b>Bad: </b>call @patchpoint(all of my state) // patch
with a divide instruction - bad because the optimizer has no
clue what you're doing and assumes the very worst</p>
</div>
</div>
</blockquote>
Yuck. Agreed. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p><b>Worse: </b>spill all state to the stack; call
@trapping.div(a, b) // the spills will hurt you far more
than a branch, so this should be avoided</p>
</div>
</div>
</blockquote>
I'm assuming this is an explicit spill rather than simply recording
a stack map *at the div*. If so, agreed. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>I suppose we could imagine a fourth option that involves a
patchpoint to pick up the state and a trapping divide
instrinsic. But a trapping divide intrinsic alone is not
enough. Consider this:</p>
<p>result = call @trapping.div(a, b); call @stackmap(all of my
state)</p>
<p>As soon as these are separate instructions, you have no
guarantees that the state that the stackmap reports is sound
for the point at which the div would trap. <br>
</p>
</div>
</div>
</blockquote>
This is the closest to what I'd propose, except that the two calls
would be merged into a single patchpoint. Isn't the entire point of
a patchpoint to record the stack map for a call? (Well, ignoring
the actual patching part..) Why not write this as:<br>
patchpoint(..., trapping.div, a, b);<br>
<br>
Is there something I'm missing here?<br>
<br>
Just to note: I fully agree that the two call proposal is unsound
and should be avoided. <br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>So, the division itself shouldn't be a trapping instruction
and instead you want to detect the bad case with a branch.</p>
<p>To be clear:</p>
<p>- Whether you use deoptimization for division or anything
else - like WebKit has done since before any of the Graal
papers were written - is mostly unrelated to how you
represent the division, unless you wanted to add a new
intrinsic that is like a trapping-division-with-stackmap:</p>
<p>result = call @trapping.div.with.stackmap(a, b, ... all of
my state ...)</p>
<p>Now, maybe you do want such an intrinsic, in which case you
should propose it! <br>
</p>
</div>
</div>
</blockquote>
Given what we already have with patchpoints, I don't think a merged
intrinsic is necessary. (See above). I believe we have all the
parts to build this solution, and that - in theory - they should
compose neatly.<br>
<br>
p.s. The bits I was referring to was not deopt per se. It was
particularly which set of deopt state you used for each deopt
point. That's a bit of tangent for the rest of the discussion now
though. <br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>The reason why I haven't proposed it is that I think that
long-term, the currently proposed intrinsics are a better
path to getting the trapping optimizations. See my previous
mail, where I show how we could tell LLVM what the failing
path is (which may have deoptimization code that uses a
stackmap or whatever), what the trapping predicate is (it
comes from the safe.div intrinsic), and the fact that
trapping is wise (branch weights).</p>
<p>- If you want to do the deoptimization with a trap, then
your only choice currently is to use a patchpoint for the
main path of the division. This will be slower than using a
branch to an OSR exit basic block, because you're making the
division itself opaque to the optimizer (bad) just to get
rid of a branch (which was probably cheap to begin with).</p>
<p>Hence, what you want to do - one way or another, regardless
of whether this proposed intrinsic is added - is to branch
on the corner case condition, and have the slow case of the
branch go to a basic block that deoptimizes. Unless of
course you have profiling that says that the case does
happen often, in which case you can have that basic block
handle the corner case inline without leaving optimized code
(FWIW, we do have such paths in WebKit and they are useful).</p>
<p>So the question for me is whether the branching involves
explicit control flow or is hidden inside an intrinsic. I
prefer for it to be within an intrinsic because it:</p>
<p>- allows the optimizer to do more interesting things in the
common cases, like hoisting the entire division.</p>
<p>- will give us a clearer path for implementing trapping
optimizations in the future.</p>
<p>- is an immediate win on ARM.</p>
<p>I'd be curious to hear what specific idea you have about
how to implement trap-based deoptimization with your
trapping division intrinsic for x86 - maybe it's different
from the two "bad" idioms I showed above.</p>
</div>
</div>
</blockquote>
I hope my explanation above helps. If not, ask, and I'll try to
explain more clearly. <br>
<br>
One point just for clarity; I don't believe this effects the
conclusions of our discussion so far. I'm also fairly sure that you
(Filip) are aware of this, but want to spell it out for other
readers. <br>
<br>
You seem to be assuming that compiled code needs to explicitly
branch to a point where deopt state is known to exit a compiled
frame. Worth noting is that you can also exit a compiled frame on a
trap (without an explicitly condition/branch!) if the deopt state is
known at the point you take the trap. This "exit frame on trap"
behavior shows up with null pointer exceptions as well. I'll note
that both compilers in OpenJDK support some combination of
"exit-on-trap" conditions for division and null dereferences. The
two differ on exactly what strategies they use in each case though.
:)<br>
<br>
I'm not really arguing that either scheme is "better" in all cases.
I'm simply arguing that we should support both and allow
optimization and tuning between them. As far as I can tell, you
seem to be assuming that an explicit branch to known exit point is
always superior. <br>
<br>
<br>
Ok, back to the topic at hand...<br>
<br>
With regards to the current proposal, I'm going to take a step
back. You guys seem to have already looked in this in a fair amount
of depth. I'm not necessarily convinced you've come to the best
solution, but at some point, we need to make forward progress. What
you have is clearly better than nothing. <br>
<br>
Please go ahead and submit your current approach. We can come back
and revise later if we really need to. <br>
<br>
I do request the following changes:<br>
- Mark it clearly as experimental.<br>
- Either don't specify the value computed in the edge cases, or
allow those values to be specified as constant arguments to the
call. This allows efficient lowering to x86's div instruction if
you want to make use of the trapping semantics. <br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>Finally, as for performance data, which part of this do you
want performance data for? I concede that I don't have
performance data for using Michael's new intrinsic. Part of
what the intrinsic accomplishes is it gives a less ugly way
of doing something that is already possible with target
intrinsics on ARM. I think it would be great if you could
get those semantics - along with a known-good implementation
- on other architectures as well.</p>
</div>
</div>
</blockquote>
I would be very interested in seeing data comparing two schemes:<br>
- Explicit control flow emited by the frontend<br>
- The safe.div intrinsic emitted by the frontend, desugared in
CodeGenPrep<br>
<br>
My strong suspicion is that each would preform well in some cases
and not in others. At least on x86. Since the edge-checks are
essentially free on ARM, the second scheme would probably be
strictly superior there. <br>
<br>
I am NOT asking that we block submission on this data however. <br>
<br>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<div>
<div>
<p>But this discussion has also involved suggestions that we
should use trapping to implement deoptimization, and the
main point of my message is to strongly argue against
anything like this given the current state of trapping
semantics and how patchpoints work. My point is that using
traps for division corner cases would either be unsound (see
the stackmap after the trap, above), or would require you to
do things that are obviously inefficient. If you truly
believe that the branch to detect division slow paths is
more expensive than spilling all bytecode state to the stack
or using a patchpoint for the division, then I could
probably hack something up in WebKit to show you the
performance implications. (Or you could do it yourself, the
code is open source...)</p>
</div>
</div>
</blockquote>
In a couple of months, I'll probably have the performance data to
discuss this for real. When that happens, let's pick this up and
continue the debate. Alternatively, if you want to chat this over
more with a beer in hand at the social next week, let me know. In
the meantime, let's not stall the current proposal any more. <br>
<br>
Philip<br>
<br>
</body>
</html>