<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Filip,<br>
<br>
Thanks for spelling this out. I'm out of time to respond right now,
but I'll try to get back to you later today or tomorrow morning.
After a quick read-through, I don't think you've changed my opinion,
but I need to read through what you wrote more carefully before
responding.<br>
<br>
Philip<br>
<br>
<div class="moz-cite-prefix">On 04/29/2014 12:39 PM, Filip Pizlo
wrote:<br>
</div>
<blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"
type="cite">
<style>body{font-family:Helvetica,Arial;font-size:13px}</style>
<div id="bloop_customfont"
style="font-family:Helvetica,Arial;font-size:13px; color:
rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br>
</div>
<br>
<p style="color:#000;">On April 29, 2014 at 11:27:06 AM, Philip
Reames (<a moz-do-not-send="true"
href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:</p>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0, 0,
0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight: normal;
letter-spacing: normal; line-height: normal; orphans: auto;
text-align: start; text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing: 0px;
-webkit-text-stroke-width: 0px; background-color: rgb(255,
255, 255);"><span>
<div bgcolor="#FFFFFF" text="#000000">
<div>
<div class="moz-cite-prefix"><br
class="Apple-interchange-newline">
On 04/29/2014 10:44 AM, Filip Pizlo wrote:<br>
</div>
<blockquote
cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"
type="cite">
<div id="bloop_customfont" style="font-family:
Helvetica, Arial; font-size: 13px; color: rgb(0, 0,
0); margin: 0px;">TL;DR: Your desire to use trapping
on x86 only further convinces me that Michael's
proposed intrinsics are the best way to go.</div>
</blockquote>
I'm still not convinced, but am not going to actively
oppose it either. I'm leery of designing a solution
with major assumptions we don't have data to back up. <br>
<br>
I worry your assumptions about deoptimization are
potentially unsound. But I don't have data to actually
show this (yet).</div>
</div>
</span></blockquote>
</div>
<p>I *think* I may have been unclear about my assumptions; in
particular, my claims with respect to deoptimization are
probably more subtle than they appeared. WebKit can use LLVM,
it has divisions, and we apply every deoptimization and
profiling trick we can to them, so this is grounded
in experience. Forgive me if the rest of this e-mail contains a
lecture on things that are obvious - I'll try to err on the side
of clarity and completeness since this discussion is
sufficiently dense that we run the risk of talking at
cross-purposes unless some baseline assumptions are established.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color: rgb(0,
0, 0); font-family: Helvetica, Arial; font-size: 13px;
font-style: normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height: normal;
orphans: auto; text-align: start; text-indent: 0px;
text-transform: none; white-space: normal; widows: auto;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div bgcolor="#FFFFFF" text="#000000">
<div><br>
<br>
<blockquote
cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"
type="cite"><br>
<p style="color: rgb(0, 0, 0);">On April 29, 2014 at
10:09:49 AM, Philip Reames (<a
moz-do-not-send="true"
href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)
wrote:</p>
<div>
<blockquote type="cite" class="clean_bq"
style="color: rgb(0, 0, 0); font-family:
Helvetica, Arial; font-size: 13px; font-style:
normal; font-variant: normal; font-weight:
normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start;
text-indent: 0px; text-transform: none;
white-space: normal; widows: auto; word-spacing:
0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);">
<div bgcolor="#FFFFFF" text="#000000">
<div><span>As the discussion has progressed
and I've spent more time thinking about
the topic, I find myself less and less
enthused about the current proposal. I'm
in full support of having idiomatic ways
to express safe division, but I'm starting
to doubt that using an intrinsic is the
right way at the moment.<br>
<br>
One case I find myself thinking about is
how one would combine profiling
information and implicit
div-by-zero/overflow checks with this
proposal. I don't really see a clean
way. Ideally, for a "safe div" which
never has the exceptional paths taken,
you'd like to completely do away with the
control flow entirely. (And rely on
hardware traps w/exceptions instead.) I
don't really see a way to represent that
type of construct given the current
proposal. </span></div>
</div>
</blockquote>
</div>
<p>This is a deeper problem and to solve it you'd
need a solution to trapping in general. Let's
consider the case of Java. A Java program may
want to catch the arithmetic exception due to
divide by zero. How would you do this with a trap
in LLVM IR? Spill all state that is live at the
catch? Use a patchpoint for the entire division
instruction?</p>
</blockquote>
We'd likely use something similar to a patchpoint.
You'd need the "abstract vm state" (which is not the
compiled frame necessarily) available at the div
instruction. You could then re-enter the interpreter
at the specified index (part of the vm state). We
have most of these mechanisms in place. Ideally,
you'd trigger a recompile and otherwise ensure
re-entry into compiled code at the soonest possible
moment. <br>
<br>
This requires a lot of runtime support, but we already
have most of it implemented for another compiler.
From our perspective, the runtime requirements are not
a major blocker. </div>
</div>
</span></blockquote>
</div>
<p>Right, you'll use a patchpoint. That's way more expensive
than using a safe division intrinsic with branches, because
it's opaque to the optimizer.</p>
<div>
<div>
<blockquote type="cite" class="clean_bq" style="color:
rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:
13px; font-style: normal; font-variant: normal;
font-weight: normal; letter-spacing: normal; line-height:
normal; orphans: auto; text-align: start; text-indent:
0px; text-transform: none; white-space: normal; widows:
auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(255, 255, 255);"><span>
<div bgcolor="#FFFFFF" text="#000000">
<div>
<blockquote
cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"
type="cite">
<p><br class="Apple-interchange-newline">
In a lot of languages, a divide produces some
result even in the exceptional case and this
result requires effectively deoptimizing since
the result won't be the one you would have
predicted (double instead of int, or BigInt
instead of small int), which sort of means that
if the CPU exception occurs you have to be able
to reconstruct all state. A patchpoint could do
this, and so could spilling all state to the
stack before the divide - but both are very
heavy hammers that are sure to be more expensive
than just doing a branch.</p>
</blockquote>
This isn't necessarily as expensive as you might
believe. I'd recommend reading the Graal project
papers on this topic.<br>
<br>
Whether deopt or branching is more profitable *in
this case*, I can't easily say. I'm not yet to the
point of being able to run that experiment. We can
argue about what "should" be better all we want, but
real performance data is the only way to truly
know. </div>
</div>
</span></blockquote>
</div>
<p>My point may have been confusing. I know that
deoptimization is cheap and WebKit uses it everywhere,
including division corner cases, if profiling tells us that
it's profitable to do so (which it does, in the common
case). WebKit is a heavy user of deoptimization in general,
so you don't need to convince me that it's worth it.</p>
<p>Note that I want *both* deopt *and* branching, because in
this case, a branch is the fastest overall way of detecting
when to deopt. In the future, I will want to implement the
deopt in terms of branching, and when we do this, I believe
that the most sound and performant approach would be using
Michael's intrinsics. This is subtle and I'll try to
explain why it's the case.</p>
<p>The point is that you wouldn't want to do deoptimization by
spilling state on the main path or by using a patchpoint for
the main path of the division.</p>
<p>You don't want the common path of executing the division to
involve a patchpoint instruction, although using a
patchpoint or stackmap to implement deoptimization on the
failing path is great:</p>
<p><b>Good: </b>if (division would fail) { call
@patchpoint(all of my state) } else { result = a / b }</p>
<p><b>Bad: </b>call @patchpoint(all of my state) // patch
with a divide instruction - bad because the optimizer has no
clue what you're doing and assumes the very worst</p>
<p><b>Worse: </b>spill all state to the stack; call
@trapping.div(a, b) // the spills will hurt you far more
than a branch, so this should be avoided</p>
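<p>A minimal sketch of the "Good" idiom, written in Python purely
for readability. Everything named here (Deopt, guarded_div,
division_would_fail) is hypothetical and only models the shape of
the control flow: the state capture that @patchpoint would perform
happens on the failing path only, and the fast path is an ordinary
divide.</p>
<pre>
```python
INT32_MIN = -2**31


class Deopt(Exception):
    """Hypothetical OSR exit carrying the state captured on the slow path."""

    def __init__(self, state):
        super().__init__("deopt")
        self.state = state


def division_would_fail(a, b):
    # int32 corner cases: divide by zero, and INT32_MIN / -1 overflow.
    return b == 0 or (a == INT32_MIN and b == -1)


def trunc_div(a, b):
    # Truncate toward zero like a hardware divide (Python's // floors).
    q = abs(a) // abs(b)
    signs_differ = (a != abs(a)) != (b != abs(b))
    return -q if signs_differ else q


def guarded_div(a, b, state):
    if division_would_fail(a, b):
        # Slow path only: stands in for "call @patchpoint(all of my state)".
        raise Deopt(dict(state))
    # Fast path: a plain divide with nothing opaque around it.
    return trunc_div(a, b)
```
</pre>
<p>In this model the cost on the common path is one predictable
branch; the state dictionary is only copied when the exit is
actually taken.</p>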
<p>I suppose we could imagine a fourth option that involves a
patchpoint to pick up the state and a trapping divide
intrinsic. But a trapping divide intrinsic alone is not
enough. Consider this:</p>
<p>result = call @trapping.div(a, b); call @stackmap(all of my
state)</p>
<p>As soon as these are separate instructions, you have no
guarantees that the state that the stackmap reports is sound
for the point at which the div would trap. So, the division
itself shouldn't be a trapping instruction and instead you
want to detect the bad case with a branch.</p>
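<p>A toy model of that hazard, again in Python and under stated
assumptions: the register names, the STACKMAP table, and the two
schedules are all invented for illustration. The stackmap is
treated as static metadata saying "bytecode variable i lives in
r3". Because the load of i and the divide are independent, a
scheduler may legally place the load on either side of the divide,
as long as it precedes the stackmap.</p>
<pre>
```python
class Trap(Exception):
    """Models the hardware trap raised by a divide by zero."""


def load_i(regs):
    # Materialize bytecode variable i in register r3.
    regs["r3"] = regs["mem_i"]


def div(regs):
    if regs["r2"] == 0:
        raise Trap()
    regs["r0"] = regs["r1"] // regs["r2"]


# Metadata attached to the stackmap that follows the divide.
STACKMAP = {"i": "r3"}


def recover(schedule, regs):
    """Run the schedule; on a trap, read state through the stackmap."""
    try:
        for op in schedule:
            op(regs)
    except Trap:
        return {name: regs[loc] for name, loc in STACKMAP.items()}
    return None
```
</pre>
<p>Starting from r1 = 10, r2 = 0, r3 = "stale" and mem_i = 42,
the schedule [load_i, div] recovers i as 42, while the equally
legal schedule [div, load_i] recovers the stale contents of r3,
because the trap fires before the location named by the stackmap
has been written.</p>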
<p>To be clear:</p>
<p>- Whether you use deoptimization for division or anything
else - like WebKit has done since before any of the Graal
papers were written - is mostly unrelated to how you
represent the division, unless you wanted to add a new
intrinsic that is like a trapping-division-with-stackmap:</p>
<p>result = call @trapping.div.with.stackmap(a, b, ... all of
my state ...)</p>
<p>Now, maybe you do want such an intrinsic, in which case you
should propose it! The reason why I haven't proposed it is
that I think that long-term, the currently proposed
intrinsics are a better path to getting the trapping
optimizations. See my previous mail, where I show how we
could tell LLVM what the failing path is (which may have
deoptimization code that uses a stackmap or whatever), what
the trapping predicate is (it comes from the safe.div
intrinsic), and the fact that trapping is wise (branch
weights).</p>
<p>- If you want to do the deoptimization with a trap, then
your only choice currently is to use a patchpoint for the
main path of the division. This will be slower than using a
branch to an OSR exit basic block, because you're making the
division itself opaque to the optimizer (bad) just to get
rid of a branch (which was probably cheap to begin with).</p>
<p>Hence, what you want to do - one way or another, regardless
of whether this proposed intrinsic is added - is to branch
on the corner case condition, and have the slow case of the
branch go to a basic block that deoptimizes. Unless of
course you have profiling that says that the case does
happen often, in which case you can have that basic block
handle the corner case inline without leaving optimized code
(FWIW, we do have such paths in WebKit and they are useful).</p>
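<p>The profile-driven choice described above can be sketched as
follows. This is a Python model with hypothetical names (make_div,
slow_div, the corner_case_is_common flag), and it assumes a
JavaScript-like language in which the corner cases produce a
double; it is not WebKit's actual code.</p>
<pre>
```python
import math

INT32_MIN = -2**31


class Deopt(Exception):
    """Hypothetical OSR exit out of optimized code."""


def is_corner_case(a, b):
    return b == 0 or (a == INT32_MIN and b == -1)


def fast_div(a, b):
    # Truncating int32 divide; only called when no corner case applies.
    q = abs(a) // abs(b)
    signs_differ = (a != abs(a)) != (b != abs(b))
    return -q if signs_differ else q


def slow_div(a, b):
    # Corner-case results as doubles (assumed JavaScript-like semantics).
    if b == 0:
        return math.nan if a == 0 else math.copysign(math.inf, a)
    return float(2**31)  # INT32_MIN / -1 does not fit in int32


def make_div(corner_case_is_common):
    """Specialize the divide on a profile bit."""
    def div(a, b):
        if not is_corner_case(a, b):
            return fast_div(a, b)
        if corner_case_is_common:
            return slow_div(a, b)  # handled inline, stays in optimized code
        raise Deopt()              # rare case: leave optimized code
    return div
```
</pre>
<p>Both specializations share the same branch on the corner-case
predicate; the profile only decides what the slow block does.</p>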
<p>So the question for me is whether the branching involves
explicit control flow or is hidden inside an intrinsic. I
prefer for it to be within an intrinsic because it:</p>
<p>- allows the optimizer to do more interesting things in the
common cases, like hoisting the entire division.</p>
<p>- will give us a clearer path for implementing trapping
optimizations in the future.</p>
<p>- is an immediate win on ARM.</p>
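<p>To illustrate the hoisting point, here is a rough Python model.
It assumes the intrinsic never traps and yields a (quotient,
failed) pair; that shape, the 0 returned on failure, and the names
safe_sdiv and sum_scaled are assumptions of this sketch rather than
the proposal's specification.</p>
<pre>
```python
INT32_MIN = -2**31


class Deopt(Exception):
    """Hypothetical OSR exit out of optimized code."""


def safe_sdiv(a, b):
    """Model of a non-trapping divide: returns (quotient, failed)."""
    if b == 0 or (a == INT32_MIN and b == -1):
        return 0, True  # quotient on failure is an assumed placeholder
    q = abs(a) // abs(b)
    signs_differ = (a != abs(a)) != (b != abs(b))
    return (-q if signs_differ else q), False


def sum_scaled(values, a, b):
    # The divide is loop-invariant and cannot trap, so it can be
    # computed once before the loop even if the loop never runs.
    q, failed = safe_sdiv(a, b)
    total = 0
    for v in values:
        if failed:
            raise Deopt()  # the check stays where the divide used to be
        total += v + q
    return total
```
</pre>
<p>With a trapping divide the same hoist would be illegal: for an
empty list of values and b equal to 0, the original loop never
divides, but the hoisted trapping divide would fault.</p>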
<p>I'd be curious to hear what specific idea you have about
how to implement trap-based deoptimization with your
trapping division intrinsic for x86 - maybe it's different
from the two "bad" idioms I showed above.</p>
<p>Finally, as for performance data, which part of this do you
want performance data for? I concede that I don't have
performance data for using Michael's new intrinsic. Part of
what the intrinsic accomplishes is it gives a less ugly way
of doing something that is already possible with target
intrinsics on ARM. I think it would be great if you could
get those semantics - along with a known-good implementation
- on other architectures as well.</p>
<p>But this discussion has also involved suggestions that we
should use trapping to implement deoptimization, and the
main point of my message is to strongly argue against
anything like this given the current state of trapping
semantics and how patchpoints work. My point is that using
traps for division corner cases would either be unsound (see
the stackmap after the trap, above), or would require you to
do things that are obviously inefficient. If you truly
believe that the branch to detect division slow paths is
more expensive than spilling all bytecode state to the stack
or using a patchpoint for the division, then I could
probably hack something up in WebKit to show you the
performance implications. (Or you could do it yourself, the
code is open source...)</p>
<p>-Filip</p>
</div>
</div>
</blockquote>
<br>
</body>
</html>