<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Filip,<br>

    <br>

    Thanks for spelling this out.  I'm out of time to respond right now,

    but I'll try to get back to you later today or tomorrow morning. 

    After a quick read through, I don't think you've changed my opinion,

    but I need to read through what you wrote more carefully before

    responding.<br>

    <br>

    Philip<br>

    <br>

    <div class="moz-cite-prefix">On 04/29/2014 12:39 PM, Filip Pizlo

      wrote:<br>

    </div>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <style>body{font-family:Helvetica,Arial;font-size:13px}</style>

      <div id="bloop_customfont"

        style="font-family:Helvetica,Arial;font-size:13px; color:

        rgba(0,0,0,1.0); margin: 0px; line-height: auto;"><br>

      </div>

      <br>

      <p style="color:#000;">On April 29, 2014 at 11:27:06 AM, Philip

        Reames (<a moz-do-not-send="true"

          href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)

        wrote:</p>

      <div>

        <blockquote type="cite" class="clean_bq" style="color: rgb(0, 0,

          0); font-family: Helvetica, Arial; font-size: 13px;

          font-style: normal; font-variant: normal; font-weight: normal;

          letter-spacing: normal; line-height: normal; orphans: auto;

          text-align: start; text-indent: 0px; text-transform: none;

          white-space: normal; widows: auto; word-spacing: 0px;

          -webkit-text-stroke-width: 0px; background-color: rgb(255,

          255, 255);"><span>

            <div bgcolor="#FFFFFF" text="#000000">

              <div>

                <div class="moz-cite-prefix"><br

                    class="Apple-interchange-newline">

                  On 04/29/2014 10:44 AM, Filip Pizlo wrote:<br>

                </div>

                <blockquote

                  cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

                  type="cite">

                  <div id="bloop_customfont" style="font-family:

                    Helvetica, Arial; font-size: 13px; color: rgb(0, 0,

                    0); margin: 0px;">LD;DR: Your desire to use trapping

                    on x86 only further convinces me that Michael's

                    proposed intrinsics are the best way to go.</div>

                </blockquote>

                I'm still not convinced, but am not going to actively

                oppose it either.  I'm leery of designing a solution

                with major assumptions we don't have data to backup. <br>

                <br>

                I worry your assumptions about deoptimization are

                potentially unsound.  But I don't have data to actually

                show this (yet).</div>

            </div>

          </span></blockquote>

      </div>

      <p>I *think* I may have been unclear about my assumptions; in

        particular, my claims with respect to deoptimization are

        probably more subtle than they appeared.  WebKit can use LLVM

        and it has divisions and we do all possible

        deoptimization/profiling/etc tricks for it, so this is grounded

        in experience.  Forgive me if the rest of this e-mail contains a

        lecture on things that are obvious - I'll try to err on the side

        of clarity and completeness since this discussion is

        sufficiently dense that we run the risk of talking

        cross-purposes unless some baseline assumptions are established.</p>

      <div>

        <div>

          <blockquote type="cite" class="clean_bq" style="color: rgb(0,

            0, 0); font-family: Helvetica, Arial; font-size: 13px;

            font-style: normal; font-variant: normal; font-weight:

            normal; letter-spacing: normal; line-height: normal;

            orphans: auto; text-align: start; text-indent: 0px;

            text-transform: none; white-space: normal; widows: auto;

            word-spacing: 0px; -webkit-text-stroke-width: 0px;

            background-color: rgb(255, 255, 255);"><span>

              <div bgcolor="#FFFFFF" text="#000000">

                <div><br>

                  <br>

                  <blockquote

                    cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

                    type="cite"><br>

                    <p style="color: rgb(0, 0, 0);">On April 29, 2014 at

                      10:09:49 AM, Philip Reames (<a

                        moz-do-not-send="true"

                        href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)

                      wrote:</p>

                    <div>

                      <blockquote type="cite" class="clean_bq"

                        style="color: rgb(0, 0, 0); font-family:

                        Helvetica, Arial; font-size: 13px; font-style:

                        normal; font-variant: normal; font-weight:

                        normal; letter-spacing: normal; line-height:

                        normal; orphans: auto; text-align: start;

                        text-indent: 0px; text-transform: none;

                        white-space: normal; widows: auto; word-spacing:

                        0px; -webkit-text-stroke-width: 0px;

                        background-color: rgb(255, 255, 255);">

                        <div bgcolor="#FFFFFF" text="#000000">

                          <div><span>As the discussion has progressed

                              and I've spent more time thinking about

                              the topic, I find myself less and less

                              enthused about the current proposal.  I'm

                              in full support of having idiomatic ways

                              to express safe division, but I'm starting

                              to doubt that using an intrinsic is the

                              right way at the moment.<br>

                              <br>

                              One case I find myself thinking about is

                              how one would combine profiling

                              information and implicit

                              div-by-zero/overflow checks with this

                              proposal.  I don't really see a clean

                              way.  Ideally, for a "safe div" which

                              never has the exceptional paths taken,

                              you'd like to completely do away with the

                              control flow entirely.  (And rely on

                              hardware traps w/exceptions instead.)  I

                              don't really see a way to represent that

                              type of construct given the current

                              proposal. </span></div>

                        </div>

                      </blockquote>

                    </div>

                    <p>This is a deeper problem and to solve it you'd

                      need a solution to trapping in general.  Let's

                      consider the case of Java.  A Java program may

                      want to catch the arithmetic exception due to

                      divide by zero.  How would you do this with a trap

                      in LLVM IR?  Spill all state that is live at the

                      catch?  Use a patchpoint for the entire division

                      instruction?</p>

                  </blockquote>

                  We'd likely use something similar to a patchpoint. 

                  You'd need the "abstract vm state" (which is not the

                  compiled frame necessarily) available at the div

                  instruction.  You could then re-enter the interpreter

                  at the specified index (part of the vm state).  We

                  have all most of these mechanisms in place.  Ideally,

                  you'd trigger a recompile and otherwise ensure

                  re-entry into compiled code at the soonest possible

                  moment. <br>

                  <br>

                  This requires a lot of runtime support, but we already

                  have most of it implemented for another compiler. 

                  From our perspective, the runtime requirements are not

                  a major blocker. </div>

              </div>

            </span></blockquote>

        </div>

        <p>Right, you'll use a patchpoint.  That's way more expensive

          than using a safe division intrinsic with branches, because

          it's opaque to the optimizer.</p>

        <div>

          <div>

            <blockquote type="cite" class="clean_bq" style="color:

              rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:

              13px; font-style: normal; font-variant: normal;

              font-weight: normal; letter-spacing: normal; line-height:

              normal; orphans: auto; text-align: start; text-indent:

              0px; text-transform: none; white-space: normal; widows:

              auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;

              background-color: rgb(255, 255, 255);"><span>

                <div bgcolor="#FFFFFF" text="#000000">

                  <div>

                    <blockquote

                      cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

                      type="cite">

                      <p><br class="Apple-interchange-newline">

                        In a lot of languages, a divide produces some

                        result even in the exceptional case and this

                        result requires effectively deoptimizing since

                        the resut won't be the one you would have

                        predicted (double instead of int, or BigInt

                        instead of small int), which sort of means that

                        if the CPU exception occurs you have to be able

                        to reconstruct all state.  A patchpoint could do

                        this, and so could spilling all state to the

                        stack before the divide - but both are very

                        heavy hammers that are sure to be more expensive

                        than just doing a branch.</p>

                    </blockquote>

                    This isn't necessarily as expensive as you might

                    believe.  I'd recommend reading the Graal project

                    papers on this topic.<br>

                    <br>

                    Whether deopt or branching is more profitable *in

                    this case*, I can't easily say.  I'm not yet to the

                    point of being able to run that experiment.  We can

                    argue about what "should" be better all we want, but

                    real performance data is the only way to truly

                    know. </div>

                </div>

              </span></blockquote>

          </div>

          <p>My point may have been confusing.  I know that

            deoptimization is cheap and WebKit uses it everywhere,

            including division corner cases, if profiling tells us that

            it's profitable to do so (which it does, in the common

            case).  WebKit is a heavy user of deoptimization in general,

            so you don't need to convince me that it's worth it.</p>

          <p>Note that I want *both* deopt *and* branching, because in

            this case, a branch is the fastest overall way of detecting

            when to deopt.  In the future, I will want to implement the

            deopt in terms of branching, and when we do this, I believe

            that the most sound and performat approach would be using

            Michael's intrinsics.  This is subtle and I'll try to

            explain why it's the case.</p>

          <p>The point is that you wouldn't want to do deoptimization by

            spilling state on the main path or by using a patchpoint for

            the main path of the division.</p>

          <p>You don't want the common path of executing the division to

            involve a patchpoint instruction, although using a

            patchpoint or stackmap to implement deoptimization on the

            failing path is great:</p>

          <p><b>Good: </b>if (division would fail) { call

            @patchpoint(all of my state) } else { result = a / b }</p>

          <p><b>Bad: </b>call @patchpoint(all of my state) // patch

            with a divide instruction - bad because the optimizer has no

            clue what you're doing and assumes the very worst</p>

          <p><b>Worse: </b>spill all state to the stack; call

            @trapping.div(a, b) // the spills will hurt you far more

            than a branch, so this should be avoided</p>

          <p>I suppose we could imagine a fourth option that involves a

            patchpoint to pick up the state and a trapping divide

            instrinsic.  But a trapping divide intrinsic alone is not

            enough.  Consider this:</p>

          <p>result = call @trapping.div(a, b); call @stackmap(all of my

            state)</p>

          <p>As soon as these are separate instructions, you have no

            guarantees that the state that the stackmap reports is sound

            for the point at which the div would trap.  So, the division

            itself shouldn't be a trapping instruction and instead you

            want to detect the bad case with a branch.</p>

          <p>To be clear:</p>

          <p>- Whether you use deoptimization for division or anything

            else - like WebKit has done since before any of the Graal

            papers were written - is mostly unrelated to how you

            represent the division, unless you wanted to add a new

            intrinsic that is like a trapping-division-with-stackmap:</p>

          <p>result = call @trapping.div.with.stackmap(a, b, ... all of

            my state ...)</p>

          <p>Now, maybe you do want such an intrinsic, in which case you

            should propose it!  The reason why I haven't proposed it is

            that I think that long-term, the currently proposed

            intrinsics are a better path to getting the trapping

            optimizations.  See my previous mail, where I show how we

            could tell LLVM what the failing path is (which may have

            deoptimization code that uses a stackmap or whatever), what

            the trapping predicate is (it comes from the safe.div

            intrinsic), and the fact that trapping is wise (branch

            weights).</p>

          <p>- If you want to do the deoptimization with a trap, then

            your only choice currently is to use a patchpoint for the

            main path of the division.  This will be slower than using a

            branch to an OSR exit basic block, because you're making the

            division itself opaque to the optimizer (bad) just to get

            rid of a branch (which was probably cheap to begin with).</p>

          <p>Hence, what you want to do - one way or another, regardless

            of whether this proposed intrinsic is added - is to branch

            on the corner case condition, and have the slow case of the

            branch go to a basic block that deoptimizes.  Unless of

            course you have profiling that says that the case does

            happen often, in which case you can have that basic block

            handle the corner case inline without leaving optimized code

            (FWIW, we do have such paths in WebKit and they are useful).</p>

          <p>So the question for me is whether the branching involves

            explicit control flow or is hidden inside an intrinsic.  I

            prefer for it to be within an intrinsic because it:</p>

          <p>- allows the optimizer to do more interesting things in the

            common cases, like hoisting the entire division.</p>

          <p>- will give us a clearer path for implementing trapping

            optimizations in the future.</p>

          <p>- is an immediate win on ARM.</p>

          <p>I'd be curious to hear what specific idea you have about

            how to implement trap-based deoptimization with your

            trapping division intrinsic for x86 - maybe it's different

            from the two "bad" idioms I showed above.</p>

          <p>Finally, as for performance data, which part of this do you

            want performance data for?  I concede that I don't have

            performance data for using Michael's new intrinsic.  Part of

            what the intrinsic accomplishes is it gives a less ugly way

            of doing something that is already possible with target

            intrinsics on ARM.  I think it would be great if you could

            get those semantics - along with a known-good implementation

            - on other architectures as well.</p>

          <p>But this discussion has also involved suggestions that we

            should use trapping to implement deoptimization, and the

            main point of my message is to strongly argue against

            anything like this given the current state of trapping

            semantics and how patchpoints work.  My point is that using

            traps for division corner cases would either be unsound (see

            the stackmap after the trap, above), or would require you to

            do things that are obviously inefficient.  If you truly

            believe that the branch to detect division slow paths is

            more expensive than spilling all bytecode state to the stack

            or using a patchpoint for the division, then I could

            probably hack something up in WebKit to show you the

            performance implications.  (Or you could do it yourself, the

            code is open source...)</p>

          <p>-Filip</p>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>