<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">On 04/29/2014 12:39 PM, Filip Pizlo

      wrote:<br>

    </div>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <style>body{font-family:Helvetica,Arial;font-size:13px}</style>On

      April 29, 2014 at 11:27:06 AM, Philip Reames (<a

        moz-do-not-send="true" href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)

      wrote:

      <div>

        <blockquote type="cite" class="clean_bq" style="color: rgb(0, 0,

          0); font-family: Helvetica, Arial; font-size: 13px;

          font-style: normal; font-variant: normal; font-weight: normal;

          letter-spacing: normal; line-height: normal; orphans: auto;

          text-align: start; text-indent: 0px; text-transform: none;

          white-space: normal; widows: auto; word-spacing: 0px;

          -webkit-text-stroke-width: 0px; background-color: rgb(255,

          255, 255);"><span>

            <div bgcolor="#FFFFFF" text="#000000">

              <div>

                <div class="moz-cite-prefix">On 04/29/2014 10:44 AM,

                  Filip Pizlo wrote:<br>

                </div>

                <blockquote

                  cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

                  type="cite">

                  <div id="bloop_customfont" style="font-family:

                    Helvetica, Arial; font-size: 13px; color: rgb(0, 0,

                    0); margin: 0px;">LD;DR: Your desire to use trapping

                    on x86 only further convinces me that Michael's

                    proposed intrinsics are the best way to go.</div>

                </blockquote>

                I'm still not convinced, but am not going to actively

                oppose it either.  I'm leery of designing a solution

                with major assumptions we don't have data to backup. <br>

                <br>

                I worry your assumptions about deoptimization are

                potentially unsound.  But I don't have data to actually

                show this (yet).</div>

            </div>

          </span></blockquote>

      </div>

      <p>I *think* I may have been unclear about my assumptions; in

        particular, my claims with respect to deoptimization are

        probably more subtle than they appeared.  WebKit can use LLVM

        and it has divisions and we do all possible

        deoptimization/profiling/etc tricks for it, so this is grounded

        in experience.  Forgive me if the rest of this e-mail contains a

        lecture on things that are obvious - I'll try to err on the side

        of clarity and completeness since this discussion is

        sufficiently dense that we run the risk of talking

        cross-purposes unless some baseline assumptions are established.</p>

    </blockquote>

    I think we're using the same terminology, but with slightly

    different sets of assumptions.  I'll point this out below where

    relevant.  <br>

    <br>

    Also, thanks for taking the time to expand.  It help clarify the

    discussion quite a bit.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <blockquote type="cite" class="clean_bq" style="color: rgb(0,

            0, 0); font-family: Helvetica, Arial; font-size: 13px;

            font-style: normal; font-variant: normal; font-weight:

            normal; letter-spacing: normal; line-height: normal;

            orphans: auto; text-align: start; text-indent: 0px;

            text-transform: none; white-space: normal; widows: auto;

            word-spacing: 0px; -webkit-text-stroke-width: 0px;

            background-color: rgb(255, 255, 255);"><span>

              <div bgcolor="#FFFFFF" text="#000000">

                <div><br>

                  <br>

                  <blockquote

                    cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

                    type="cite"><br>

                    <p style="color: rgb(0, 0, 0);">On April 29, 2014 at

                      10:09:49 AM, Philip Reames (<a

                        moz-do-not-send="true"

                        href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)

                      wrote:</p>

                    <div>

                      <blockquote type="cite" class="clean_bq"

                        style="color: rgb(0, 0, 0); font-family:

                        Helvetica, Arial; font-size: 13px; font-style:

                        normal; font-variant: normal; font-weight:

                        normal; letter-spacing: normal; line-height:

                        normal; orphans: auto; text-align: start;

                        text-indent: 0px; text-transform: none;

                        white-space: normal; widows: auto; word-spacing:

                        0px; -webkit-text-stroke-width: 0px;

                        background-color: rgb(255, 255, 255);">

                        <div bgcolor="#FFFFFF" text="#000000">

                          <div><span>As the discussion has progressed

                              and I've spent more time thinking about

                              the topic, I find myself less and less

                              enthused about the current proposal.  I'm

                              in full support of having idiomatic ways

                              to express safe division, but I'm starting

                              to doubt that using an intrinsic is the

                              right way at the moment.<br>

                              <br>

                              One case I find myself thinking about is

                              how one would combine profiling

                              information and implicit

                              div-by-zero/overflow checks with this

                              proposal.  I don't really see a clean

                              way.  Ideally, for a "safe div" which

                              never has the exceptional paths taken,

                              you'd like to completely do away with the

                              control flow entirely.  (And rely on

                              hardware traps w/exceptions instead.)  I

                              don't really see a way to represent that

                              type of construct given the current

                              proposal. </span></div>

                        </div>

                      </blockquote>

                    </div>

                    <p>This is a deeper problem and to solve it you'd

                      need a solution to trapping in general.  Let's

                      consider the case of Java.  A Java program may

                      want to catch the arithmetic exception due to

                      divide by zero.  How would you do this with a trap

                      in LLVM IR?  Spill all state that is live at the

                      catch?  Use a patchpoint for the entire division

                      instruction?</p>

                  </blockquote>

                  We'd likely use something similar to a patchpoint. 

                  You'd need the "abstract vm state" (which is not the

                  compiled frame necessarily) available at the div

                  instruction.  You could then re-enter the interpreter

                  at the specified index (part of the vm state).  We

                  have all most of these mechanisms in place.  Ideally,

                  you'd trigger a recompile and otherwise ensure

                  re-entry into compiled code at the soonest possible

                  moment. <br>

                  <br>

                  This requires a lot of runtime support, but we already

                  have most of it implemented for another compiler. 

                  From our perspective, the runtime requirements are not

                  a major blocker. </div>

              </div>

            </span></blockquote>

        </div>

        <p>Right, you'll use a patchpoint.  That's way more expensive

          than using a safe division intrinsic with branches, because

          it's opaque to the optimizer.</p>

      </div>

    </blockquote>

    This statement is true at the moment, but it shouldn't be.  I think

    this is our fundamental difference in approach.  <br>

    <br>

    You should be able to write something like:<br>

    i32 %res = invoke patchpoint (... x86_trapping_divide, a, b)

    normal_dest invoke_dest<br>

    <br>

    normal_dest:<br>

      ;; use %res<br>

    invoke_dest:<br>

      landingpad<br>

      ;; dispatch edge cases<br>

      ;; this could be unreachable code if you deopt this frame in the

    trap handler and jump directly to an interpreter or other bit of

    code<br>

    <br>

    A patchpoint should not require any excess spilling.  If values are

    live in registers, that should be reflected in the stack map.  (I do

    not know if this is the case for patchpoint at the moment or not.)<br>

    <br>

    The Value called by a patchpoint should participate in optimization

    normally.  We really want the patchpoint part of the call to be

    supplemental.  It should still be a call.  It should be constant

    propagated, transformed, etc..  This is not the case currently. 

    I've got a couple of off the wall ideas for improving the current

    status, but I'll agree this is a hardish problem.  <br>

    <br>

    It should be legal to use a patchpoint in an invoke.  It's an ABI

    issue of how the invoke path gets invoked.  (i.e. side tables for

    the runtime to lookup, etc..)  This is not possible today, and

    probably requires a fair amount of work.  Some of it, I've already

    done and will be sharing shortly.  Other parts, I haven't even

    thought about.  <br>

    <br>

    If you didn't want to use the trapping semantics, you'd insert

    dedicated control flow _before_ the divide.  This would allow normal

    optimization of the control flow.  <br>

    <br>

    Notes:<br>

    1) This might require a new PATCHPOINT pseudo op in the backend. 

    Haven't thought much about that yet.<br>

    2) I *think* your current intrinsic could be translated into

    something like this.  (Leaving aside the question of where the deopt

    state comes from.)  In fact, the more I look at this, the less

    difference I actually see between the approaches.  <br>

    <br>

    <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <div>

            <blockquote type="cite" class="clean_bq" style="color:

              rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:

              13px; font-style: normal; font-variant: normal;

              font-weight: normal; letter-spacing: normal; line-height:

              normal; orphans: auto; text-align: start; text-indent:

              0px; text-transform: none; white-space: normal; widows:

              auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;

              background-color: rgb(255, 255, 255);"><span>

                <div bgcolor="#FFFFFF" text="#000000">

                  <div>

                    <blockquote

                      cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

                      type="cite">

                      <p><br class="Apple-interchange-newline">

                        In a lot of languages, a divide produces some

                        result even in the exceptional case and this

                        result requires effectively deoptimizing since

                        the resut won't be the one you would have

                        predicted (double instead of int, or BigInt

                        instead of small int), which sort of means that

                        if the CPU exception occurs you have to be able

                        to reconstruct all state.  A patchpoint could do

                        this, and so could spilling all state to the

                        stack before the divide - but both are very

                        heavy hammers that are sure to be more expensive

                        than just doing a branch.</p>

                    </blockquote>

                    This isn't necessarily as expensive as you might

                    believe.  I'd recommend reading the Graal project

                    papers on this topic.<br>

                    <br>

                    Whether deopt or branching is more profitable *in

                    this case*, I can't easily say.  I'm not yet to the

                    point of being able to run that experiment.  We can

                    argue about what "should" be better all we want, but

                    real performance data is the only way to truly

                    know. </div>

                </div>

              </span></blockquote>

          </div>

          <p>My point may have been confusing.  I know that

            deoptimization is cheap and WebKit uses it everywhere,

            including division corner cases, if profiling tells us that

            it's profitable to do so (which it does, in the common

            case).  WebKit is a heavy user of deoptimization in general,

            so you don't need to convince me that it's worth it.</p>

        </div>

      </div>

    </blockquote>

    Acknowledged.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>Note that I want *both* deopt *and* branching, because in

            this case, a branch is the fastest overall way of detecting

            when to deopt.  In the future, I will want to implement the

            deopt in terms of branching, and when we do this, I believe

            that the most sound and performat approach would be using

            Michael's intrinsics.  This is subtle and I'll try to

            explain why it's the case.</p>

          <p>The point is that you wouldn't want to do deoptimization by

            spilling state on the main path or by using a patchpoint for

            the main path of the division.</p>

        </div>

      </div>

    </blockquote>

    This is the main point I disagree with.  I don't believe that having

    a patchpoint on the main path should be any more expensive then the

    original call.  (see above)<br>

    <br>

    Worth noting explicitly: I'm assuming that all of your deopt state

    would already be available for other purposes in nearby code.  It's

    on the stack or in registers.  I'm assuming that by adding the deopt

    point, you are not radically changing the set of computations which

    need to be done.  If that's not the case, you should avoid deopt and

    instead just inline the slow paths with explicit checks.  <br>

    <br>

    I'll note that given your assumptions about the cost of a

    patchpoint, the rest of your position makes a lot more sense.  :) 

    As I spelled out above, I believe this cost is not fundamental.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>You don't want the common path of executing the division to

            involve a patchpoint instruction, although using a

            patchpoint or stackmap to implement deoptimization on the

            failing path is great:</p>

          <p><b>Good: </b>if (division would fail) { call

            @patchpoint(all of my state) } else { result = a / b }</p>

        </div>

      </div>

    </blockquote>

    Given your cost assumptions, I'd agree.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p><b>Bad: </b>call @patchpoint(all of my state) // patch

            with a divide instruction - bad because the optimizer has no

            clue what you're doing and assumes the very worst</p>

        </div>

      </div>

    </blockquote>

    Yuck.  Agreed.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p><b>Worse: </b>spill all state to the stack; call

            @trapping.div(a, b) // the spills will hurt you far more

            than a branch, so this should be avoided</p>

        </div>

      </div>

    </blockquote>

    I'm assuming this is an explicit spill rather than simply recording

    a stack map *at the div*.  If so, agreed.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>I suppose we could imagine a fourth option that involves a

            patchpoint to pick up the state and a trapping divide

            instrinsic.  But a trapping divide intrinsic alone is not

            enough.  Consider this:</p>

          <p>result = call @trapping.div(a, b); call @stackmap(all of my

            state)</p>

          <p>As soon as these are separate instructions, you have no

            guarantees that the state that the stackmap reports is sound

            for the point at which the div would trap.  <br>

          </p>

        </div>

      </div>

    </blockquote>

    This is the closest to what I'd propose, except that the two calls

    would be merged into a single patchpoint.  Isn't the entire point of

    a patchpoint to record the stack map for a call?  (Well, ignoring

    the actual patching part..)  Why not write this as:<br>

    patchpoint(..., trapping.div, a, b);<br>

    <br>

    Is there something I'm missing here?<br>

    <br>

    Just to note: I fully agree that the two call proposal is unsound

    and should be avoided.  <br>

    <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>So, the division itself shouldn't be a trapping instruction

            and instead you want to detect the bad case with a branch.</p>

          <p>To be clear:</p>

          <p>- Whether you use deoptimization for division or anything

            else - like WebKit has done since before any of the Graal

            papers were written - is mostly unrelated to how you

            represent the division, unless you wanted to add a new

            intrinsic that is like a trapping-division-with-stackmap:</p>

          <p>result = call @trapping.div.with.stackmap(a, b, ... all of

            my state ...)</p>

          <p>Now, maybe you do want such an intrinsic, in which case you

            should propose it!  <br>

          </p>

        </div>

      </div>

    </blockquote>

    Given what we already have with patchpoints, I don't think a merged

    intrinsic is necessary.  (See above).  I believe we have all the

    parts to build this solution, and that - in theory - they should

    compose neatly.<br>

    <br>

    p.s. The bits I was referring to was not deopt per se.  It was

    particularly which set of deopt state you used for each deopt

    point.  That's a bit of tangent for the rest of the discussion now

    though.  <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>The reason why I haven't proposed it is that I think that

            long-term, the currently proposed intrinsics are a better

            path to getting the trapping optimizations.  See my previous

            mail, where I show how we could tell LLVM what the failing

            path is (which may have deoptimization code that uses a

            stackmap or whatever), what the trapping predicate is (it

            comes from the safe.div intrinsic), and the fact that

            trapping is wise (branch weights).</p>

          <p>- If you want to do the deoptimization with a trap, then

            your only choice currently is to use a patchpoint for the

            main path of the division.  This will be slower than using a

            branch to an OSR exit basic block, because you're making the

            division itself opaque to the optimizer (bad) just to get

            rid of a branch (which was probably cheap to begin with).</p>

          <p>Hence, what you want to do - one way or another, regardless

            of whether this proposed intrinsic is added - is to branch

            on the corner case condition, and have the slow case of the

            branch go to a basic block that deoptimizes.  Unless of

            course you have profiling that says that the case does

            happen often, in which case you can have that basic block

            handle the corner case inline without leaving optimized code

            (FWIW, we do have such paths in WebKit and they are useful).</p>

          <p>So the question for me is whether the branching involves

            explicit control flow or is hidden inside an intrinsic.  I

            prefer for it to be within an intrinsic because it:</p>

          <p>- allows the optimizer to do more interesting things in the

            common cases, like hoisting the entire division.</p>

          <p>- will give us a clearer path for implementing trapping

            optimizations in the future.</p>

          <p>- is an immediate win on ARM.</p>

          <p>I'd be curious to hear what specific idea you have about

            how to implement trap-based deoptimization with your

            trapping division intrinsic for x86 - maybe it's different

            from the two "bad" idioms I showed above.</p>

        </div>

      </div>

    </blockquote>

    I hope my explanation above helps.  If not, ask, and I'll try to

    explain more clearly.  <br>

    <br>

    One point just for clarity; I don't believe this effects the

    conclusions of our discussion so far.  I'm also fairly sure that you

    (Filip) are aware of this, but want to spell it out for other

    readers.  <br>

    <br>

    You seem to be assuming that compiled code needs to explicitly

    branch to a point where deopt state is known to exit a compiled

    frame.  Worth noting is that you can also exit a compiled frame on a

    trap (without an explicitly condition/branch!) if the deopt state is

    known at the point you take the trap.  This "exit frame on trap"

    behavior shows up with null pointer exceptions as well.  I'll note

    that both compilers in OpenJDK support some combination of

    "exit-on-trap" conditions for division and null dereferences.  The

    two differ on exactly what strategies they use in each case though. 

    :)<br>

    <br>

    I'm not really arguing that either scheme is "better" in all cases. 

    I'm simply arguing that we should support both and allow

    optimization and tuning between them.  As far as I can tell, you

    seem to be assuming that an explicit branch to known exit point is

    always superior. <br>

    <br>

    <br>

    Ok, back to the topic at hand...<br>

    <br>

    With regards to the current proposal, I'm going to take a step

    back.  You guys seem to have already looked in this in a fair amount

    of depth.  I'm not necessarily convinced you've come to the best

    solution, but at some point, we need to make forward progress.  What

    you have is clearly better than nothing.  <br>

    <br>

    Please go ahead and submit your current approach.  We can come back

    and revise later if we really need to.  <br>

    <br>

    I do request the following changes:<br>

    - Mark it clearly as experimental.<br>

    - Either don't specify the value computed in the edge cases, or

    allow those values to be specified as constant arguments to the

    call.  This allows efficient lowering to x86's div instruction if

    you want to make use of the trapping semantics.  <br>

    <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>Finally, as for performance data, which part of this do you

            want performance data for?  I concede that I don't have

            performance data for using Michael's new intrinsic.  Part of

            what the intrinsic accomplishes is it gives a less ugly way

            of doing something that is already possible with target

            intrinsics on ARM.  I think it would be great if you could

            get those semantics - along with a known-good implementation

            - on other architectures as well.</p>

        </div>

      </div>

    </blockquote>

    I would be very interested in seeing data comparing two schemes:<br>

    - Explicit control flow emited by the frontend<br>

    - The safe.div intrinsic emitted by the frontend, desugared in

    CodeGenPrep<br>

    <br>

    My strong suspicion is that each would preform well in some cases

    and not in others.  At least on x86.  Since the edge-checks are

    essentially free on ARM, the second scheme would probably be

    strictly superior there.  <br>

    <br>

    I am NOT asking that we block submission on this data however.  <br>

    <br>

    <blockquote cite="mid:etPan.5360000d.4e6afb66.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>But this discussion has also involved suggestions that we

            should use trapping to implement deoptimization, and the

            main point of my message is to strongly argue against

            anything like this given the current state of trapping

            semantics and how patchpoints work.  My point is that using

            traps for division corner cases would either be unsound (see

            the stackmap after the trap, above), or would require you to

            do things that are obviously inefficient.  If you truly

            believe that the branch to detect division slow paths is

            more expensive than spilling all bytecode state to the stack

            or using a patchpoint for the division, then I could

            probably hack something up in WebKit to show you the

            performance implications.  (Or you could do it yourself, the

            code is open source...)</p>

        </div>

      </div>

    </blockquote>

    In a couple of months, I'll probably have the performance data to

    discuss this for real.  When that happens, let's pick this up and

    continue the debate.  Alternatively, if you want to chat this over

    more with a beer in hand at the social next week, let me know.  In

    the meantime, let's not stall the current proposal any more.  <br>

    <br>

    Philip<br>

    <br>

  </body>

</html>