<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <br>

    <div class="moz-cite-prefix">On 04/29/2014 10:44 AM, Filip Pizlo

      wrote:<br>

    </div>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <style>body{font-family:Helvetica,Arial;font-size:13px}</style>

      <div id="bloop_customfont"

        style="font-family:Helvetica,Arial;font-size:13px; color:

        rgba(0,0,0,1.0); margin: 0px; line-height: auto;">LD;DR: Your

        desire to use trapping on x86 only further convinces me that

        Michael's proposed intrinsics are the best way to go.</div>

    </blockquote>

    I'm still not convinced, but am not going to actively oppose it

    either.  I'm leery of designing a solution with major assumptions we

    don't have data to backup.  <br>

    <br>

    I worry your assumptions about deoptimization are potentially

    unsound.  But I don't have data to actually show this (yet).<br>

    <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite"> <br>

      <p style="color:#000;">On April 29, 2014 at 10:09:49 AM, Philip

        Reames (<a moz-do-not-send="true"

          href="mailto:listmail@philipreames.com">listmail@philipreames.com</a>)

        wrote:</p>

      <div>

        <blockquote type="cite" class="clean_bq" style="color: rgb(0, 0,

          0); font-family: Helvetica, Arial; font-size: 13px;

          font-style: normal; font-variant: normal; font-weight: normal;

          letter-spacing: normal; line-height: normal; orphans: auto;

          text-align: start; text-indent: 0px; text-transform: none;

          white-space: normal; widows: auto; word-spacing: 0px;

          -webkit-text-stroke-width: 0px; background-color: rgb(255,

          255, 255);"><span>

            <div bgcolor="#FFFFFF" text="#000000">

              <div>As the discussion has progressed and I've spent more

                time thinking about the topic, I find myself less and

                less enthused about the current proposal.  I'm in full

                support of having idiomatic ways to express safe

                division, but I'm starting to doubt that using an

                intrinsic is the right way at the moment.<br>

                <br>

                One case I find myself thinking about is how one would

                combine profiling information and implicit

                div-by-zero/overflow checks with this proposal.  I don't

                really see a clean way.  Ideally, for a "safe div" which

                never has the exceptional paths taken, you'd like to

                completely do away with the control flow entirely.  (And

                rely on hardware traps w/exceptions instead.)  I don't

                really see a way to represent that type of construct

                given the current proposal. </div>

            </div>

          </span></blockquote>

      </div>

      <p>This is a deeper problem and to solve it you'd need a solution

        to trapping in general.  Let's consider the case of Java.  A

        Java program may want to catch the arithmetic exception due to

        divide by zero.  How would you do this with a trap in LLVM IR?

         Spill all state that is live at the catch?  Use a patchpoint

        for the entire division instruction?</p>

    </blockquote>

    We'd likely use something similar to a patchpoint.  You'd need the

    "abstract vm state" (which is not the compiled frame necessarily)

    available at the div instruction.  You could then re-enter the

    interpreter at the specified index (part of the vm state).  We have

    all most of these mechanisms in place.  Ideally, you'd trigger a

    recompile and otherwise ensure re-entry into compiled code at the

    soonest possible moment.  <br>

    <br>

    This requires a lot of runtime support, but we already have most of

    it implemented for another compiler.  From our perspective, the

    runtime requirements are not a major blocker.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <p>In a lot of languages, a divide produces some result even in

        the exceptional case and this result requires effectively

        deoptimizing since the resut won't be the one you would have

        predicted (double instead of int, or BigInt instead of small

        int), which sort of means that if the CPU exception occurs you

        have to be able to reconstruct all state.  A patchpoint could do

        this, and so could spilling all state to the stack before the

        divide - but both are very heavy hammers that are sure to be

        more expensive than just doing a branch.</p>

    </blockquote>

    This isn't necessarily as expensive as you might believe.  I'd

    recommend reading the Graal project papers on this topic.<br>

    <br>

    Whether deopt or branching is more profitable *in this case*, I

    can't easily say.  I'm not yet to the point of being able to run

    that experiment.  We can argue about what "should" be better all we

    want, but real performance data is the only way to truly know.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <p>For these reasons, WebKit has long gone with the approach of

        emitting branches even in our custom JITs; they are cheaper than

        the costs incurred by spilling or other craziness.  My bet is

        that the cheapest implementation in LLVM will involve some kind

        of branching on x86.</p>

    </blockquote>

    You may be right.  I simply don't want to design a system which

    assumes you are until I have data in hand.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <blockquote type="cite" class="clean_bq" style="color: rgb(0,

            0, 0); font-family: Helvetica, Arial; font-size: 13px;

            font-style: normal; font-variant: normal; font-weight:

            normal; letter-spacing: normal; line-height: normal;

            orphans: auto; text-align: start; text-indent: 0px;

            text-transform: none; white-space: normal; widows: auto;

            word-spacing: 0px; -webkit-text-stroke-width: 0px;

            background-color: rgb(255, 255, 255);"><span>

              <div bgcolor="#FFFFFF" text="#000000">

                <div><br>

                  <br>

                  Another point that is bothering me is how these

                  intrinsics will interact with loop invariant code

                  motion, and other condition check optimizations.  We

                  have the potential to hide control flow from

                  optimizations which would otherwise remove it.  I'm

                  not stating these<span class="Apple-converted-space"> </span><i>aren't</i><span

                    class="Apple-converted-space"> </span>beneficial. 

                  I'm simply stating that I remain somewhat

                  unconvinced.  They seem like clear wins on some

                  architectures (i.e. ARM, where the hardware includes

                  the specific checks already), but not on others (i.e.

                  x86 with it's exceptions). </div>

              </div>

            </span></blockquote>

        </div>

        <p>I don't follow this.  One of the proposed intrinsic's

          advantages is precisely to help with things like loop

          invariant code motion by not emitting confusing control flow

          early.</p>

      </div>

    </blockquote>

    Er, not sure about this.  Consider:<br>

    for(int i = 0; i < n; i++) {<br>

      a[i] = a[i] / b;<br>

    }<br>

    <br>

    Yes, if b is a constant the conditions immediately fail away.  <br>

    <br>

    If you can't prove the bounds on b, you'd still like to transform

    this to:<br>

    if( b == 0 ) throw;<br>

    for(int i = 0; i < n; i++) {<br>

      if( a[i] == INT_MIN && b == -1) throw;<br>

      a[i] = a[i] / b;<br>

    }<br>

    <br>

    You might even want to split the loop into two versions (i.e. b is

    -1, otherwise).  These transforms wouldn't matter on ARM, but do

    matter on x86.  <br>

    <br>

    Even if you're using a deopt scheme mentioned above, this still

    could be profitable.  The frequencies of the two edge cases aren't

    necessarily correlated.  (i.e. you could be trying to divide by zero

    a lot)<br>

    <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <p>As for condition check optimizations, this has too low of

          profit for me to consider.  It's true that if you can prove

          that one of the operands to a division is a constant, then you

          probably don't want to emit all of the branches, and surely

          the proposed intrinsic allows for this.  But you're less

          likely to see code dividing by the same value repeatedly.  Or

          so goes my experience with this, anyway.</p>

      </div>

    </blockquote>

    I suspect your workloads may differ from mine in this area.  I also

    need to restate: I haven't spent much time looking at this yet.  I

    don't actually know if any of this matters.  Looking through the

    source code for the existing compilers, I suspect it does.  But

    that's not conclusive evidence.  <br>

    <br>

    p.s. w.r.t to terminology, I was being somewhat sloppy.  By

    condition check optimization, I meant everything from simplifycfg,

    cvp, and others.<br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <div>

            <blockquote type="cite" class="clean_bq" style="color:

              rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:

              13px; font-style: normal; font-variant: normal;

              font-weight: normal; letter-spacing: normal; line-height:

              normal; orphans: auto; text-align: start; text-indent:

              0px; text-transform: none; white-space: normal; widows:

              auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;

              background-color: rgb(255, 255, 255);"><span>

                <div bgcolor="#FFFFFF" text="#000000">

                  <div><br>

                    <br>

                    Given the above, I'm going withdraw my active

                    support from the current proposal.  (Note: This does

                    not mean that I'm going to actively oppose it

                    either!)<br>

                    <br>

                    Let me make a counter proposal of my own.<br>

                    <br>

                    1) Let's not add generic intrinsics at this time. 

                    Instead, let's add cpu specific intrinsics with the

                    exact semantics of the associated div instructions.<br>

                      a) for ARM, this would be the semantics of the

                    current proposed instruction (right?)<br>

                      b) for x86, this would be an instruction which is

                    defined to fault on the corner cases (the resume

                    behaviour would<span class="Apple-converted-space"> </span><i>not</i><span

                      class="Apple-converted-space"> </span>be specified

                    for the moment)</div>

                </div>

              </span></blockquote>

          </div>

          <p>I don't like 1.b because the only thing you'd be able to do

            with the fault in the runtime is:</p>

          <p>- Unwind the entire function if you're careful enough and

            you don't mind losing all of its state.  This woud only

            apply to Java-like languages, and only if they didn't have

            any finally/catch(all)/catch(arithmetic) statements in the

            surrounding scope.</p>

          <p>- "interpret" the division to get your own semantics and

            resume execution, if your language allows for division to

            return an integer in the corner cases (i.e. JavaScript's

            (a/b)|0).  I concede that a trapping intrinsic would allow

            you to make this one case work, in cases where your

            profiling told you that the corner cases are rare.</p>

          <p>- Arrange to spill all state to the stack just to be able

            to throw an exception and/or deoptimize - i.e. you're crazy

            enough to think that the branch you just saved was more

            expensive than a bunch of stack traffic.</p>

          <p>I believe that the currently proposed intrinsics give us

            more of what we want on x86, than does your proposal, since

            they are more general, and since the branches will be

            cheaper than spilling everything to the stack.</p>

        </div>

      </div>

    </blockquote>

    I was trying to avoid getting into all of the details around deopt

    and resuming from traps.  :)  I agree this is a complex area, and

    frankly, it's not one I've explored enough in LLVM to have strong

    opinions on yet.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>To be clear, trap-based optimizations are something that we

            may eventually want to support but they require more magic

            unrelated to the specific issues of division.  I suspect

            that the nicest way of expressing trap-based optimizations

            in IR is to:</p>

          <p>- Have a branch in IR that is weighted in such a way as to

            convey to LLVM that we believe that one side is never taken

            and that therefore implementing that side in terms of traps

            is OK.</p>

          <p>- The thing being branched on is a predicate that is known

            to LLVM to be something that can be trapped on.</p>

        </div>

      </div>

    </blockquote>

    I agree with the above.  I think.  (I reserve the right to change my

    mind later...)<br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>Notice that the proposed intrinsics are *the best* way of

            achieving this, once we have a way of optimizing such

            branches.  The LLVM backend will know that the i1's returned

            from the safe.div intrinsic are "trappable" on x86, and the

            frontend can arrange to weight the branches on those i1's

            appropriately if it wants a trap-based implementation.

             Better yet, the frontend could simply feed whatever

            profiling it wants ("I saw the corner case X times and the

            main case Y times") and LLVM can decide what to do on its

            own.  The fact that the machanism by which execution is

            diverted to the "trapping" basic block is by way of a trap

            could even be contained entirely within LLVM's innards.</p>

        </div>

      </div>

    </blockquote>

    I agree this sounds like a potentially workable design.  <br>

    <br>

    OT: Do we want to be implementing the trapping logic inside LLVM?  I

    would guess not.<br>

    <br>

    I actually agree with you on the general approach.  However, I'm not

    sure we're ready to get into this yet.  That's why I was proposing a

    simpler starting point.  <br>

    <br>

    My concern is how to way to balance the IR level CFG optimizations

    above with the trapping semantics and profiling data.  Until someone

    actually builds something, this is all *really* speculative.  I'd

    like to not lock down a design until we have *data* and *experience

    in LLVM*.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <p>This could be used to support not only the unwinding

            semantics in Java on div/0, but also the (a/b)|0 in JS, and

            all of the others.  But first we need an intrinsic that

            reveals some predicate that is associated with the trapping

            condition, rather than having an intrinsic that is defined

            to simply trap.</p>

          <p>(Notice also how my approach to exposing trapping in LLVM

            could handle things like null checks - if LLVM sees a branch

            on a pointer to see if it's null in the close vicinity of a

            memory access, and the branch is weighted 100% in favor of

            non-null, then it can "fold" the branch and load together

            and use a trap.  I imagine this being super easy to

            implement and I'd be happy to say more about how I'd do it.

             So, to me, the issue with division isn't that we don't know

            how to trap on it - it's that we don't have a crystal-clear

            way of revealing the trapping conditions in IR.)</p>

        </div>

      </div>

    </blockquote>

    Agreed, particularly with the last statement.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <div>

            <div>

              <blockquote type="cite" class="clean_bq" style="color:

                rgb(0, 0, 0); font-family: Helvetica, Arial; font-size:

                13px; font-style: normal; font-variant: normal;

                font-weight: normal; letter-spacing: normal;

                line-height: normal; orphans: auto; text-align: start;

                text-indent: 0px; text-transform: none; white-space:

                normal; widows: auto; word-spacing: 0px;

                -webkit-text-stroke-width: 0px; background-color:

                rgb(255, 255, 255);"><span>

                  <div bgcolor="#FFFFFF" text="#000000">

                    <div><br>

                      1b) Teach the optimizer about the semantics of

                      each - if we do go with a unified safe.div

                      instruction later, this effort could be mostly

                      reused.</div>

                  </div>

                </span></blockquote>

            </div>

            <p>Except for x86, where your proposal above doesn't work.</p>

          </div>

        </div>

      </div>

    </blockquote>

    Er, how so?  Same as the reasons we've discussed?  Or different?<br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <div>

            <div>

              <div>

                <blockquote type="cite" class="clean_bq" style="color:

                  rgb(0, 0, 0); font-family: Helvetica, Arial;

                  font-size: 13px; font-style: normal; font-variant:

                  normal; font-weight: normal; letter-spacing: normal;

                  line-height: normal; orphans: auto; text-align: start;

                  text-indent: 0px; text-transform: none; white-space:

                  normal; widows: auto; word-spacing: 0px;

                  -webkit-text-stroke-width: 0px; background-color:

                  rgb(255, 255, 255);"><span>

                    <div bgcolor="#FFFFFF" text="#000000">

                      <div><br>

                        2) Add optimizations (fairly late) which fold

                        generic-div+control flow into machine specific

                        div instructions.</div>

                    </div>

                  </span></blockquote>

              </div>

              <p>This is purely aspirational.  There are many ways of

                doing the control flow for the division since there is

                quite a bit of it.  Are you going to pattern match all

                of them?  What if you miss a case because one of the

                many IR optimizations transformed it into something that

                is no longer recognizable?  Or are you going to impose

                rules for canonicalizing any control flow around any

                division to ensure that a div-by-zero and INT_MIN/-1

                check never stops looking like the pattern that you

                wanted it to look like?</p>

              <p>To me this sounds kind of crazy.  Once the frontend has

                determined that it wants safe division, it should be

                able to say this explicitly enough that this information

                is preserved through compilation.</p>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    You don't need to recognize all patterns; you don't even want to. 

    See above for LICM and related.<br>

    <br>

    I suspect this is less hard than it initially sounds.  Particularly

    since you really only need to identify cases in hot code.  If the

    control flow was moved outside of the loop, it is almost certainly

    not hot any more.  <br>

    <br>

    Note: This scheme is definitely biased towards x86.  It probably is

    NOT the optimal scheme for ARM.  I freely acknowledge that.  I'm

    just not happy with the way the previous proposal seems to prefer

    ARM over x86.  <br>

    <blockquote cite="mid:etPan.535fe4f0.140e0f76.172db@dethklok.local"

      type="cite">

      <div>

        <div>

          <div>

            <div>

              <div>

                <blockquote type="cite" class="clean_bq" style="color:

                  rgb(0, 0, 0); font-family: Helvetica, Arial;

                  font-size: 13px; font-style: normal; font-variant:

                  normal; font-weight: normal; letter-spacing: normal;

                  line-height: normal; orphans: auto; text-align: start;

                  text-indent: 0px; text-transform: none; white-space:

                  normal; widows: auto; word-spacing: 0px;

                  -webkit-text-stroke-width: 0px; background-color:

                  rgb(255, 255, 255);"><span>

                    <div bgcolor="#FFFFFF" text="#000000">

                      <div><br>

                        3) Frontends that want machine specific

                        semantics (i.e. trap or defined edge cases), can

                        use the machine specific intrinsics directly. <br>

                        4) We provide a set of IR helper functions which

                        implement safe division in what we believe to be

                        the best way.  This can be different for each

                        supported platform.  This could either be the

                        form of sample IR, or even as functions on

                        IRBuilder. <br>

                        5) We spend some time optimizing the IR

                        generated by (4) and use that to inform our

                        discussion about what is truly necessary from a

                        performance standpoint. <br>

                        <br>

                        Once this is in place, we can come back to the

                        discussion of what an appropriate generic safe

                        div intrinsic would look like.  In particular,

                        we would have the performance data at hand to

                        lay to rest the concerns raised by myself and

                        Eric.  My strong suspicion is that the counter

                        proposal will be "good enough" in practice for

                        most users. <br>

                        <br>

                        p.s. If we do go ahead with the current

                        proposal, I'm going to *strongly* request they

                        be marked experimental.  I think we'll need to

                        iterate on these a few times. <br>

                        <br>

                        Philip<br>

                        <br>

                      </div>

                    </div>

                  </span></blockquote>

              </div>

            </div>

          </div>

        </div>

      </div>

    </blockquote>

    <br>

  </body>

</html>