[LLVMdev] Proposal: add intrinsics for safe division

Eric Christopher echristo at gmail.com
Tue Apr 29 11:30:34 PDT 2014


I have nothing to say that Philip hasn't already said, so I'll just add
a "me too".

Thanks!

-eric

On Tue, Apr 29, 2014 at 11:27 AM, Philip Reames
<listmail at philipreames.com> wrote:
>
> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>
> TL;DR: Your desire to use trapping on x86 only further convinces me that
> Michael's proposed intrinsics are the best way to go.
>
> I'm still not convinced, but am not going to actively oppose it either.  I'm
> leery of designing a solution around major assumptions we don't have data to
> back up.
>
> I worry your assumptions about deoptimization are potentially unsound.  But
> I don't have data to actually show this (yet).
>
>
>
> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com)
> wrote:
>
> As the discussion has progressed and I've spent more time thinking about the
> topic, I find myself less and less enthused about the current proposal.  I'm
> in full support of having idiomatic ways to express safe division, but I'm
> starting to doubt that using an intrinsic is the right way at the moment.
>
> One case I find myself thinking about is how one would combine profiling
> information and implicit div-by-zero/overflow checks with this proposal.  I
> don't really see a clean way.  Ideally, for a "safe div" whose exceptional
> paths are never taken, you'd like to do away with the control flow
> entirely.  (And rely on hardware traps w/exceptions instead.)  I don't
> really see a way to represent that type of construct given the current
> proposal.
>
> This is a deeper problem and to solve it you'd need a solution to trapping
> in general.  Let's consider the case of Java.  A Java program may want to
> catch the arithmetic exception due to divide by zero.  How would you do this
> with a trap in LLVM IR?  Spill all state that is live at the catch?  Use a
> patchpoint for the entire division instruction?
>
> We'd likely use something similar to a patchpoint.  You'd need the "abstract
> vm state" (which is not the compiled frame necessarily) available at the div
> instruction.  You could then re-enter the interpreter at the specified index
> (part of the vm state).  We have almost all of these mechanisms in place.
> Ideally, you'd trigger a recompile and otherwise ensure re-entry into
> compiled code at the soonest possible moment.
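>
> For concreteness, a rough sketch of what that might look like (the id, byte
> count, %div_stub, and the %bci/%locals state values are all hypothetical;
> the live abstract-state values ride along after the two call arguments):
>
>   %res = call i64 (i64, i32, i8*, i32, ...)*
>            @llvm.experimental.patchpoint.i64(i64 42, i32 12, i8* %div_stub,
>                                              i32 2, i64 %a, i64 %b,
>                                              i64 %bci, i64 %locals)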
>
> This requires a lot of runtime support, but we already have most of it
> implemented for another compiler.  From our perspective, the runtime
> requirements are not a major blocker.
>
> In a lot of languages, a divide produces some result even in the exceptional
> case, and that result effectively requires deoptimizing, since it won't be
> the one you would have predicted (double instead of int, or BigInt instead
> of small int) - which sort of means that if the CPU exception occurs you
> have to be able to reconstruct all state.  A patchpoint could do this,
> and so could spilling all state to the stack before the divide - but both
> are very heavy hammers that are sure to be more expensive than just doing a
> branch.
>
> This isn't necessarily as expensive as you might believe.  I'd recommend
> reading the Graal project papers on this topic.
>
> Whether deopt or branching is more profitable *in this case*, I can't easily
> say.  I'm not yet to the point of being able to run that experiment.  We can
> argue about what "should" be better all we want, but real performance data
> is the only way to truly know.
>
> For these reasons, WebKit has long gone with the approach of emitting
> branches even in our custom JITs; they are cheaper than the costs incurred
> by spilling or other craziness.  My bet is that the cheapest implementation
> in LLVM will involve some kind of branching on x86.
>
> You may be right.  I simply don't want to design a system which assumes you
> are until I have data in hand.
>
>
>
> Another point that is bothering me is how these intrinsics will interact
> with loop invariant code motion and other condition check optimizations.
> We have the potential to hide control flow from optimizations which would
> otherwise remove it.  I'm not stating these aren't beneficial.  I'm simply
> stating that I remain somewhat unconvinced.  They seem like clear wins on
> some architectures (e.g. ARM, where the hardware includes the specific
> checks already), but not on others (e.g. x86 with its exceptions).
>
> I don't follow this.  One of the proposed intrinsic's advantages is
> precisely to help with things like loop invariant code motion by not
> emitting confusing control flow early.
>
> Er, not sure about this.  Consider:
> for (int i = 0; i < n; i++) {
>   a[i] = a[i] / b;
> }
>
> Yes, if b is a constant the conditions immediately fall away.
>
> If you can't prove the bounds on b, you'd still like to transform this to:
> if (b == 0) throw;
> for (int i = 0; i < n; i++) {
>   if (a[i] == INT_MIN && b == -1) throw;
>   a[i] = a[i] / b;
> }
>
> You might even want to split the loop into two versions (one for b == -1,
> one for everything else).  These transforms wouldn't matter on ARM, but do
> matter on x86.
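>
> For concreteness, here's roughly what the loop body looks like under the
> current proposal (signature sketched from my reading of it; assume the i1
> is set on either corner case):
>
>   loop:
>     %ai = load i32* %a.i
>     %res = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %ai, i32 %b)
>     %fail = extractvalue { i32, i1 } %res, 1
>     br i1 %fail, label %throw, label %cont
>   cont:
>     %q = extractvalue { i32, i1 } %res, 0
>     store i32 %q, i32* %a.i
>     ...
>
> The b == 0 half of %fail depends only on the loop-invariant %b, but it's
> fused into a single opaque i1 - which is exactly the hoisting problem.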
>
> Even if you're using a deopt scheme mentioned above, this could still be
> profitable.  The frequencies of the two edge cases aren't necessarily
> correlated.  (e.g. you could be trying to divide by zero a lot)
>
>
> As for condition check optimizations, the payoff is too low for me to
> consider.  It's true that if you can prove that one of the operands to a
> division is a constant, then you probably don't want to emit all of the
> branches, and surely the proposed intrinsic allows for this.  But you're
> less likely to see code dividing by the same value repeatedly.  Or so goes
> my experience with this, anyway.
>
> I suspect your workloads may differ from mine in this area.  I also need to
> restate: I haven't spent much time looking at this yet.  I don't actually
> know if any of this matters.  Looking through the source code for the
> existing compilers, I suspect it does.  But that's not conclusive evidence.
>
> p.s. w.r.t. terminology, I was being somewhat sloppy.  By condition check
> optimization, I meant everything from simplifycfg, cvp, and other such passes.
>
>
>
> Given the above, I'm going to withdraw my active support from the current
> proposal.  (Note: This does not mean that I'm going to actively oppose it
> either!)
>
> Let me make a counter proposal of my own.
>
> 1) Let's not add generic intrinsics at this time.  Instead, let's add
> CPU-specific intrinsics with the exact semantics of the associated div
> instructions.
>   a) for ARM, this would be the semantics of the currently proposed
> intrinsic (right?)
>   b) for x86, this would be an intrinsic defined to fault on the corner
> cases (the resume behaviour would not be specified for the moment); both
> are sketched below
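>
> Concretely, I'm imagining declarations along these lines (names invented
> purely for illustration):
>
>   declare i32 @llvm.arm.safe.sdiv.i32(i32, i32)
>     ; ARM hardware semantics: x/0 == 0, INT_MIN/-1 == INT_MIN
>   declare i32 @llvm.x86.sdiv.trap.i32(i32, i32)
>     ; x86 semantics: faults on both corner cases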
>
> I don't like 1.b because the only thing you'd be able to do with the fault
> in the runtime is:
>
> - Unwind the entire function if you're careful enough and you don't mind
> losing all of its state.  This would only apply to Java-like languages, and
> only if they didn't have any finally/catch(all)/catch(arithmetic) statements
> in the surrounding scope.
>
> - "interpret" the division to get your own semantics and resume execution,
> if your language allows for division to return an integer in the corner
> cases (e.g. JavaScript's (a/b)|0).  I concede that a trapping intrinsic
> would allow you to make this one case work, in cases where your profiling
> told you that the corner cases are rare.
>
> - Arrange to spill all state to the stack just to be able to throw an
> exception and/or deoptimize - i.e. you're crazy enough to think that the
> branch you just saved was more expensive than a bunch of stack traffic.
>
> I believe that the currently proposed intrinsics give us more of what we
> want on x86, than does your proposal, since they are more general, and since
> the branches will be cheaper than spilling everything to the stack.
>
> I was trying to avoid getting into all of the details around deopt and
> resuming from traps.  :)  I agree this is a complex area, and frankly, it's
> not one I've explored enough in LLVM to have strong opinions on yet.
>
> To be clear, trap-based optimizations are something that we may eventually
> want to support but they require more magic unrelated to the specific issues
> of division.  I suspect that the nicest way of expressing trap-based
> optimizations in IR is to:
>
> - Have a branch in IR that is weighted in such a way as to convey to LLVM
> that we believe that one side is never taken and that therefore implementing
> that side in terms of traps is OK.
>
> - The thing being branched on is a predicate that is known to LLVM to be
> something that can be trapped on.  (A sketch follows below.)
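>
> To illustrate with the proposed intrinsic (the metadata form and weights
> are just a sketch):
>
>   %fail = extractvalue { i32, i1 } %res, 1
>   br i1 %fail, label %deopt, label %fast, !prof !0
>   ...
>   !0 = metadata !{metadata !"branch_weights", i32 1, i32 1048576}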
>
> I agree with the above.  I think.  (I reserve the right to change my mind
> later...)
>
> Notice that the proposed intrinsics are *the best* way of achieving this,
> once we have a way of optimizing such branches.  The LLVM backend will know
> that the i1's returned from the safe.div intrinsic are "trappable" on x86,
> and the frontend can arrange to weight the branches on those i1's
> appropriately if it wants a trap-based implementation.  Better yet, the
> frontend could simply feed whatever profiling it wants ("I saw the corner
> case X times and the main case Y times") and LLVM can decide what to do on
> its own.  The fact that the mechanism by which execution is diverted to the
> "trapping" basic block is by way of a trap could even be contained entirely
> within LLVM's innards.
>
> I agree this sounds like a potentially workable design.
>
> OT: Do we want to be implementing the trapping logic inside LLVM?  I would
> guess not.
>
> I actually agree with you on the general approach.  However, I'm not sure
> we're ready to get into this yet.  That's why I was proposing a simpler
> starting point.
>
> My concern is how to balance the IR-level CFG optimizations above
> with the trapping semantics and profiling data.  Until someone actually
> builds something, this is all *really* speculative.  I'd like to not lock
> down a design until we have *data* and *experience in LLVM*.
>
> This could be used to support not only the unwinding semantics in Java on
> div/0, but also the (a/b)|0 in JS, and all of the others.  But first we need
> an intrinsic that reveals some predicate that is associated with the
> trapping condition, rather than having an intrinsic that is defined to
> simply trap.
>
> (Notice also how my approach to exposing trapping in LLVM could handle
> things like null checks - if LLVM sees a branch on a pointer to see if it's
> null in the close vicinity of a memory access, and the branch is weighted
> 100% in favor of non-null, then it can "fold" the branch and load together
> and use a trap.  I imagine this being super easy to implement and I'd be
> happy to say more about how I'd do it.  So, to me, the issue with division
> isn't that we don't know how to trap on it - it's that we don't have a
> crystal-clear way of revealing the trapping conditions in IR.)
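>
> A sketch of that null-check case (weights again placeholders):
>
>   %isnull = icmp eq i32* %p, null
>   br i1 %isnull, label %handle.null, label %notnull, !prof !1
>   notnull:
>     %v = load i32* %p  ; weighted ~100% non-null, so the backend could
>                        ; elide the branch and let the access itself trap
>   ...
>   !1 = metadata !{metadata !"branch_weights", i32 1, i32 1048576}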
>
> Agreed, particularly with the last statement.
>
>
> 1b) Teach the optimizer about the semantics of each - if we do go with a
> unified safe.div intrinsic later, this effort could be mostly reused.
>
> Except for x86, where your proposal above doesn't work.
>
> Er, how so?  Same as the reasons we've discussed?  Or different?
>
>
> 2) Add optimizations (fairly late) which fold generic-div+control flow into
> machine specific div instructions.
>
> This is purely aspirational.  There are many ways of doing the control flow
> for the division since there is quite a bit of it.  Are you going to pattern
> match all of them?  What if you miss a case because one of the many IR
> optimizations transformed it into something that is no longer recognizable?
> Or are you going to impose rules for canonicalizing any control flow around
> any division to ensure that a div-by-zero and INT_MIN/-1 check never stops
> looking like the pattern that you wanted it to look like?
>
> To me this sounds kind of crazy.  Once the frontend has determined that it
> wants safe division, it should be able to say this explicitly enough that
> this information is preserved through compilation.
>
> You don't need to recognize all patterns; you don't even want to.  See above
> for LICM and related.
>
> I suspect this is less hard than it initially sounds.  Particularly since
> you really only need to identify cases in hot code.  If the control flow was
> moved outside of the loop, it is almost certainly not hot any more.
>
> Note: This scheme is definitely biased towards x86.  It probably is NOT the
> optimal scheme for ARM.  I freely acknowledge that.  I'm just not happy with
> the way the previous proposal seems to prefer ARM over x86.
>
>
> 3) Frontends that want machine-specific semantics (i.e. trap or defined edge
> cases) can use the machine-specific intrinsics directly.
> 4) We provide a set of IR helper functions which implement safe division in
> what we believe to be the best way.  This can be different for each
> supported platform.  This could either be in the form of sample IR (see the
> sketch below), or even as functions on IRBuilder.
> 5) We spend some time optimizing the IR generated by (4) and use that to
> inform our discussion about what is truly necessary from a performance
> standpoint.
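>
> As a sketch of (4), here's roughly the expansion such a helper might emit
> on x86 (what the corner cases produce is language-specific; the 0 and
> INT_MIN below are placeholders):
>
>   entry:
>     %den.zero = icmp eq i32 %b, 0
>     br i1 %den.zero, label %div.zero, label %check.ovf
>   check.ovf:
>     %num.min = icmp eq i32 %a, -2147483648
>     %den.neg1 = icmp eq i32 %b, -1
>     %ovf = and i1 %num.min, %den.neg1
>     br i1 %ovf, label %div.ovf, label %div.do
>   div.do:
>     %q = sdiv i32 %a, %b
>     br label %div.join
>   div.zero:                        ; or throw, per the language
>     br label %div.join
>   div.ovf:                         ; or throw, per the language
>     br label %div.join
>   div.join:
>     %r = phi i32 [ %q, %div.do ], [ 0, %div.zero ],
>                  [ -2147483648, %div.ovf ]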
>
> Once this is in place, we can come back to the discussion of what an
> appropriate generic safe div intrinsic would look like.  In particular, we
> would have the performance data at hand to lay to rest the concerns raised
> by myself and Eric.  My strong suspicion is that the counter proposal will
> be "good enough" in practice for most users.
>
> p.s. If we do go ahead with the current proposal, I'm going to *strongly*
> request they be marked experimental.  I think we'll need to iterate on these
> a few times.
>
> Philip