[LLVMdev] Proposal: add intrinsics for safe division

Tue Apr 29 10:44:16 PDT 2014

LD;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go.

On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote:

As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal.  I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment.

One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal.  I don't really see a clean way.  Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely.  (And rely on hardware traps w/exceptions instead.)  I don't really see a way to represent that type of construct given the current proposal. 
This is a deeper problem and to solve it you'd need a solution to trapping in general.  Let's consider the case of Java.  A Java program may want to catch the arithmetic exception due to divide by zero.  How would you do this with a trap in LLVM IR?  Spill all state that is live at the catch?  Use a patchpoint for the entire division instruction?

In a lot of languages, a divide produces some result even in the exceptional case and this result requires effectively deoptimizing since the resut won't be the one you would have predicted (double instead of int, or BigInt instead of small int), which sort of means that if the CPU exception occurs you have to be able to reconstruct all state.  A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch.

For these reasons, WebKit has long gone with the approach of emitting branches even in our custom JITs; they are cheaper than the costs incurred by spilling or other craziness.  My bet is that the cheapest implementation in LLVM will involve some kind of branching on x86.

Another point that is bothering me is how these intrinsics will interact with loop invariant code motion, and other condition check optimizations.  We have the potential to hide control flow from optimizations which would otherwise remove it.  I'm not stating these aren't beneficial.  I'm simply stating that I remain somewhat unconvinced.  They seem like clear wins on some architectures (i.e. ARM, where the hardware includes the specific checks already), but not on others (i.e. x86 with it's exceptions). 
I don't follow this.  One of the proposed intrinsic's advantages is precisely to help with things like loop invariant code motion by not emitting confusing control flow early.

As for condition check optimizations, this has too low of profit for me to consider.  It's true that if you can prove that one of the operands to a division is a constant, then you probably don't want to emit all of the branches, and surely the proposed intrinsic allows for this.  But you're less likely to see code dividing by the same value repeatedly.  Or so goes my experience with this, anyway.

Given the above, I'm going withdraw my active support from the current proposal.  (Note: This does not mean that I'm going to actively oppose it either!)

Let me make a counter proposal of my own.

1) Let's not add generic intrinsics at this time.  Instead, let's add cpu specific intrinsics with the exact semantics of the associated div instructions.
  a) for ARM, this would be the semantics of the current proposed instruction (right?)
  b) for x86, this would be an instruction which is defined to fault on the corner cases (the resume behaviour would not be specified for the moment)
I don't like 1.b because the only thing you'd be able to do with the fault in the runtime is:

- Unwind the entire function if you're careful enough and you don't mind losing all of its state.  This woud only apply to Java-like languages, and only if they didn't have any finally/catch(all)/catch(arithmetic) statements in the surrounding scope.

- "interpret" the division to get your own semantics and resume execution, if your language allows for division to return an integer in the corner cases (i.e. JavaScript's (a/b)|0).  I concede that a trapping intrinsic would allow you to make this one case work, in cases where your profiling told you that the corner cases are rare.

- Arrange to spill all state to the stack just to be able to throw an exception and/or deoptimize - i.e. you're crazy enough to think that the branch you just saved was more expensive than a bunch of stack traffic.

I believe that the currently proposed intrinsics give us more of what we want on x86, than does your proposal, since they are more general, and since the branches will be cheaper than spilling everything to the stack.

To be clear, trap-based optimizations are something that we may eventually want to support but they require more magic unrelated to the specific issues of division.  I suspect that the nicest way of expressing trap-based optimizations in IR is to:

- Have a branch in IR that is weighted in such a way as to convey to LLVM that we believe that one side is never taken and that therefore implementing that side in terms of traps is OK.

- The thing being branched on is a predicate that is known to LLVM to be something that can be trapped on.

Notice that the proposed intrinsics are *the best* way of achieving this, once we have a way of optimizing such branches.  The LLVM backend will know that the i1's returned from the safe.div intrinsic are "trappable" on x86, and the frontend can arrange to weight the branches on those i1's appropriately if it wants a trap-based implementation.  Better yet, the frontend could simply feed whatever profiling it wants ("I saw the corner case X times and the main case Y times") and LLVM can decide what to do on its own.  The fact that the machanism by which execution is diverted to the "trapping" basic block is by way of a trap could even be contained entirely within LLVM's innards.

This could be used to support not only the unwinding semantics in Java on div/0, but also the (a/b)|0 in JS, and all of the others.  But first we need an intrinsic that reveals some predicate that is associated with the trapping condition, rather than having an intrinsic that is defined to simply trap.

(Notice also how my approach to exposing trapping in LLVM could handle things like null checks - if LLVM sees a branch on a pointer to see if it's null in the close vicinity of a memory access, and the branch is weighted 100% in favor of non-null, then it can "fold" the branch and load together and use a trap.  I imagine this being super easy to implement and I'd be happy to say more about how I'd do it.  So, to me, the issue with division isn't that we don't know how to trap on it - it's that we don't have a crystal-clear way of revealing the trapping conditions in IR.)

1b) Teach the optimizer about the semantics of each - if we do go with a unified safe.div instruction later, this effort could be mostly reused.
Except for x86, where your proposal above doesn't work.

2) Add optimizations (fairly late) which fold generic-div+control flow into machine specific div instructions.
This is purely aspirational.  There are many ways of doing the control flow for the division since there is quite a bit of it.  Are you going to pattern match all of them?  What if you miss a case because one of the many IR optimizations transformed it into something that is no longer recognizable?  Or are you going to impose rules for canonicalizing any control flow around any division to ensure that a div-by-zero and INT_MIN/-1 check never stops looking like the pattern that you wanted it to look like?

To me this sounds kind of crazy.  Once the frontend has determined that it wants safe division, it should be able to say this explicitly enough that this information is preserved through compilation.

3) Frontends that want machine specific semantics (i.e. trap or defined edge cases), can use the machine specific intrinsics directly. 
4) We provide a set of IR helper functions which implement safe division in what we believe to be the best way.  This can be different for each supported platform.  This could either be the form of sample IR, or even as functions on IRBuilder. 
5) We spend some time optimizing the IR generated by (4) and use that to inform our discussion about what is truly necessary from a performance standpoint. 

Once this is in place, we can come back to the discussion of what an appropriate generic safe div intrinsic would look like.  In particular, we would have the performance data at hand to lay to rest the concerns raised by myself and Eric.  My strong suspicion is that the counter proposal will be "good enough" in practice for most users. 

p.s. If we do go ahead with the current proposal, I'm going to *strongly* request they be marked experimental.  I think we'll need to iterate on these a few times. 

Philip

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20140429/da416eed/attachment.html>