[LLVMdev] Proposal: add intrinsics for safe division

Filip Pizlo fpizlo at apple.com
Tue Apr 29 12:39:56 PDT 2014



On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at philipreames.com) wrote:


On 04/29/2014 10:44 AM, Filip Pizlo wrote:
TL;DR: Your desire to use trapping on x86 only further convinces me that Michael's proposed intrinsics are the best way to go.
I'm still not convinced, but am not going to actively oppose it either.  I'm leery of designing a solution with major assumptions we don't have data to back up. 

I worry your assumptions about deoptimization are potentially unsound.  But I don't have data to actually show this (yet).
I *think* I may have been unclear about my assumptions; in particular, my claims with respect to deoptimization are probably more subtle than they appeared.  WebKit can use LLVM, it has divisions, and we do all of the usual deoptimization/profiling/etc. tricks for them, so this is grounded in experience.  Forgive me if the rest of this e-mail contains a lecture on things that are obvious - I'll try to err on the side of clarity and completeness, since this discussion is dense enough that we run the risk of talking at cross-purposes unless some baseline assumptions are established.




On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com) wrote:

As the discussion has progressed and I've spent more time thinking about the topic, I find myself less and less enthused about the current proposal.  I'm in full support of having idiomatic ways to express safe division, but I'm starting to doubt that using an intrinsic is the right way at the moment.

One case I find myself thinking about is how one would combine profiling information and implicit div-by-zero/overflow checks with this proposal.  I don't really see a clean way.  Ideally, for a "safe div" which never has the exceptional paths taken, you'd like to completely do away with the control flow entirely.  (And rely on hardware traps w/exceptions instead.)  I don't really see a way to represent that type of construct given the current proposal. 
This is a deeper problem and to solve it you'd need a solution to trapping in general.  Let's consider the case of Java.  A Java program may want to catch the arithmetic exception due to divide by zero.  How would you do this with a trap in LLVM IR?  Spill all state that is live at the catch?  Use a patchpoint for the entire division instruction?

We'd likely use something similar to a patchpoint.  You'd need the "abstract vm state" (which is not necessarily the compiled frame) available at the div instruction.  You could then re-enter the interpreter at the specified index (part of the vm state).  We have almost all of these mechanisms in place.  Ideally, you'd trigger a recompile and otherwise ensure re-entry into compiled code at the soonest possible moment. 

This requires a lot of runtime support, but we already have most of it implemented for another compiler.  From our perspective, the runtime requirements are not a major blocker. 
Right, you'll use a patchpoint.  That's way more expensive than using a safe division intrinsic with branches, because it's opaque to the optimizer.


In a lot of languages, a divide produces some result even in the exceptional case, and that result requires effectively deoptimizing, since it won't be the one you would have predicted (double instead of int, or BigInt instead of small int).  That more or less means that if the CPU exception occurs, you have to be able to reconstruct all state.  A patchpoint could do this, and so could spilling all state to the stack before the divide - but both are very heavy hammers that are sure to be more expensive than just doing a branch.

This isn't necessarily as expensive as you might believe.  I'd recommend reading the Graal project papers on this topic.

Whether deopt or branching is more profitable *in this case*, I can't easily say.  I'm not yet to the point of being able to run that experiment.  We can argue about what "should" be better all we want, but real performance data is the only way to truly know. 
My point may have been confusing.  I know that deoptimization is cheap and WebKit uses it everywhere, including division corner cases, if profiling tells us that it's profitable to do so (which it does, in the common case).  WebKit is a heavy user of deoptimization in general, so you don't need to convince me that it's worth it.

Note that I want *both* deopt *and* branching, because in this case, a branch is the fastest overall way of detecting when to deopt.  In the future, I will want to implement the deopt in terms of trapping, and when we do that, I believe that the most sound and performant approach would be using Michael's intrinsics.  This is subtle and I'll try to explain why it's the case.

The point is that you wouldn't want to do deoptimization by spilling state on the main path or by using a patchpoint for the main path of the division.

You don't want the common path of executing the division to involve a patchpoint instruction, although using a patchpoint or stackmap to implement deoptimization on the failing path is great:

Good: if (division would fail) { call @patchpoint(all of my state) } else { result = a / b }

Bad: call @patchpoint(all of my state) // patch with a divide instruction - bad because the optimizer has no clue what you're doing and assumes the very worst

Worse: spill all state to the stack; call @trapping.div(a, b) // the spills will hurt you far more than a branch, so this should be avoided

I suppose we could imagine a fourth option that involves a patchpoint to pick up the state and a trapping divide intrinsic.  But a trapping divide intrinsic alone is not enough.  Consider this:

result = call @trapping.div(a, b); call @stackmap(all of my state)

As soon as these are separate instructions, you have no guarantees that the state that the stackmap reports is sound for the point at which the div would trap.  So, the division itself shouldn't be a trapping instruction and instead you want to detect the bad case with a branch.
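
To make the "good" idiom concrete, here is a rough sketch of what it could look like in (roughly 2014-era) IR.  This is only illustrative: the patchpoint id, the reserved byte count, the i64 %vmstate values standing in for the abstract VM state, and the use of a null target that the runtime later patches with a jump to its OSR exit handler are all placeholders, not something the proposal mandates.

    declare void @llvm.experimental.patchpoint.void(i64, i32, i8*, i32, ...)

    define i32 @checked_sdiv(i32 %a, i32 %b, i64 %vmstate0, i64 %vmstate1) {
    entry:
      ; i32 sdiv corner cases: divisor of zero, or INT_MIN / -1 overflow.
      %div.by.zero = icmp eq i32 %b, 0
      %is.int.min  = icmp eq i32 %a, -2147483648
      %is.neg.one  = icmp eq i32 %b, -1
      %overflows   = and i1 %is.int.min, %is.neg.one
      %corner      = or i1 %div.by.zero, %overflows
      br i1 %corner, label %deopt, label %fast

    fast:
      ; Common path: a plain sdiv that the optimizer can reason about freely.
      %q = sdiv i32 %a, %b
      ret i32 %q

    deopt:
      ; Rare path: record the abstract VM state and leave optimized code.  The
      ; runtime patches the reserved bytes with a jump to its OSR exit handler.
      call void (i64, i32, i8*, i32, ...)*
        @llvm.experimental.patchpoint.void(i64 42, i32 15, i8* null, i32 0,
                                           i64 %vmstate0, i64 %vmstate1)
      unreachable
    }

The important property is that the state capture happens on the branch's slow path, before anything can trap, so the recorded state is always sound, and the division on the fast path stays an ordinary sdiv.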

To be clear:

- Whether you use deoptimization for division or anything else - like WebKit has done since before any of the Graal papers were written - is mostly unrelated to how you represent the division, unless you wanted to add a new intrinsic that is like a trapping-division-with-stackmap:

result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)

Now, maybe you do want such an intrinsic, in which case you should propose it!  The reason why I haven't proposed it is that I think that long-term, the currently proposed intrinsics are a better path to getting the trapping optimizations.  See my previous mail, where I show how we could tell LLVM what the failing path is (which may have deoptimization code that uses a stackmap or whatever), what the trapping predicate is (it comes from the safe.div intrinsic), and the fact that trapping is wise (branch weights).

- If you want to do the deoptimization with a trap, then your only choice currently is to use a patchpoint for the main path of the division.  This will be slower than using a branch to an OSR exit basic block, because you're making the division itself opaque to the optimizer (bad) just to get rid of a branch (which was probably cheap to begin with).

Hence, what you want to do - one way or another, regardless of whether this proposed intrinsic is added - is to branch on the corner case condition, and have the slow case of the branch go to a basic block that deoptimizes.  Unless of course you have profiling that says that the case does happen often, in which case you can have that basic block handle the corner case inline without leaving optimized code (FWIW, we do have such paths in WebKit and they are useful).
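
As a rough sketch of that kind of inline slow path (the 0 and INT_MIN results below are just stand-ins for whatever the source language actually defines for the corner cases):

    define i32 @div_with_inline_slow_path(i32 %a, i32 %b) {
    entry:
      %div.by.zero = icmp eq i32 %b, 0
      %is.int.min  = icmp eq i32 %a, -2147483648
      %is.neg.one  = icmp eq i32 %b, -1
      %overflows   = and i1 %is.int.min, %is.neg.one
      %corner      = or i1 %div.by.zero, %overflows
      br i1 %corner, label %slow, label %fast

    fast:
      %q = sdiv i32 %a, %b
      br label %done

    slow:
      ; Handle the corner cases inline, without leaving optimized code.
      %slow.result = select i1 %div.by.zero, i32 0, i32 -2147483648
      br label %done

    done:
      %result = phi i32 [ %q, %fast ], [ %slow.result, %slow ]
      ret i32 %result
    }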

So the question for me is whether the branching involves explicit control flow or is hidden inside an intrinsic.  I prefer for it to be within an intrinsic (sketched below, after this list) because it:

- allows the optimizer to do more interesting things in the common cases, like hoisting the entire division.

- will give us a clearer path for implementing trapping optimizations in the future.

- is an immediate win on ARM.
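
For illustration, here is roughly what the intrinsic-based form could look like.  This assumes - and this is my reading of the proposal, not a settled signature - that safe.sdiv returns a {result, failed} pair in the style of the *.with.overflow intrinsics:

    declare { i32, i1 } @llvm.safe.sdiv.i32(i32, i32)

    define i32 @guarded_sdiv(i32 %a, i32 %b) {
    entry:
      %pair   = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b)
      %failed = extractvalue { i32, i1 } %pair, 1
      br i1 %failed, label %deopt, label %ok, !prof !0

    ok:
      %q = extractvalue { i32, i1 } %pair, 0
      ret i32 %q

    deopt:
      ; OSR exit via stackmap/patchpoint, as in the earlier sketch.
      unreachable
    }

    ; Branch weights telling LLVM the corner case is very rare - the same
    ; "trapping is wise" hint mentioned above.
    !0 = metadata !{metadata !"branch_weights", i32 1, i32 1048576}

The division and its corner-case check are a single value computation plus one branch, which is what lets the optimizer hoist or combine the whole thing, and the branch weights are exactly the hint a future trapping implementation could key off of.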

I'd be curious to hear what specific idea you have about how to implement trap-based deoptimization with your trapping division intrinsic for x86 - maybe it's different from the two "bad" idioms I showed above.

Finally, as for performance data, which part of this do you want performance data for?  I concede that I don't have performance data for using Michael's new intrinsic.  Part of what the intrinsic accomplishes is giving a less ugly way of doing something that is already possible with target intrinsics on ARM.  I think it would be great if you could get those semantics - along with a known-good implementation - on other architectures as well.

But this discussion has also involved suggestions that we should use trapping to implement deoptimization, and the main point of my message is to strongly argue against anything like this given the current state of trapping semantics and how patchpoints work.  My point is that using traps for division corner cases would either be unsound (see the stackmap after the trap, above), or would require you to do things that are obviously inefficient.  If you truly believe that the branch to detect division slow paths is more expensive than spilling all bytecode state to the stack or using a patchpoint for the division, then I could probably hack something up in WebKit to show you the performance implications.  (Or you could do it yourself, the code is open source...)

-Filip