[LLVMdev] Proposal: add intrinsics for safe division
Philip Reames
listmail at philipreames.com
Tue Apr 29 13:58:38 PDT 2014
Filip,
Thanks for spelling this out. I'm out of time to respond right now, but
I'll try to get back to you later today or tomorrow morning. After a
quick read through, I don't think you've changed my opinion, but I need
to read through what you wrote more carefully before responding.
Philip
On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>
>
> On April 29, 2014 at 11:27:06 AM, Philip Reames
> (listmail at philipreames.com <mailto:listmail at philipreames.com>) wrote:
>
>>
>> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>> TL;DR: Your desire to use trapping on x86 only further convinces me
>>> that Michael's proposed intrinsics are the best way to go.
>> I'm still not convinced, but am not going to actively oppose it
>> either. I'm leery of designing a solution with major assumptions we
>> don't have data to backup.
>>
>> I worry your assumptions about deoptimization are potentially
>> unsound. But I don't have data to actually show this (yet).
>
> I *think* I may have been unclear about my assumptions; in particular,
> my claims with respect to deoptimization are probably more subtle than
> they appeared. WebKit can use LLVM, it compiles divisions, and we
> apply all of the usual deoptimization/profiling/etc. tricks to them,
> so this is grounded in experience. Forgive me if the rest of this e-mail
> contains a lecture on things that are obvious - I'll try to err on the
> side of clarity and completeness since this discussion is sufficiently
> dense that we run the risk of talking cross-purposes unless some
> baseline assumptions are established.
>
>>
>>
>>>
>>> On April 29, 2014 at 10:09:49 AM, Philip Reames
>>> (listmail at philipreames.com <mailto:listmail at philipreames.com>) wrote:
>>>
>>>> As the discussion has progressed and I've spent more time thinking
>>>> about the topic, I find myself less and less enthused about the
>>>> current proposal. I'm in full support of having idiomatic ways to
>>>> express safe division, but I'm starting to doubt that using an
>>>> intrinsic is the right way at the moment.
>>>>
>>>> One case I find myself thinking about is how one would combine
>>>> profiling information and implicit div-by-zero/overflow checks with
>>>> this proposal. I don't really see a clean way. Ideally, for a
>>>> "safe div" which never has the exceptional paths taken, you'd like
>>>> to completely do away with the control flow entirely. (And rely on
>>>> hardware traps w/exceptions instead.) I don't really see a way to
>>>> represent that type of construct given the current proposal.
>>>
>>> This is a deeper problem and to solve it you'd need a solution to
>>> trapping in general. Let's consider the case of Java. A Java
>>> program may want to catch the arithmetic exception due to divide by
>>> zero. How would you do this with a trap in LLVM IR? Spill all
>>> state that is live at the catch? Use a patchpoint for the entire
>>> division instruction?
>>>
>> We'd likely use something similar to a patchpoint. You'd need the
>> "abstract vm state" (which is not the compiled frame necessarily)
>> available at the div instruction. You could then re-enter the
>> interpreter at the specified index (part of the vm state). We have
> almost all of these mechanisms in place. Ideally, you'd trigger a
>> recompile and otherwise ensure re-entry into compiled code at the
>> soonest possible moment.
>>
>> This requires a lot of runtime support, but we already have most of
>> it implemented for another compiler. From our perspective, the
>> runtime requirements are not a major blocker.
>
> Right, you'll use a patchpoint. That's way more expensive than using
> a safe division intrinsic with branches, because it's opaque to the
> optimizer.
>
>>>
>>> In a lot of languages, a divide produces some result even in the
>>> exceptional case and this result requires effectively deoptimizing
> since the result won't be the one you would have predicted (double
>>> instead of int, or BigInt instead of small int), which sort of means
>>> that if the CPU exception occurs you have to be able to reconstruct
>>> all state. A patchpoint could do this, and so could spilling all
>>> state to the stack before the divide - but both are very heavy
>>> hammers that are sure to be more expensive than just doing a branch.
>>>
>> This isn't necessarily as expensive as you might believe. I'd
>> recommend reading the Graal project papers on this topic.
>>
>> Whether deopt or branching is more profitable *in this case*, I can't
>> easily say. I'm not yet to the point of being able to run that
>> experiment. We can argue about what "should" be better all we want,
>> but real performance data is the only way to truly know.
>
> My point may have been confusing. I know that deoptimization is cheap
> and WebKit uses it everywhere, including division corner cases, if
> profiling tells us that it's profitable to do so (which it does, in
> the common case). WebKit is a heavy user of deoptimization in
> general, so you don't need to convince me that it's worth it.
>
> Note that I want *both* deopt *and* branching, because in this case, a
> branch is the fastest overall way of detecting when to deopt. In the
> future, I will want to implement the deopt in terms of branching, and
> when we do this, I believe that the most sound and performant approach
> would be using Michael's intrinsics. This is subtle and I'll try to
> explain why it's the case.
>
> The point is that you wouldn't want to do deoptimization by spilling
> state on the main path or by using a patchpoint for the main path of
> the division.
>
> You don't want the common path of executing the division to involve a
> patchpoint instruction, although using a patchpoint or stackmap to
> implement deoptimization on the failing path is great:
>
> *Good:* if (division would fail) { call @patchpoint(all of my state) }
> else { result = a / b }
>
> *Bad:* call @patchpoint(all of my state) // patch with a divide
> instruction - bad because the optimizer has no clue what you're doing
> and assumes the very worst
>
> *Worse:* spill all state to the stack; call @trapping.div(a, b) // the
> spills will hurt you far more than a branch, so this should be avoided
> I suppose we could imagine a fourth option that involves a patchpoint
> to pick up the state and a trapping divide instrinsic. But a trapping
> divide intrinsic alone is not enough. Consider this:
>
> result = call @trapping.div(a, b); call @stackmap(all of my state)
>
> As soon as these are separate instructions, you have no guarantees
> that the state that the stackmap reports is sound for the point at
> which the div would trap. So, the division itself shouldn't be a
> trapping instruction and instead you want to detect the bad case with
> a branch.
>
> To be clear:
>
> - Whether you use deoptimization for division or anything else - like
> WebKit has done since before any of the Graal papers were written - is
> mostly unrelated to how you represent the division, unless you wanted
> to add a new intrinsic that is like a trapping-division-with-stackmap:
>
> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>
> Now, maybe you do want such an intrinsic, in which case you should
> propose it! The reason why I haven't proposed it is that I think that
> long-term, the currently proposed intrinsics are a better path to
> getting the trapping optimizations. See my previous mail, where I
> show how we could tell LLVM what the failing path is (which may have
> deoptimization code that uses a stackmap or whatever), what the
> trapping predicate is (it comes from the safe.div intrinsic), and the
> fact that trapping is wise (branch weights).
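> For clarity, here is a rough C model of what the proposed safe.div
> intrinsic computes - a quotient plus a flag marking the corner cases,
> so the caller can branch to its deopt/OSR-exit block. The struct name
> and the choice of placeholder quotient are illustrative, not LLVM's
> actual interface:

```c
#include <limits.h>
#include <stdbool.h>

/* Illustrative model of a safe signed division primitive: the result
 * carries both the quotient and a "corner case happened" flag. */
struct safe_div_result {
    int quotient;
    bool failed;
};

static struct safe_div_result safe_sdiv(int a, int b)
{
    struct safe_div_result r;
    if (b == 0 || (a == INT_MIN && b == -1)) {
        r.quotient = 0;  /* placeholder; not trusted when failed is set */
        r.failed = true; /* caller branches to its slow/deopt block */
    } else {
        r.quotient = a / b;
        r.failed = false;
    }
    return r;
}
```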
>
> - If you want to do the deoptimization with a trap, then your only
> choice currently is to use a patchpoint for the main path of the
> division. This will be slower than using a branch to an OSR exit
> basic block, because you're making the division itself opaque to the
> optimizer (bad) just to get rid of a branch (which was probably cheap
> to begin with).
>
> Hence, what you want to do - one way or another, regardless of whether
> this proposed intrinsic is added - is to branch on the corner case
> condition, and have the slow case of the branch go to a basic block
> that deoptimizes. Unless of course you have profiling that says that
> the case does happen often, in which case you can have that basic
> block handle the corner case inline without leaving optimized code
> (FWIW, we do have such paths in WebKit and they are useful).
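> A sketch of that profiling-driven alternative - handling the corner
> case inline on the cold branch rather than deoptimizing. The semantics
> chosen here (x/0 yields 0, INT_MIN/-1 wraps) are purely illustrative
> of "staying in compiled code"; a real language runtime would
> substitute its own rules:

```c
#include <limits.h>

/* When profiling says the corner cases actually happen, handle them
 * inline instead of leaving optimized code. The results below are
 * hypothetical language semantics, chosen only for illustration. */
int div_inline_slow_path(int a, int b)
{
    if (b == 0)
        return 0;       /* handled inline, no deopt */
    if (a == INT_MIN && b == -1)
        return INT_MIN; /* wraparound, avoids the x86 #DE trap */
    return a / b;
}
```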
>
> So the question for me is whether the branching involves explicit
> control flow or is hidden inside an intrinsic. I prefer for it to be
> within an intrinsic because it:
>
> - allows the optimizer to do more interesting things in the common
> cases, like hoisting the entire division.
>
> - will give us a clearer path for implementing trapping optimizations
> in the future.
>
> - is an immediate win on ARM.
>
> I'd be curious to hear what specific idea you have about how to
> implement trap-based deoptimization with your trapping division
> intrinsic for x86 - maybe it's different from the two "bad" idioms I
> showed above.
>
> Finally, as for performance data, which part of this do you want
> performance data for? I concede that I don't have performance data
> for using Michael's new intrinsic. Part of what the intrinsic
> accomplishes is it gives a less ugly way of doing something that is
> already possible with target intrinsics on ARM. I think it would be
> great if you could get those semantics - along with a known-good
> implementation - on other architectures as well.
>
> But this discussion has also involved suggestions that we should use
> trapping to implement deoptimization, and the main point of my message
> is to strongly argue against anything like this given the current
> state of trapping semantics and how patchpoints work. My point is
> that using traps for division corner cases would either be unsound
> (see the stackmap after the trap, above), or would require you to do
> things that are obviously inefficient. If you truly believe that the
> branch to detect division slow paths is more expensive than spilling
> all bytecode state to the stack or using a patchpoint for the
> division, then I could probably hack something up in WebKit to show
> you the performance implications. (Or you could do it yourself, the
> code is open source...)
>
> -Filip
>