[LLVMdev] Proposal: add intrinsics for safe division
Philip Reames
listmail at philipreames.com
Tue Apr 29 13:58:38 PDT 2014
Filip,
Thanks for spelling this out. I'm out of time to respond right now, but
I'll try to get back to you later today or tomorrow morning. After a
quick read through, I don't think you've changed my opinion, but I need
to read through what you wrote more carefully before responding.
Philip
On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>
>
> On April 29, 2014 at 11:27:06 AM, Philip Reames
> (listmail at philipreames.com <mailto:listmail at philipreames.com>) wrote:
>
>>
>> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>> TL;DR: Your desire to use trapping on x86 only further convinces me
>>> that Michael's proposed intrinsics are the best way to go.
>> I'm still not convinced, but am not going to actively oppose it
>> either. I'm leery of designing a solution with major assumptions we
>> don't have data to backup.
>>
>> I worry your assumptions about deoptimization are potentially
>> unsound. But I don't have data to actually show this (yet).
>
> I *think* I may have been unclear about my assumptions; in particular,
> my claims with respect to deoptimization are probably more subtle than
> they appeared. WebKit can use LLVM, it compiles divisions, and we
> apply all of the usual deoptimization/profiling/etc. tricks to them,
> so this is grounded in experience. Forgive me if the rest of this e-mail
> contains a lecture on things that are obvious - I'll try to err on the
> side of clarity and completeness since this discussion is sufficiently
> dense that we run the risk of talking cross-purposes unless some
> baseline assumptions are established.
>
>>
>>
>>>
>>> On April 29, 2014 at 10:09:49 AM, Philip Reames
>>> (listmail at philipreames.com <mailto:listmail at philipreames.com>) wrote:
>>>
>>>> As the discussion has progressed and I've spent more time thinking
>>>> about the topic, I find myself less and less enthused about the
>>>> current proposal. I'm in full support of having idiomatic ways to
>>>> express safe division, but I'm starting to doubt that using an
>>>> intrinsic is the right way at the moment.
>>>>
>>>> One case I find myself thinking about is how one would combine
>>>> profiling information and implicit div-by-zero/overflow checks with
>>>> this proposal. I don't really see a clean way. Ideally, for a
>>>> "safe div" which never has the exceptional paths taken, you'd like
>>>> to completely do away with the control flow entirely. (And rely on
>>>> hardware traps w/exceptions instead.) I don't really see a way to
>>>> represent that type of construct given the current proposal.
>>>
>>> This is a deeper problem and to solve it you'd need a solution to
>>> trapping in general. Let's consider the case of Java. A Java
>>> program may want to catch the arithmetic exception due to divide by
>>> zero. How would you do this with a trap in LLVM IR? Spill all
>>> state that is live at the catch? Use a patchpoint for the entire
>>> division instruction?
>>>
>> We'd likely use something similar to a patchpoint. You'd need the
>> "abstract vm state" (which is not the compiled frame necessarily)
>> available at the div instruction. You could then re-enter the
>> interpreter at the specified index (part of the vm state). We have
> almost all of these mechanisms in place. Ideally, you'd trigger a
>> recompile and otherwise ensure re-entry into compiled code at the
>> soonest possible moment.
>>
>> This requires a lot of runtime support, but we already have most of
>> it implemented for another compiler. From our perspective, the
>> runtime requirements are not a major blocker.
>
> Right, you'll use a patchpoint. That's way more expensive than using
> a safe division intrinsic with branches, because it's opaque to the
> optimizer.
>
>>>
>>> In a lot of languages, a divide produces some result even in the
>>> exceptional case and this result requires effectively deoptimizing
> since the result won't be the one you would have predicted (double
>>> instead of int, or BigInt instead of small int), which sort of means
>>> that if the CPU exception occurs you have to be able to reconstruct
>>> all state. A patchpoint could do this, and so could spilling all
>>> state to the stack before the divide - but both are very heavy
>>> hammers that are sure to be more expensive than just doing a branch.
>>>
>> This isn't necessarily as expensive as you might believe. I'd
>> recommend reading the Graal project papers on this topic.
>>
>> Whether deopt or branching is more profitable *in this case*, I can't
>> easily say. I'm not yet to the point of being able to run that
>> experiment. We can argue about what "should" be better all we want,
>> but real performance data is the only way to truly know.
>
> My point may have been confusing. I know that deoptimization is cheap
> and WebKit uses it everywhere, including division corner cases, if
> profiling tells us that it's profitable to do so (which it does, in
> the common case). WebKit is a heavy user of deoptimization in
> general, so you don't need to convince me that it's worth it.
>
> Note that I want *both* deopt *and* branching, because in this case, a
> branch is the fastest overall way of detecting when to deopt. In the
> future, I will want to implement the deopt in terms of branching, and
> when we do this, I believe that the most sound and performant approach
> would be using Michael's intrinsics. This is subtle and I'll try to
> explain why it's the case.
>
> The point is that you wouldn't want to do deoptimization by spilling
> state on the main path or by using a patchpoint for the main path of
> the division.
>
> You don't want the common path of executing the division to involve a
> patchpoint instruction, although using a patchpoint or stackmap to
> implement deoptimization on the failing path is great:
>
> *Good:* if (division would fail) { call @patchpoint(all of my state) }
> else { result = a / b }
>
> *Bad:* call @patchpoint(all of my state) // patch with a divide
> instruction - bad because the optimizer has no clue what you're doing
> and assumes the very worst
>
> *Worse:* spill all state to the stack; call @trapping.div(a, b) // the
> spills will hurt you far more than a branch, so this should be avoided
> I suppose we could imagine a fourth option that involves a patchpoint
> to pick up the state and a trapping divide instrinsic. But a trapping
> divide intrinsic alone is not enough. Consider this:
>
> result = call @trapping.div(a, b); call @stackmap(all of my state)
>
> As soon as these are separate instructions, you have no guarantees
> that the state that the stackmap reports is sound for the point at
> which the div would trap. So, the division itself shouldn't be a
> trapping instruction and instead you want to detect the bad case with
> a branch.
>
> To be clear:
>
> - Whether you use deoptimization for division or anything else - like
> WebKit has done since before any of the Graal papers were written - is
> mostly unrelated to how you represent the division, unless you wanted
> to add a new intrinsic that is like a trapping-division-with-stackmap:
>
> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>
> Now, maybe you do want such an intrinsic, in which case you should
> propose it! The reason why I haven't proposed it is that I think that
> long-term, the currently proposed intrinsics are a better path to
> getting the trapping optimizations. See my previous mail, where I
> show how we could tell LLVM what the failing path is (which may have
> deoptimization code that uses a stackmap or whatever), what the
> trapping predicate is (it comes from the safe.div intrinsic), and the
> fact that trapping is wise (branch weights).
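> For clarity, here is a rough C model of what the proposed safe.div
> intrinsic computes - a quotient plus a flag marking the corner cases,
> so the caller can branch to its deopt/OSR-exit block. The struct name
> and the choice of placeholder quotient are illustrative, not LLVM's
> actual interface:

```c
#include <limits.h>
#include <stdbool.h>

/* Illustrative model of a safe signed division primitive: the result
 * carries both the quotient and a "corner case happened" flag. */
struct safe_div_result {
    int quotient;
    bool failed;
};

static struct safe_div_result safe_sdiv(int a, int b)
{
    struct safe_div_result r;
    if (b == 0 || (a == INT_MIN && b == -1)) {
        r.quotient = 0;  /* placeholder; not trusted when failed is set */
        r.failed = true; /* caller branches to its slow/deopt block */
    } else {
        r.quotient = a / b;
        r.failed = false;
    }
    return r;
}
```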
>
> - If you want to do the deoptimization with a trap, then your only
> choice currently is to use a patchpoint for the main path of the
> division. This will be slower than using a branch to an OSR exit
> basic block, because you're making the division itself opaque to the
> optimizer (bad) just to get rid of a branch (which was probably cheap
> to begin with).
>
> Hence, what you want to do - one way or another, regardless of whether
> this proposed intrinsic is added - is to branch on the corner case
> condition, and have the slow case of the branch go to a basic block
> that deoptimizes. Unless of course you have profiling that says that
> the case does happen often, in which case you can have that basic
> block handle the corner case inline without leaving optimized code
> (FWIW, we do have such paths in WebKit and they are useful).
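> A sketch of that profiling-driven alternative - handling the corner
> case inline on the cold branch rather than deoptimizing. The semantics
> chosen here (x/0 yields 0, INT_MIN/-1 wraps) are purely illustrative
> of "staying in compiled code"; a real language runtime would
> substitute its own rules:

```c
#include <limits.h>

/* When profiling says the corner cases actually happen, handle them
 * inline instead of leaving optimized code. The results below are
 * hypothetical language semantics, chosen only for illustration. */
int div_inline_slow_path(int a, int b)
{
    if (b == 0)
        return 0;       /* handled inline, no deopt */
    if (a == INT_MIN && b == -1)
        return INT_MIN; /* wraparound, avoids the x86 #DE trap */
    return a / b;
}
```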
>
> So the question for me is whether the branching involves explicit
> control flow or is hidden inside an intrinsic. I prefer for it to be
> within an intrinsic because it:
>
> - allows the optimizer to do more interesting things in the common
> cases, like hoisting the entire division.
>
> - will give us a clearer path for implementing trapping optimizations
> in the future.
>
> - is an immediate win on ARM.
>
> I'd be curious to hear what specific idea you have about how to
> implement trap-based deoptimization with your trapping division
> intrinsic for x86 - maybe it's different from the two "bad" idioms I
> showed above.
>
> Finally, as for performance data, which part of this do you want
> performance data for? I concede that I don't have performance data
> for using Michael's new intrinsic. Part of what the intrinsic
> accomplishes is it gives a less ugly way of doing something that is
> already possible with target intrinsics on ARM. I think it would be
> great if you could get those semantics - along with a known-good
> implementation - on other architectures as well.
>
> But this discussion has also involved suggestions that we should use
> trapping to implement deoptimization, and the main point of my message
> is to strongly argue against anything like this given the current
> state of trapping semantics and how patchpoints work. My point is
> that using traps for division corner cases would either be unsound
> (see the stackmap after the trap, above), or would require you to do
> things that are obviously inefficient. If you truly believe that the
> branch to detect division slow paths is more expensive than spilling
> all bytecode state to the stack or using a patchpoint for the
> division, then I could probably hack something up in WebKit to show
> you the performance implications. (Or you could do it yourself, the
> code is open source...)
>
> -Filip
>