[LLVMdev] Proposal: add intrinsics for safe division

Eric Christopher echristo at gmail.com
Fri May 2 12:00:38 PDT 2014


On Fri, May 2, 2014 at 11:58 AM, Philip Reames
<listmail at philipreames.com> wrote:
>
> On 05/02/2014 11:57 AM, Filip Pizlo wrote:
>
>
> On May 2, 2014 at 11:53:25 AM, Eric Christopher (echristo at gmail.com) wrote:
>
> On Wed, Apr 30, 2014 at 10:34 PM, Philip Reames
> <listmail at philipreames.com> wrote:
>> Andy - If you're not already following this closely, please start. We've
>> gotten into fairly fundamental questions of what a patchpoint does.
>>
>> Filip,
>>
>> I think you've hit the nail on the head. What I'm thinking of as being
>> patchpoints are not what you think they are. Part of that is that I've got
>> a local change which adds a very similar construction (called
>> "statepoints"
>> for the moment), but I was trying to keep that separate. That also
>> includes
>> a lot of GC semantics which are not under discussion currently. My
>> apologies if that experience bled over into this conversation and made
>> things more confusing.
>>
>> I will note that the documentation for patchpoint say explicitly the
>> following:
>> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function call
>> to
>> the specified <target> and records the location of specified values in the
>> stack map."
>>
>> My reading has always been that a patchpoint *that isn't patched* is
>> simply a call with a stackmap associated with it. On that reading, my
>> proposed usage would be (and always has been) legal.
>>
>
> I like the idea that the target can be assumed to be called. It makes
> optimization of the call possible, etc. I think it's definitely worth
> exploring before we lock down the patchpoint intrinsic.
>
> I will actively oppose this.
>
> I think it sounds like we need to split patchpoint into two cases.  I'm
> going to send a rough proposal for this later today.
>

A proposal definitely sounds interesting, but if it makes more sense to
keep it as a single intrinsic, then that's preferable. We've already got
two, in a kind of weird way, with patchpoint and stackmap.

-eric


> -Filip
>
>
>
>
> -eric
>
>> I will agree that I've confused the topic badly on the optimization front.
>> My "statepoint" isn't patchable, so a lot more optimizations are legal.
>> Sorry about that. To restate what I think you've been saying all along,
>> the
>> optimizer can't make assumptions about what function is called by a
>> patchpoint because that might change based on later patching. Is this the
>> key point you've been trying to make?
>>
>> I'm not objecting to separating "my patchpoint" from "your patchpoint".
>> Let's just hammer out the semantics of each first. :)
>>
>> Again, longer response to follow in a day or so. :)
>>
>> Philip
>>
>>
>> On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>>
>>
>>
>> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at philipreames.com)
>> wrote:
>>
>> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>>
>> On April 29, 2014 at 11:27:06 AM, Philip Reames
>> (listmail at philipreames.com)
>> wrote:
>>
>> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>>
>> TL;DR: Your desire to use trapping on x86 only further convinces me that
>> Michael's proposed intrinsics are the best way to go.
>>
>> I'm still not convinced, but am not going to actively oppose it either.
>> I'm
>> leery of designing a solution with major assumptions we don't have data
>> to back up.
>>
>> I worry your assumptions about deoptimization are potentially unsound. But
>> I don't have data to actually show this (yet).
>>
>> I *think* I may have been unclear about my assumptions; in particular, my
>> claims with respect to deoptimization are probably more subtle than they
>> appeared. WebKit can use LLVM and it has divisions and we do all possible
>> deoptimization/profiling/etc tricks for it, so this is grounded in
>> experience. Forgive me if the rest of this e-mail contains a lecture on
>> things that are obvious - I'll try to err on the side of clarity and
>> completeness since this discussion is sufficiently dense that we run the
>> risk of talking cross-purposes unless some baseline assumptions are
>> established.
>>
>> I think we're using the same terminology, but with slightly different sets
>> of assumptions. I'll point this out below where relevant.
>>
>> Also, thanks for taking the time to expand. It helps clarify the discussion
>> quite a bit.
>>
>> I think we may be converging to an understanding of what you want versus
>> what I want, and I think that there are some points - possibly unrelated
>> to
>> division - that are worth clarifying. I think that the main difference is
>> that when I say "patchpoint", I am referring to a concrete intrinsic with
>> specific semantics that cannot change without breaking WebKit, while you
>> are
>> using the term to refer to a broad concept, or rather, a class of
>> as-yet-unimplemented intrinsics that share some of the same features with
>> patchpoints but otherwise have incompatible semantics.
>>
>> Also, when I say that you wouldn't want to use the existing patchpoint to
>> do
>> your trapping deopt, what I mean is that the performance of doing this
>> would
>> suck for reasons that are not related to deoptimization or trapping. I'm
>> not claiming that deoptimization performs poorly (trust me, I know better)
>> or that trapping to deoptimize is bad (I've done this many, many times and
>> I
>> know better). I'm saying that with the current patchpoint intrinsics in
>> LLVM, as they are currently specified and implemented, doing it would be a
>> bad idea because you'd have to compromise a bunch of other optimizations
>> to
>> achieve it.
>>
>> You have essentially described new intrinsics that would make this less of
>> a
>> bad idea and I am interested in your intrinsics, so I'll try to both
>> respond
>> with why patchpoints don't currently give you what you want (and why
>> simply
>> changing patchpoint semantics would be evil) and I'll also try to comment
>> on
>> what I think of the intrinsic that you're effectively proposing. Long
>> story
>> short, I think you should formally propose your intrinsic even if it's not
>> completely fleshed out. I think that it's an interesting capability and in
>> its most basic form, it is a simple variation of the current
>> patchpoint/stackmap intrinsics.
>>
>>
>>
>>
>>
>> On April 29, 2014 at 10:09:49 AM, Philip Reames
>> (listmail at philipreames.com)
>> wrote:
>>
>> As the discussion has progressed and I've spent more time thinking about
>> the
>> topic, I find myself less and less enthused about the current proposal.
>> I'm
>> in full support of having idiomatic ways to express safe division, but I'm
>> starting to doubt that using an intrinsic is the right way at the moment.
>>
>> One case I find myself thinking about is how one would combine profiling
>> information and implicit div-by-zero/overflow checks with this proposal. I
>> don't really see a clean way. Ideally, for a "safe div" whose
>> exceptional paths are never taken, you'd like to do away with the
>> control flow entirely. (And rely on hardware traps w/exceptions
>> instead.) I don't
>> really see a way to represent that type of construct given the current
>> proposal.
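>>
>> (For reference, my understanding of the proposed form - the exact name
>> and signature may differ from Michael's actual patch - is an intrinsic
>> returning both the quotient and an edge-case flag:
>>
>>   %pair = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b)
>>   %res  = extractvalue { i32, i1 } %pair, 0
>>   %fail = extractvalue { i32, i1 } %pair, 1
>>
>> so the frontend branches on %fail instead of emitting the div-by-zero
>> and INT_MIN/-1 checks itself.)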
>>
>> This is a deeper problem and to solve it you'd need a solution to trapping
>> in general. Let's consider the case of Java. A Java program may want to
>> catch the arithmetic exception due to divide by zero. How would you do
>> this
>> with a trap in LLVM IR? Spill all state that is live at the catch? Use a
>> patchpoint for the entire division instruction?
>>
>> We'd likely use something similar to a patchpoint. You'd need the
>> "abstract
>> vm state" (which is not the compiled frame necessarily) available at the
>> div
>> instruction. You could then re-enter the interpreter at the specified
>> index
>> (part of the vm state). We have almost all of these mechanisms in place.
>> Ideally, you'd trigger a recompile and otherwise ensure re-entry into
>> compiled code at the soonest possible moment.
>>
>> This requires a lot of runtime support, but we already have most of it
>> implemented for another compiler. From our perspective, the runtime
>> requirements are not a major blocker.
>>
>> Right, you'll use a patchpoint. That's way more expensive than using a
>> safe
>> division intrinsic with branches, because it's opaque to the optimizer.
>>
>> This statement is true at the moment, but it shouldn't be. I think this is
>> our fundamental difference in approach.
>>
>> You should be able to write something like:
>>
>>   %res = invoke i32 @llvm.experimental.patchpoint.i32(
>>              ..., @x86_trapping_divide, i32 %a, i32 %b)
>>          to label %normal_dest unwind label %invoke_dest
>>
>> normal_dest:
>>   ;; use %res
>>
>> invoke_dest:
>>   landingpad ...
>>   ;; dispatch edge cases
>>   ;; this could be unreachable code if you deopt this frame in the trap
>>   ;; handler and jump directly to an interpreter or other bit of code
>>
>> I see. It sounds like you want a generalization of the "div.with.stackmap"
>> that I thought you wanted - you want to be able to wrap anything in a
>> stackmap.
>>
>> The current patchpoint intrinsic does not do this, and you run the risk of
>> breaking existing semantics if you changed this. You'd probably break
>> WebKit, which treats the call target of the patchpoint as nothing more
>> than
>> a quirk - we always pass null. Also, the current patchpoint treats the
>> callee as an i8* if I remember right and it would be super weird if all
>> LLVM
>> phases had to interpret this i8* by unwrapping a possible bitcast to get
>> to
>> a declared function that may be an intrinsic. Yuck! Basically, the call
>> target of existing patchpoints is meant to be a kind of convenience
>> feature
>> rather than the core of the mechanism.
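>>
>> For reference, the declared forms in the current documentation are:
>>
>>   declare void @llvm.experimental.stackmap(i64 <id>, i32 <numShadowBytes>, ...)
>>   declare void @llvm.experimental.patchpoint.void(i64 <id>, i32 <numBytes>,
>>                                                   i8* <target>, i32 <numArgs>, ...)
>>   declare i64 @llvm.experimental.patchpoint.i64(i64 <id>, i32 <numBytes>,
>>                                                 i8* <target>, i32 <numArgs>, ...)
>>
>> The target really is just an opaque i8*.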
>>
>> I agree in principle that the intrinsic that you want would be a useful
>> intrinsic. But let's not call it a patchpoint for the purposes of this
>> discussion, and let's not confuse the discussion by claiming (incorrectly)
>> that the existing patchpoint facility gives you what you want. It doesn't:
>> patchpoints are designed to make the call target opaque (hence the use of
>> i8*) and there shouldn't be a correlation between what the patchpoint does
>> at run-time and what the called function would have done. You could make
>> the call target be null (like WebKit does) and the patchpoint should still
>> mean "this code can do anything" because the expectation is that the
>> client
>> JIT will patch over it anyway.
>>
>> Also, "patchpoint" would probably not be the right term for the intrinsic
>> that you want. I think that what you want is "call.with.stackmap". Or
>> maybe "stackmap.wrapper". Or just "stackmap" - I'd be OK, in principle,
>> with changing the name of the current "stackmap" intrinsic to something
>> that
>> reflects the fact that it's less of a stackmap than what you want.
>>
>> To summarize. A patchpoint's main purpose is that you can patch it with
>> arbitrary code. The current "stackmap" means that you can patch it with
>> arbitrary code and that patching may be destructive to a shadow of machine
>> code bytes, so it's really just like patchpoints - we could change its
>> name
>> to "patchpoint.shadow" for example.
>>
>> If you were to propose such a stackmap intrinsic, then I think there could
>> be some ways of doing it that wouldn't be too terrible. Basically you want
>> something that is like a patchpoint in that it reports a stackmap via a
>> side
>> channel, but unlike patchpoints, it doesn't allow arbitrary patching -
>> instead the optimizer should be allowed to assume that the code within the
>> patchpoint will always do the same thing that the call target would have
>> done. There are downsides to truly doing this. For example, to make
>> division efficient with such an intrinsic, you'd have to do something that
>> is somewhat worse than just recognizing intrinsics in the optimizer -
>> you'd
>> have to first recognize a call to your "stackmap wrapper" intrinsic and
>> then
>> observe that its call target argument is in turn another intrinsic. To me
>> personally, that's kind of yucky, but I won't deny that it could be
>> useful.
>>
>> As to the use of invoke: I don't believe that the use of invoke versus my
>> suggested "branch on a trap predicate" idea are different in any truly
>> meaningful way. I buy that either would work.
>>
>>
>>
>> A patchpoint should not require any excess spilling. If values are live in
>> registers, that should be reflected in the stack map. (I do not know if
>> this is the case for patchpoint at the moment or not.)
>>
>> Patchpoints do not require spilling.
>>
>> My point was that with existing patchpoints, you *either* use a patchpoint
>> for the entire division which makes the division opaque to the optimizer -
>> because a patchpoint means "this code can do anything" - *or* you could
>> spill stuff to the stack prior to your trapping division intrinsic, since
>> spilling is something that you could do as a workaround if you didn't have
>> a
>> patchpoint.
>>
>> The reason why I brought up spilling at all is that I suspect that
>> spilling
>> all state to the stack might be cheaper - for some systems - than turning
>> the division into a patchpoint. Turning the division into a patchpoint is
>> horrendously brutal - the patchpoint looks like it clobbers the heap
>> (which
>> a division doesn't do), has to execute (a division is an obvious DCE
>> candidate), cannot be hoisted (hoisting divisions is awesome), etc.
>> Perhaps
>> most importantly, though, a patchpoint doesn't tell LLVM that you're
>> *doing
>> a division* - so all constant folding, strength reduction, and algebraic
>> reasoning flies out the window. On the other hand, spilling all state to
>> the stack is an arguably sound and performant solution to a lot of VM
>> problems. I've seen JVM implementations that ensure that there is always a
>> copy of state on the stack at some critical points, just because it makes
>> loads of stuff simpler (debugging, profiling, GC, and of course deopt). I
>> personally prefer to stay away from such a strategy because it's not free.
>>
>> On the other hand, branches can be cheap. A branch on a divide is cheaper
>> than not being able to optimize the divide.
>>
>>
>>
>> The Value called by a patchpoint should participate in optimization
>> normally.
>>
>> I agree that you could have a different intrinsic that behaves like this.
>>
>> We really want the patchpoint part of the call to be supplemental. It
>> should still be a call. It should be constant propagated, transformed,
>> etc. This is not the case currently. I've got a couple of off-the-wall
>> ideas for improving the current status, but I'll agree this is a hardish
>> problem.
>>
>> It should be legal to use a patchpoint in an invoke. It's an ABI issue of
>> how the invoke path gets invoked. (i.e. side tables for the runtime to
>> lookup, etc..) This is not possible today, and probably requires a fair
>> amount of work. Some of it, I've already done and will be sharing shortly.
>> Other parts, I haven't even thought about.
>>
>> Right, it's significantly more complex than either the existing
>> patchpoints
>> or Michael's proposed safe.div.
>>
>>
>>
>> If you didn't want to use the trapping semantics, you'd insert dedicated
>> control flow _before_ the divide. This would allow normal optimization of
>> the control flow.
>>
>> Notes:
>> 1) This might require a new PATCHPOINT pseudo op in the backend. Haven't
>> thought much about that yet.
>> 2) I *think* your current intrinsic could be translated into something
>> like
>> this. (Leaving aside the question of where the deopt state comes from.) In
>> fact, the more I look at this, the less difference I actually see between
>> the approaches.
>>
>>
>>
>> In a lot of languages, a divide produces some result even in the
>> exceptional
>> case and this result requires effectively deoptimizing since the result
>> won't
>> be the one you would have predicted (double instead of int, or BigInt
>> instead of small int), which sort of means that if the CPU exception
>> occurs
>> you have to be able to reconstruct all state. A patchpoint could do this,
>> and so could spilling all state to the stack before the divide - but both
>> are very heavy hammers that are sure to be more expensive than just doing
>> a
>> branch.
>>
>> This isn't necessarily as expensive as you might believe. I'd recommend
>> reading the Graal project papers on this topic.
>>
>> Whether deopt or branching is more profitable *in this case*, I can't
>> easily
>> say. I'm not yet to the point of being able to run that experiment. We can
>> argue about what "should" be better all we want, but real performance data
>> is the only way to truly know.
>>
>> My point may have been confusing. I know that deoptimization is cheap and
>> WebKit uses it everywhere, including division corner cases, if profiling
>> tells us that it's profitable to do so (which it does, in the common
>> case).
>> WebKit is a heavy user of deoptimization in general, so you don't need to
>> convince me that it's worth it.
>>
>> Acknowledged.
>>
>> Note that I want *both* deopt *and* branching, because in this case, a
>> branch is the fastest overall way of detecting when to deopt. In the
>> future, I will want to implement the deopt in terms of branching, and when
>> we do this, I believe that the most sound and performant approach would be
>> using Michael's intrinsics. This is subtle and I'll try to explain why
>> it's
>> the case.
>>
>> The point is that you wouldn't want to do deoptimization by spilling state
>> on the main path or by using a patchpoint for the main path of the
>> division.
>>
>> This is the main point I disagree with. I don't believe that having a
>> patchpoint on the main path should be any more expensive than the original
>> call. (see above)
>>
>> The reason why the patchpoint is expensive is that if you use a patchpoint
>> to implement a division then the optimizer won't be allowed to assume that
>> it's a division, because the whole point of "patchpoint" is to tell the
>> optimizer to piss off.
>>
>>
>>
>> Worth noting explicitly: I'm assuming that all of your deopt state would
>> already be available for other purposes in nearby code. It's on the stack
>> or in registers. I'm assuming that by adding the deopt point, you are not
>> radically changing the set of computations which need to be done. If
>> that's
>> not the case, you should avoid deopt and instead just inline the slow
>> paths
>> with explicit checks.
>>
>> Yes, of course it is. That's not the issue.
>>
>>
>>
>> I'll note that given your assumptions about the cost of a patchpoint, the
>> rest of your position makes a lot more sense. :) As I spelled out above, I
>> believe this cost is not fundamental.
>>
>> You don't want the common path of executing the division to involve a
>> patchpoint instruction, although using a patchpoint or stackmap to
>> implement
>> deoptimization on the failing path is great:
>>
>> Good: if (division would fail) { call @patchpoint(all of my state) }
>>       else { result = a / b }
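>>
>> Spelled out in IR, the "good" shape is roughly the following (the
>> patchpoint id, shadow size, and state operands are illustrative):
>>
>>   %iszero   = icmp eq i32 %b, 0
>>   %isintmin = icmp eq i32 %a, -2147483648
>>   %isnegone = icmp eq i32 %b, -1
>>   %overflow = and i1 %isintmin, %isnegone
>>   %fail     = or i1 %iszero, %overflow
>>   br i1 %fail, label %deopt, label %fast
>>
>> fast:
>>   %result = sdiv i32 %a, %b
>>   br label %continue
>>
>> deopt:
>>   call void (i64, i32, i8*, i32, ...)*
>>     @llvm.experimental.patchpoint.void(i64 <id>, i32 <shadowBytes>,
>>                                        i8* null, i32 <numArgs>,
>>                                        ...all of my state...)
>>   unreachable
>>
>> Everything here is visible to the optimizer except the deopt block.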
>>
>> Given your cost assumptions, I'd agree.
>>
>> Not my cost assumptions. The reason why this is better is that the
>> division
>> is expressed in LLVM IR so that LLVM can do useful things to it - like
>> eliminate it, for example.
>>
>>
>> Bad: call @patchpoint(all of my state) // patch with a divide
>> instruction - bad because the optimizer has no clue what you're doing
>> and assumes the very worst
>>
>> Yuck. Agreed.
>>
>> To be clear, this is what you're proposing - except that you're assuming
>> that LLVM will know that you've patched a division because you're
>> expecting
>> the call target to have semantic meaning. Or, rather, you're expecting
>> that
>> you can make the contents of the patchpoint be a division by having the
>> call
>> target be a division intrinsic. In the current implementation and as it is
>> currently specified, the call target has no meaning and so you get the
>> yuck
>> that I'm showing.
>>
>>
>> Worse: spill all state to the stack; call @trapping.div(a, b) // the
>> spills will hurt you far more than a branch, so this should be avoided
>>
>> I'm assuming this is an explicit spill rather than simply recording a
>> stack
>> map *at the div*. If so, agreed.
>>
>> I suppose we could imagine a fourth option that involves a patchpoint to
>> pick up the state and a trapping divide instrinsic. But a trapping divide
>> intrinsic alone is not enough. Consider this:
>>
>> result = call @trapping.div(a, b); call @stackmap(all of my state)
>>
>> As soon as these are separate instructions, you have no guarantees that
>> the
>> state that the stackmap reports is sound for the point at which the div
>> would trap.
>>
>> This is the closest to what I'd propose, except that the two calls would
>> be
>> merged into a single patchpoint. Isn't the entire point of a patchpoint to
>> record the stack map for a call?
>>
>> No. It would be bad if that's what the documentation says. That's not at
>> all how WebKit uses it or probably any IC client would use it.
>>
>> Patchpoints are designed to be inline assembly on steroids. They're there
>> to allow the client JIT to tell LLVM to piss off.
>>
>> (Well, ignoring the actual patching part..) Why not write this as:
>> patchpoint(..., trapping.div, a, b);
>>
>> Is there something I'm missing here?
>>
>> Just to note: I fully agree that the two call proposal is unsound and
>> should
>> be avoided.
>>
>> So, the division itself shouldn't be a trapping instruction and instead
>> you
>> want to detect the bad case with a branch.
>>
>> To be clear:
>>
>> - Whether you use deoptimization for division or anything else - like
>> WebKit
>> has done since before any of the Graal papers were written - is mostly
>> unrelated to how you represent the division, unless you wanted to add a
>> new
>> intrinsic that is like a trapping-division-with-stackmap:
>>
>> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>>
>> Now, maybe you do want such an intrinsic, in which case you should propose
>> it!
>>
>> Given what we already have with patchpoints, I don't think a merged
>> intrinsic is necessary. (See above). I believe we have all the parts to
>> build this solution, and that - in theory - they should compose neatly.
>>
>> p.s. The bits I was referring to were not deopt per se, but rather
>> which set of deopt state you used for each deopt point. That's a bit of
>> a tangent for the rest of the discussion now though.
>>
>> The reason why I haven't proposed it is that I think that long-term, the
>> currently proposed intrinsics are a better path to getting the trapping
>> optimizations. See my previous mail, where I show how we could tell LLVM
>> what the failing path is (which may have deoptimization code that uses a
>> stackmap or whatever), what the trapping predicate is (it comes from the
>> safe.div intrinsic), and the fact that trapping is wise (branch weights).
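>>
>> (The branch-weights part is just the usual !prof metadata on the branch
>> the intrinsic desugars into, e.g.:
>>
>>   br i1 %fail, label %deopt, label %fast, !prof !0
>>   ...
>>   !0 = metadata !{metadata !"branch_weights", i32 1, i32 65536}
>>
>> with the weights taken from profiling.)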
>>
>> - If you want to do the deoptimization with a trap, then your only choice
>> currently is to use a patchpoint for the main path of the division. This
>> will be slower than using a branch to an OSR exit basic block, because
>> you're making the division itself opaque to the optimizer (bad) just to
>> get
>> rid of a branch (which was probably cheap to begin with).
>>
>> Hence, what you want to do - one way or another, regardless of whether
>> this
>> proposed intrinsic is added - is to branch on the corner case condition,
>> and
>> have the slow case of the branch go to a basic block that deoptimizes.
>> Unless of course you have profiling that says that the case does happen
>> often, in which case you can have that basic block handle the corner case
>> inline without leaving optimized code (FWIW, we do have such paths in
>> WebKit
>> and they are useful).
>>
>> So the question for me is whether the branching involves explicit control
>> flow or is hidden inside an intrinsic. I prefer for it to be within an
>> intrinsic because it:
>>
>> - allows the optimizer to do more interesting things in the common cases,
>> like hoisting the entire division.
>>
>> - will give us a clearer path for implementing trapping optimizations in
>> the
>> future.
>>
>> - is an immediate win on ARM.
>>
>> I'd be curious to hear what specific idea you have about how to implement
>> trap-based deoptimization with your trapping division intrinsic for x86 -
>> maybe it's different from the two "bad" idioms I showed above.
>>
>> I hope my explanation above helps. If not, ask, and I'll try to explain
>> more clearly.
>>
>> I think I understand it. I think that the only issue is that:
>>
>> - Patchpoints currently don't do what you want.
>>
>> - If you made patchpoints do what you want then you'd break WebKit - not
>> to
>> mention anyone who wants to use them for inline caches.
>>
>> So it seems like you want a new intrinsic. You should officially propose
>> this new intrinsic, particularly since the core semantic differences are
>> not
>> so great from what we have now. OTOH, if you truly believe that
>> patchpoints
>> should just be changed to your semantics in a way that does break WebKit,
>> then that's probably also something you should get off your chest. ;-)
>>
>>
>>
>> One point just for clarity; I don't believe this affects the conclusions
>> of
>> our discussion so far. I'm also fairly sure that you (Filip) are aware of
>> this, but want to spell it out for other readers.
>>
>> You seem to be assuming that compiled code needs to explicitly branch to a
>> point where deopt state is known to exit a compiled frame.
>>
>> This is a slightly unclear characterization of my assumptions. Our JIT
>> does
>> deoptimization without explicit branches for many, many things. You should
>> look at it some time, it's pretty fancy. ;-)
>>
>> Worth noting is that you can also exit a compiled frame on a trap (without
>> an explicit condition/branch!) if the deopt state is known at the point
>> you take the trap. This "exit frame on trap" behavior shows up with null
>> pointer exceptions as well. I'll note that both compilers in OpenJDK
>> support some combination of "exit-on-trap" conditions for division and
>> null
>> dereferences. The two differ on exactly what strategies they use in each
>> case though. :)
>>
>> Yeah, and I've also implemented VMs that do this - and I endorse the basic
>> idea. I know what you want, and my only point is that the existing
>> patchpoints only give you this if you're willing to make a huge
>> compromise:
>> namely, that you're willing to make the division (or heap load for the
>> null
>> case) completely opaque to the compiler to the point that GVN, LICM, SCCP,
>> and all algebraic reasoning have to give up on optimizing it. The point of
>> using LLVM is that it can optimize code. It can optimize branches and
>> divisions pretty well. So, eliminating an explicit branch by replacing it
>> with a construct that appears opaque to the optimizer is not a smart
>> trade-off.
>>
>> You could add a new intrinsic that, like patchpoints, records the layout
>> of
>> state in a side-table, but that is used as a kind of wrapper for
>> operations
>> that LLVM understands. This may or may not be hairy - you seem to have
>> sort
>> of acknowledged that it's got some complexity and I've also pointed out
>> some
>> possible issues. If this is something that you want, you should propose it
>> so that others know what you're talking about. One danger of how we're
>> discussing this right now is that you're overloading patchpoints to mean
>> the
>> thing you want them to mean rather than what they actually mean, which
>> makes
>> it seem like we don't need Michael's intrinsics on the grounds that
>> patchpoints already offer a solution. They don't already offer a solution
>> precisely because patchpoints don't do what your intrinsics would do.
>>
>>
>>
>> I'm not really arguing that either scheme is "better" in all cases. I'm
>> simply arguing that we should support both and allow optimization and
>> tuning
>> between them. As far as I can tell, you seem to be assuming that an
>> explicit branch to known exit point is always superior.
>>
>>
>> Ok, back to the topic at hand...
>>
>> With regards to the current proposal, I'm going to take a step back. You
>> guys seem to have already looked in this in a fair amount of depth. I'm
>> not
>> necessarily convinced you've come to the best solution, but at some point,
>> we need to make forward progress. What you have is clearly better than
>> nothing.
>>
>> Please go ahead and submit your current approach. We can come back and
>> revise later if we really need to.
>>
>> I do request the following changes:
>> - Mark it clearly as experimental.
>>
>>
>> - Either don't specify the value computed in the edge cases, or allow
>> those
>> values to be specified as constant arguments to the call. This allows
>> efficient lowering to x86's div instruction if you want to make use of the
>> trapping semantics.
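>>
>> Something like this hypothetical variant, where the last two operands
>> give the div-by-zero and overflow results:
>>
>>   %res = call i32 @llvm.safe.sdiv.i32(i32 %a, i32 %b,
>>                                       i32 0, i32 -2147483648)
>>
>> If the edge-case values are left unspecified, the x86 lowering can be a
>> bare div, with the runtime's trap handler supplying the edge-case
>> behavior.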
>>
>> Once again: how would you use this to get trapping semantics without
>> throwing all of LLVM's optimizations out the window, in the absence of the
>> kind of patchpoint-like intrinsic that you want? I ask just to make sure
>> that we're on the same page.
>>
>>
>>
>> Finally, as for performance data, which part of this do you want
>> performance
>> data for? I concede that I don't have performance data for using Michael's
>> new intrinsic. Part of what the intrinsic accomplishes is it gives a less
>> ugly way of doing something that is already possible with target
>> intrinsics
>> on ARM. I think it would be great if you could get those semantics - along
>> with a known-good implementation - on other architectures as well.
>>
>> I would be very interested in seeing data comparing two schemes:
>> - Explicit control flow emitted by the frontend
>> - The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep
>>
>> My strong suspicion is that each would perform well in some cases and not
>> in
>> others. At least on x86. Since the edge-checks are essentially free on
>> ARM, the second scheme would probably be strictly superior there.
>>
>> I am NOT asking that we block submission on this data however.
>>
>> But this discussion has also involved suggestions that we should use
>> trapping to implement deoptimization, and the main point of my message is
>> to
>> strongly argue against anything like this given the current state of
>> trapping semantics and how patchpoints work. My point is that using traps
>> for division corner cases would either be unsound (see the stackmap after
>> the trap, above), or would require you to do things that are obviously
>> inefficient. If you truly believe that the branch to detect division slow
>> paths is more expensive than spilling all bytecode state to the stack or
>> using a patchpoint for the division, then I could probably hack something
>> up
>> in WebKit to show you the performance implications. (Or you could do it
>> yourself, the code is open source...)
>>
>> In a couple of months, I'll probably have the performance data to discuss
>> this for real. When that happens, let's pick this up and continue the
>> debate. Alternatively, if you want to chat this over more with a beer in
>> hand at the social next week, let me know. In the meantime, let's not
>> stall
>> the current proposal any more.
>>
>> Philip
>>
>>
>
>



