[LLVMdev] Proposal: add intrinsics for safe division

Eric Christopher echristo at gmail.com
Fri May 2 11:53:22 PDT 2014


On Wed, Apr 30, 2014 at 10:34 PM, Philip Reames
<listmail at philipreames.com> wrote:
> Andy - If you're not already following this closely, please start.  We've
> gotten into fairly fundamental questions of what a patchpoint does.
>
> Filip,
>
> I think you've hit the nail on the head.  What I'm thinking of as being
> patchpoints are not what you think they are.  Part of that is that I've got
> a local change which adds a very similar construction (called "statepoints"
> for the moment), but I was trying to keep that separate.  That also includes
> a lot of GC semantics which are not under discussion currently.  My
> apologies if that experience bled over into this conversation and made
> things more confusing.
>
> I will note that the documentation for patchpoint says explicitly the
> following:
> "The ‘llvm.experimental.patchpoint.*‘ intrinsics creates a function call to
> the specified <target> and records the location of specified values in the
> stack map."
>
> My reading has always been that a patchpoint *that isn't patched* is simply
> a call with a stackmap associated with it.  To my reading, this can (and
> did, and does) indicate my proposed usage would be legal.
>

I like the idea that the target can be assumed to be called. It makes
optimization of the call possible, etc. I think it's definitely worth
exploring before we lock down the patchpoint intrinsic.

-eric

> I will agree that I've confused the topic badly on the optimization front.
> My "statepoint" isn't patchable, so a lot more optimizations are legal.
> Sorry about that.  To restate what I think you've been saying all along, the
> optimizer can't make assumptions about what function is called by a
> patchpoint because that might change based on later patching.  Is this the
> key point you've been trying to make?
>
> I'm not objecting to separating "my patchpoint" from "your patchpoint".
> Let's just hammer out the semantics of each first.  :)
>
> Again, longer response to follow in a day or so. :)
>
> Philip
>
>
> On 04/30/2014 10:09 PM, Filip Pizlo wrote:
>
>
>
> On April 30, 2014 at 9:06:20 PM, Philip Reames (listmail at philipreames.com)
> wrote:
>
> On 04/29/2014 12:39 PM, Filip Pizlo wrote:
>
> On April 29, 2014 at 11:27:06 AM, Philip Reames (listmail at philipreames.com)
> wrote:
>
> On 04/29/2014 10:44 AM, Filip Pizlo wrote:
>
> TL;DR: Your desire to use trapping on x86 only further convinces me that
> Michael's proposed intrinsics are the best way to go.
>
> I'm still not convinced, but am not going to actively oppose it either.  I'm
> leery of designing a solution with major assumptions we don't have data to
> backup.
>
> I worry your assumptions about deoptimization are potentially unsound.  But
> I don't have data to actually show this (yet).
>
> I *think* I may have been unclear about my assumptions; in particular, my
> claims with respect to deoptimization are probably more subtle than they
> appeared.  WebKit can use LLVM and it has divisions and we do all possible
> deoptimization/profiling/etc tricks for it, so this is grounded in
> experience.  Forgive me if the rest of this e-mail contains a lecture on
> things that are obvious - I'll try to err on the side of clarity and
> completeness since this discussion is sufficiently dense that we run the
> risk of talking at cross-purposes unless some baseline assumptions are
> established.
>
> I think we're using the same terminology, but with slightly different sets
> of assumptions.  I'll point this out below where relevant.
>
> Also, thanks for taking the time to expand.  It helped clarify the discussion
> quite a bit.
>
> I think we may be converging to an understanding of what you want versus
> what I want, and I think that there are some points - possibly unrelated to
> division - that are worth clarifying.  I think that the main difference is
> that when I say "patchpoint", I am referring to a concrete intrinsic with
> specific semantics that cannot change without breaking WebKit, while you are
> using the term to refer to a broad concept, or rather, a class of
> as-yet-unimplemented intrinsics that share some of the same features with
> patchpoints but otherwise have incompatible semantics.
>
> Also, when I say that you wouldn't want to use the existing patchpoint to do
> your trapping deopt, what I mean is that the performance of doing this would
> suck for reasons that are not related to deoptimization or trapping.  I'm
> not claiming that deoptimization performs poorly (trust me, I know better)
> or that trapping to deoptimize is bad (I've done this many, many times and I
> know better).  I'm saying that with the current patchpoint intrinsics in
> LLVM, as they are currently specified and implemented, doing it would be a
> bad idea because you'd have to compromise a bunch of other optimizations to
> achieve it.
>
> You have essentially described new intrinsics that would make this less of a
> bad idea and I am interested in your intrinsics, so I'll try to both respond
> with why patchpoints don't currently give you what you want (and why simply
> changing patchpoint semantics would be evil) and I'll also try to comment on
> what I think of the intrinsic that you're effectively proposing.  Long story
> short, I think you should formally propose your intrinsic even if it's not
> completely fleshed out.  I think that it's an interesting capability and in
> its most basic form, it is a simple variation of the current
> patchpoint/stackmap intrinsics.
>
>
>
>
>
> On April 29, 2014 at 10:09:49 AM, Philip Reames (listmail at philipreames.com)
> wrote:
>
> As the discussion has progressed and I've spent more time thinking about the
> topic, I find myself less and less enthused about the current proposal.  I'm
> in full support of having idiomatic ways to express safe division, but I'm
> starting to doubt that using an intrinsic is the right way at the moment.
>
> One case I find myself thinking about is how one would combine profiling
> information and implicit div-by-zero/overflow checks with this proposal.  I
> don't really see a clean way.  Ideally, for a "safe div" which never has the
> exceptional paths taken, you'd like to do away with the control flow
> entirely.  (And rely on hardware traps w/exceptions instead.)  I don't
> really see a way to represent that type of construct given the current
> proposal.
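>
> (For concreteness, my reading of the proposed intrinsic - the exact names
> and types may differ from what Michael lands - is something like:
>
>   ; quotient plus an i1 that is set on the div-by-zero and INT_MIN/-1
>   ; edge cases
>   %pair = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b)
>   %quot = extractvalue { i32, i1 } %pair, 0
>   %edge = extractvalue { i32, i1 } %pair, 1
>
> The branchy expansion of that edge-case check is exactly the control flow
> I'd like to be able to elide.)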
>
> This is a deeper problem and to solve it you'd need a solution to trapping
> in general.  Let's consider the case of Java.  A Java program may want to
> catch the arithmetic exception due to divide by zero.  How would you do this
> with a trap in LLVM IR?  Spill all state that is live at the catch?  Use a
> patchpoint for the entire division instruction?
>
> We'd likely use something similar to a patchpoint.  You'd need the "abstract
> vm state" (which is not the compiled frame necessarily) available at the div
> instruction.  You could then re-enter the interpreter at the specified index
> (part of the vm state).  We have almost all of these mechanisms in place.
> Ideally, you'd trigger a recompile and otherwise ensure re-entry into
> compiled code at the soonest possible moment.
>
> This requires a lot of runtime support, but we already have most of it
> implemented for another compiler.  From our perspective, the runtime
> requirements are not a major blocker.
>
> Right, you'll use a patchpoint.  That's way more expensive than using a safe
> division intrinsic with branches, because it's opaque to the optimizer.
>
> This statement is true at the moment, but it shouldn't be.  I think this is
> our fundamental difference in approach.
>
> You should be able to write something like:
> %res = invoke i32 patchpoint(..., x86_trapping_divide, %a, %b)
>           to label %normal_dest unwind label %invoke_dest
>
> normal_dest:
>   ;; use %res
> invoke_dest:
>   landingpad
>   ;; dispatch edge cases
>   ;; this could be unreachable code if you deopt this frame in the trap
>   ;; handler and jump directly to an interpreter or other bit of code
>
> I see.  It sounds like you want a generalization of the "div.with.stackmap"
> that I thought you wanted - you want to be able to wrap anything in a
> stackmap.
>
> The current patchpoint intrinsic does not do this, and you run the risk of
> breaking existing semantics if you changed this.  You'd probably break
> WebKit, which treats the call target of the patchpoint as nothing more than
> a quirk - we always pass null.  Also, the current patchpoint treats the
> callee as an i8* if I remember right and it would be super weird if all LLVM
> phases had to interpret this i8* by unwrapping a possible bitcast to get to
> a declared function that may be an intrinsic.  Yuck!  Basically, the call
> target of existing patchpoints is meant to be a kind of convenience feature
> rather than the core of the mechanism.
>
> I agree in principle that the intrinsic that you want would be a useful
> intrinsic.  But let's not call it a patchpoint for the purposes of this
> discussion, and let's not confuse the discussion by claiming (incorrectly)
> that the existing patchpoint facility gives you what you want.  It doesn't:
> patchpoints are designed to make the call target opaque (hence the use of
> i8*) and there shouldn't be a correlation between what the patchpoint does
> at run-time and what the called function would have done.  You could make
> the call target be null (like WebKit does) and the patchpoint should still
> mean "this code can do anything" because the expectation is that the client
> JIT will patch over it anyway.
>
> Also, "patchpoint" would probably not be the right term for the intrinsic
> that you want.  I think that what you want is "call.with.stackmap".  Or
> maybe "stackmap.wrapper".  Or just "stackmap" - I'd be OK, in principle,
> with changing the name of the current "stackmap" intrinsic to something that
> reflects the fact that it's less of a stackmap than what you want.
>
> To summarize.  A patchpoint's main purpose is that you can patch it with
> arbitrary code.  The current "stackmap" means that you can patch it with
> arbitrary code and that patching may be destructive to a shadow of machine
> code bytes, so it's really just like patchpoints - we could change its name
> to "patchpoint.shadow" for example.
>
> If you were to propose such a stackmap intrinsic, then I think there could
> be some ways of doing it that wouldn't be too terrible.  Basically you want
> something that is like a patchpoint in that it reports a stackmap via a side
> channel, but unlike patchpoints, it doesn't allow arbitrary patching -
> instead the optimizer should be allowed to assume that the code within the
> patchpoint will always do the same thing that the call target would have
> done.  There are downsides to truly doing this.  For example, to make
> division efficient with such an intrinsic, you'd have to do something that
> is somewhat worse than just recognizing intrinsics in the optimizer - you'd
> have to first recognize a call to your "stackmap wrapper" intrinsic and then
> observe that its call target argument is in turn another intrinsic.  To me
> personally, that's kind of yucky, but I won't deny that it could be useful.
>
> As to the use of invoke: I don't believe that the use of invoke versus my
> suggested "branch on a trap predicate" idea are different in any truly
> meaningful way.  I buy that either would work.
>
>
>
> A patchpoint should not require any excess spilling.  If values are live in
> registers, that should be reflected in the stack map.  (I do not know if
> this is the case for patchpoint at the moment or not.)
>
> Patchpoints do not require spilling.
>
> My point was that with existing patchpoints, you *either* use a patchpoint
> for the entire division which makes the division opaque to the optimizer -
> because a patchpoint means "this code can do anything" - *or* you could
> spill stuff to the stack prior to your trapping division intrinsic, since
> spilling is something that you could do as a workaround if you didn't have a
> patchpoint.
>
> The reason why I brought up spilling at all is that I suspect that spilling
> all state to the stack might be cheaper - for some systems - than turning
> the division into a patchpoint.  Turning the division into a patchpoint is
> horrendously brutal - the patchpoint looks like it clobbers the heap (which
> a division doesn't do), has to execute (a division is an obvious DCE
> candidate), cannot be hoisted (hoisting divisions is awesome), etc.  Perhaps
> most importantly, though, a patchpoint doesn't tell LLVM that you're *doing
> a division* - so all constant folding, strength reduction, and algebraic
> reasoning flies out the window.  On the other hand, spilling all state to
> the stack is an arguably sound and performant solution to a lot of VM
> problems.  I've seen JVM implementations that ensure that there is always a
> copy of state on the stack at some critical points, just because it makes
> loads of stuff simpler (debugging, profiling, GC, and of course deopt).  I
> personally prefer to stay away from such a strategy because it's not free.
>
> On the other hand, branches can be cheap.  A branch on a divide is cheaper
> than not being able to optimize the divide.
>
>
>
> The Value called by a patchpoint should participate in optimization
> normally.
>
> I agree that you could have a different intrinsic that behaves like this.
>
> We really want the patchpoint part of the call to be supplemental.  It
> should still be a call.  It should be constant propagated, transformed,
> etc..  This is not the case currently.  I've got a couple of off the wall
> ideas for improving the current status, but I'll agree this is a hardish
> problem.
>
> It should be legal to use a patchpoint in an invoke.  It's an ABI issue of
> how the invoke path gets invoked.  (i.e. side tables for the runtime to
> lookup, etc..)  This is not possible today, and probably requires a fair
> amount of work.  Some of it, I've already done and will be sharing shortly.
> Other parts, I haven't even thought about.
>
> Right, it's significantly more complex than either the existing patchpoints
> or Michael's proposed safe.div.
>
>
>
> If you didn't want to use the trapping semantics, you'd insert dedicated
> control flow _before_ the divide.  This would allow normal optimization of
> the control flow.
>
> Notes:
> 1) This might require a new PATCHPOINT pseudo op in the backend.  Haven't
> thought much about that yet.
> 2) I *think* your current intrinsic could be translated into something like
> this.  (Leaving aside the question of where the deopt state comes from.)  In
> fact, the more I look at this, the less difference I actually see between
> the approaches.
>
>
>
> In a lot of languages, a divide produces some result even in the exceptional
> case and this result requires effectively deoptimizing since the result won't
> be the one you would have predicted (double instead of int, or BigInt
> instead of small int), which sort of means that if the CPU exception occurs
> you have to be able to reconstruct all state.  A patchpoint could do this,
> and so could spilling all state to the stack before the divide - but both
> are very heavy hammers that are sure to be more expensive than just doing a
> branch.
>
> This isn't necessarily as expensive as you might believe.  I'd recommend
> reading the Graal project papers on this topic.
>
> Whether deopt or branching is more profitable *in this case*, I can't easily
> say.  I'm not yet to the point of being able to run that experiment.  We can
> argue about what "should" be better all we want, but real performance data
> is the only way to truly know.
>
> My point may have been confusing.  I know that deoptimization is cheap and
> WebKit uses it everywhere, including division corner cases, if profiling
> tells us that it's profitable to do so (which it does, in the common case).
> WebKit is a heavy user of deoptimization in general, so you don't need to
> convince me that it's worth it.
>
> Acknowledged.
>
> Note that I want *both* deopt *and* branching, because in this case, a
> branch is the fastest overall way of detecting when to deopt.  In the
> future, I will want to implement the deopt in terms of branching, and when
> we do this, I believe that the most sound and performant approach would be
> using Michael's intrinsics.  This is subtle and I'll try to explain why it's
> the case.
>
> The point is that you wouldn't want to do deoptimization by spilling state
> on the main path or by using a patchpoint for the main path of the division.
>
> This is the main point I disagree with.  I don't believe that having a
> patchpoint on the main path should be any more expensive than the original
> call.  (see above)
>
> The reason why the patchpoint is expensive is that if you use a patchpoint
> to implement a division then the optimizer won't be allowed to assume that
> it's a division, because the whole point of "patchpoint" is to tell the
> optimizer to piss off.
>
>
>
> Worth noting explicitly: I'm assuming that all of your deopt state would
> already be available for other purposes in nearby code.  It's on the stack
> or in registers.  I'm assuming that by adding the deopt point, you are not
> radically changing the set of computations which need to be done.  If that's
> not the case, you should avoid deopt and instead just inline the slow paths
> with explicit checks.
>
> Yes, of course it is.  That's not the issue.
>
>
>
> I'll note that given your assumptions about the cost of a patchpoint, the
> rest of your position makes a lot more sense.  :)  As I spelled out above, I
> believe this cost is not fundamental.
>
> You don't want the common path of executing the division to involve a
> patchpoint instruction, although using a patchpoint or stackmap to implement
> deoptimization on the failing path is great:
>
> Good: if (division would fail) { call @patchpoint(all of my state) } else {
> result = a / b }
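>
> In IR, the "good" shape is roughly the following sketch (the patchpoint
> id, shadow size, and state arguments are placeholders):
>
>   %divByZero = icmp eq i32 %b, 0
>   %numIsMin  = icmp eq i32 %a, -2147483648
>   %denIsM1   = icmp eq i32 %b, -1
>   %overflow  = and i1 %numIsMin, %denIsM1
>   %wouldFail = or i1 %divByZero, %overflow
>   br i1 %wouldFail, label %deopt, label %fast
>
> deopt:                                   ; rarely (if ever) taken
>   call void (i64, i32, i8*, i32, ...)*
>     @llvm.experimental.patchpoint.void(i64 0, i32 0, i8* null, i32 0,
>                                        <all of my state>)
>   unreachable
>
> fast:
>   %result = sdiv i32 %a, %b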
>
> Given your cost assumptions, I'd agree.
>
> Not my cost assumptions.  The reason why this is better is that the division
> is expressed in LLVM IR so that LLVM can do useful things to it - like
> eliminate it, for example.
>
>
> Bad: call @patchpoint(all of my state) // patch with a divide instruction -
> bad because the optimizer has no clue what you're doing and assumes the very
> worst
>
> Yuck.  Agreed.
>
> To be clear, this is what you're proposing - except that you're assuming
> that LLVM will know that you've patched a division because you're expecting
> the call target to have semantic meaning.  Or, rather, you're expecting that
> you can make the contents of the patchpoint be a division by having the call
> target be a division intrinsic.  In the current implementation and as it is
> currently specified, the call target has no meaning and so you get the yuck
> that I'm showing.
>
>
> Worse: spill all state to the stack; call @trapping.div(a, b) // the spills
> will hurt you far more than a branch, so this should be avoided
>
> I'm assuming this is an explicit spill rather than simply recording a stack
> map *at the div*.  If so, agreed.
>
> I suppose we could imagine a fourth option that involves a patchpoint to
> pick up the state and a trapping divide instrinsic.  But a trapping divide
> intrinsic alone is not enough.  Consider this:
>
> result = call @trapping.div(a, b); call @stackmap(all of my state)
>
> As soon as these are separate instructions, you have no guarantees that the
> state that the stackmap reports is sound for the point at which the div
> would trap.
>
> This is the closest to what I'd propose, except that the two calls would be
> merged into a single patchpoint.  Isn't the entire point of a patchpoint to
> record the stack map for a call?
>
> No.  It would be bad if that's what the documentation says.  That's not at
> all how WebKit uses it or probably any IC client would use it.
>
> Patchpoints are designed to be inline assembly on steroids.  They're there
> to allow the client JIT to tell LLVM to piss off.
>
> (Well, ignoring the actual patching part..)  Why not write this as:
> patchpoint(..., trapping.div, a, b);
>
> Is there something I'm missing here?
>
> Just to note: I fully agree that the two call proposal is unsound and should
> be avoided.
>
> So, the division itself shouldn't be a trapping instruction and instead you
> want to detect the bad case with a branch.
>
> To be clear:
>
> - Whether you use deoptimization for division or anything else - like WebKit
> has done since before any of the Graal papers were written - is mostly
> unrelated to how you represent the division, unless you wanted to add a new
> intrinsic that is like a trapping-division-with-stackmap:
>
> result = call @trapping.div.with.stackmap(a, b, ... all of my state ...)
>
> Now, maybe you do want such an intrinsic, in which case you should propose
> it!
>
> Given what we already have with patchpoints, I don't think a merged
> intrinsic is necessary.  (See above).  I believe we have all the parts to
> build this solution, and that - in theory - they should compose neatly.
>
> p.s. The bits I was referring to were not deopt per se.  It was specifically
> which set of deopt state you used for each deopt point.  That's a bit of a
> tangent for the rest of the discussion now though.
>
> The reason why I haven't proposed it is that I think that long-term, the
> currently proposed intrinsics are a better path to getting the trapping
> optimizations.  See my previous mail, where I show how we could tell LLVM
> what the failing path is (which may have deoptimization code that uses a
> stackmap or whatever), what the trapping predicate is (it comes from the
> safe.div intrinsic), and the fact that trapping is wise (branch weights).
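>
> (The "trapping is wise" part is just profile metadata on the branch over
> the safe.div edge-case bit - a sketch, with made-up weights:
>
>   br i1 %edge, label %slow, label %fast, !prof !0
>   ...
>   !0 = metadata !{metadata !"branch_weights", i32 1, i32 100000}
>
> which tells LLVM that the slow path is essentially never taken, making
> the branch a good candidate for a trap-based lowering later.)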
>
> - If you want to do the deoptimization with a trap, then your only choice
> currently is to use a patchpoint for the main path of the division.  This
> will be slower than using a branch to an OSR exit basic block, because
> you're making the division itself opaque to the optimizer (bad) just to get
> rid of a branch (which was probably cheap to begin with).
>
> Hence, what you want to do - one way or another, regardless of whether this
> proposed intrinsic is added - is to branch on the corner case condition, and
> have the slow case of the branch go to a basic block that deoptimizes.
> Unless of course you have profiling that says that the case does happen
> often, in which case you can have that basic block handle the corner case
> inline without leaving optimized code (FWIW, we do have such paths in WebKit
> and they are useful).
>
> So the question for me is whether the branching involves explicit control
> flow or is hidden inside an intrinsic.  I prefer for it to be within an
> intrinsic because it:
>
> - allows the optimizer to do more interesting things in the common cases,
> like hoisting the entire division.
>
> - will give us a clearer path for implementing trapping optimizations in the
> future.
>
> - is an immediate win on ARM.
>
> I'd be curious to hear what specific idea you have about how to implement
> trap-based deoptimization with your trapping division intrinsic for x86 -
> maybe it's different from the two "bad" idioms I showed above.
>
> I hope my explanation above helps.  If not, ask, and I'll try to explain
> more clearly.
>
> I think I understand it.  I think that the only issue is that:
>
> - Patchpoints currently don't do what you want.
>
> - If you made patchpoints do what you want then you'd break WebKit - not to
> mention anyone who wants to use them for inline caches.
>
> So it seems like you want a new intrinsic.  You should officially propose
> this new intrinsic, particularly since the core semantic differences are not
> so great from what we have now.  OTOH, if you truly believe that patchpoints
> should just be changed to your semantics in a way that does break WebKit,
> then that's probably also something you should get off your chest. ;-)
>
>
>
> One point just for clarity; I don't believe this affects the conclusions of
> our discussion so far.  I'm also fairly sure that you (Filip) are aware of
> this, but want to spell it out for other readers.
>
> You seem to be assuming that compiled code needs to explicitly branch to a
> point where deopt state is known to exit a compiled frame.
>
> This is a slightly unclear characterization of my assumptions.  Our JIT does
> deoptimization without explicit branches for many, many things.  You should
> look at it some time, it's pretty fancy. ;-)
>
> Worth noting is that you can also exit a compiled frame on a trap (without
> an explicit condition/branch!) if the deopt state is known at the point
> you take the trap.  This "exit frame on trap" behavior shows up with null
> pointer exceptions as well.  I'll note that both compilers in OpenJDK
> support some combination of "exit-on-trap" conditions for division and null
> dereferences.  The two differ on exactly what strategies they use in each
> case though.  :)
>
> Yeah, and I've also implemented VMs that do this - and I endorse the basic
> idea.  I know what you want, and my only point is that the existing
> patchpoints only give you this if you're willing to make a huge compromise:
> namely, that you're willing to make the division (or heap load for the null
> case) completely opaque to the compiler to the point that GVN, LICM, SCCP,
> and all algebraic reasoning have to give up on optimizing it.  The point of
> using LLVM is that it can optimize code.  It can optimize branches and
> divisions pretty well.  So, eliminating an explicit branch by replacing it
> with a construct that appears opaque to the optimizer is not a smart
> trade-off.
>
> You could add a new intrinsic that, like patchpoints, records the layout of
> state in a side-table, but that is used as a kind of wrapper for operations
> that LLVM understands.  This may or may not be hairy - you seem to have sort
> of acknowledged that it's got some complexity and I've also pointed out some
> possible issues.  If this is something that you want, you should propose it
> so that others know what you're talking about.  One danger of how we're
> discussing this right now is that you're overloading patchpoints to mean the
> thing you want them to mean rather than what they actually mean, which makes
> it seem like we don't need Michael's intrinsics on the grounds that
> patchpoints already offer a solution.  They don't already offer a solution
> precisely because patchpoints don't do what your intrinsics would do.
>
>
>
> I'm not really arguing that either scheme is "better" in all cases.  I'm
> simply arguing that we should support both and allow optimization and tuning
> between them.  As far as I can tell, you seem to be assuming that an
> explicit branch to known exit point is always superior.
>
>
> Ok, back to the topic at hand...
>
> With regards to the current proposal, I'm going to take a step back.  You
> guys seem to have already looked in this in a fair amount of depth.  I'm not
> necessarily convinced you've come to the best solution, but at some point,
> we need to make forward progress.  What you have is clearly better than
> nothing.
>
> Please go ahead and submit your current approach.  We can come back and
> revise later if we really need to.
>
> I do request the following changes:
> - Mark it clearly as experimental.
>
>
> - Either don't specify the value computed in the edge cases, or allow those
> values to be specified as constant arguments to the call.  This allows
> efficient lowering to x86's div instruction if you want to make use of the
> trapping semantics.
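>
> (Purely illustrative - these extra arguments are hypothetical, not part
> of the current proposal:
>
>   ; the constants specify the results on div-by-zero and overflow
>   %pair = call { i32, i1 } @llvm.safe.sdiv.i32(i32 %a, i32 %b,
>                                                i32 0, i32 -2147483648)
>
> If the edge-case results are unconstrained, or match what the target's
> divide instruction already does, the backend is free to emit a bare
> divide and let a trap handler cover the corner cases.)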
>
> Once again: how would you use this to get trapping semantics without
> throwing all of LLVM's optimizations out the window, in the absence of the
> kind of patchpoint-like intrinsic that you want?  I ask just to make sure
> that we're on the same page.
>
>
>
> Finally, as for performance data, which part of this do you want performance
> data for?  I concede that I don't have performance data for using Michael's
> new intrinsic.  Part of what the intrinsic accomplishes is it gives a less
> ugly way of doing something that is already possible with target intrinsics
> on ARM.  I think it would be great if you could get those semantics - along
> with a known-good implementation - on other architectures as well.
>
> I would be very interested in seeing data comparing two schemes:
> - Explicit control flow emitted by the frontend
> - The safe.div intrinsic emitted by the frontend, desugared in CodeGenPrep
>
> My strong suspicion is that each would perform well in some cases and not in
> others.  At least on x86.  Since the edge-checks are essentially free on
> ARM, the second scheme would probably be strictly superior there.
>
> I am NOT asking that we block submission on this data however.
>
> But this discussion has also involved suggestions that we should use
> trapping to implement deoptimization, and the main point of my message is to
> strongly argue against anything like this given the current state of
> trapping semantics and how patchpoints work.  My point is that using traps
> for division corner cases would either be unsound (see the stackmap after
> the trap, above), or would require you to do things that are obviously
> inefficient.  If you truly believe that the branch to detect division slow
> paths is more expensive than spilling all bytecode state to the stack or
> using a patchpoint for the division, then I could probably hack something up
> in WebKit to show you the performance implications.  (Or you could do it
> yourself, the code is open source...)
>
> In a couple of months, I'll probably have the performance data to discuss
> this for real.  When that happens, let's pick this up and continue the
> debate.  Alternatively, if you want to chat this over more with a beer in
> hand at the social next week, let me know.  In the meantime, let's not stall
> the current proposal any more.
>
> Philip
>
>