[cfe-dev] Lambdas and the ABI?

Wed Mar 22 11:41:14 PDT 2017

On 21 March 2017 at 21:01, John McCall <rjmccall at apple.com> wrote:

> > On Mar 21, 2017, at 7:24 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > On 03/21/2017 03:02 PM, John McCall wrote:
> >>> On Mar 21, 2017, at 9:03 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> >>> Hi Richard, Chandler, John, et al.,
> >>>
> >>> "Quick" question: What aspects, if any, of a C++ lambda (e.g. size,
> layout, alignment) leak into the ABI or are (potentially) semantically
> visible?
> >> C++ lambdas are anonymous local types that are constrained to appear in
> function bodies (including implicit "function" bodies, e.g. the
> initializers of global variables).  The main semantic rule for local types
> like lambdas is that they're supposed to be the "same type" in all
> translation units that see the same function body, meaning (among many
> other things) that the ABI properties of the type must be the same.  Now,
> most function bodies can only appear in only a single translation unit, so
> this rule is trivially satisfied and the compiler has total flexibility.
> But the body of an inline or template function can usually appear in
> multiple translation units, and so the ABI has to be consistent.
> >>
> >>> Context...
> >>>
> >>> We're investigating late lowering/outlining schemes to improve code
> generation for OpenMP and other parallel programming models. One important
> advantage of outlining late is that the IR-level optimizer still has access
> to the pointer-aliasing and loop information from the containing function.
> There are natural benefits to C++ lambdas as well. Lambdas that are used
> "in place" (i.e. only have one call site) should always be inlined, so the
> issues don't come up there, but for lambdas that have multiple call sites,
> or worse, escape (via std::function or some other type-erasure mechanism),
> we can get suboptimal optimization results for the body of the lambda. It
> would seem sad to fix this for OpenMP but not for lambdas.
> >>>
> >>> However, optimizing before outlining could mean changes in what
> variables need to be captured (not semantically, but at the IR level), and
> so I'd like to think through what would constrain the optimizer's freedom
> to act in this regard.
> >> "Subfunction" representations are very useful for coroutine-esque
> situations where an inner function's control flow interactions with the
> outer function are essentially well-understood.  That is, to be useful, you
> really want the blocks of the inner function to fit into the CFG of the
> outer function at a unique and appropriate place; otherwise you're just
> contorting the representation of the outer function in a way that will
> probably massively hamper optimization.
> >>
> >> I can definitely see how there might be many situations in OpenMP that
> have this kind of CFG knowledge.  Swift has some situations like this as
> well.  Unfortunately, C++ lambdas and ObjC blocks are not a case where we
> have this knowledge, because the closure can be passed around and invoked
> arbitrarily many times and at arbitrary later points, which is not
> something that can ever fit cleanly into the CFG; pre-emptive "outlining"
> is just a superior representation for such cases.
> >
> > This is a very good point. We only know that the lambda will execute,
> and perhaps execute multiple times, after some given point. The same is
> true (to some extent) for some OpenMP features. For example, we can
> generate OpenMP tasks in a function, and we know only that they'll run
> sometime before the end of the containing parallel region, taskwait
> directive, etc. This might be outside the scope of the function.
> >
> > One interesting question is: To what extent does this affect our ability
> use analysis (AA, SCEV, etc.) of the variables at point of capture to
> optimize the body of the "subfunction." The aliasing properties of
> pointers, for example, and the bounds of captured induction variables,
> should be fine. However, if there's anything that might be rendered invalid
> by some later change in state (e.g. a write to memory), that wouldn't be
> usable. We also need to make sure that pointers appropriately appear to be
> captured (just as they would be if we used an outlined function
> representation). Luckily, I suppose, we need to do that anyway.
>
> Right.  I would say that there are three interesting levels of
> closure-capture optimization:
>
> 1. We don't know when the inner function will be called; the outer
> function's frame may not still be intact.  This is generally true of blocks
> and lambdas because they can be copied off the stack.  If the block/lambda
> captures a reference to something in the outer function, U.B. might let us
> assume that the frame still exists, but I think only if the block/lambda
> actually uses that captured reference.  Richard, let us know if there's
> some special rule that would help us here.
>

Any use of a non-reference captured by reference would require the captured
variable to still exist, and so we can in principle use a pointer to the
outer function's stack frame for those captures (but not by-copy captures
nor reference captures of references). But we can't assume that the outer
function's stack frame will be active when such a lambda is called, only
when such a capture is used.

It's hard to see what optimizations would be possible here.  Maybe, if we
> controlled the copying process, we could delay copying captured values into
> the block/lambda temporary, and instead just access them directly in the
> inner function?  But the inner function would have to be compiled two ways,
> one for when the variables have been copied and one when they're still
> in-place.  This is a lot of extra complexity.
>

One possibly-interesting optimization would be constant propagation of
captures: we statically know at the point where we emit the construction of
the lambda and its body what all the uses of each capture are, so this
isn't the normal somewhat-unbounded problem of constant propagation through
class members.

> 2. We don't know exactly when or how often the inner function will be
> called, but there's a definite point in the outer function past which the
> inner function will no longer be used.  This is true of some OpenMP
> features, "noescape" blocks in ObjC, and noescape blocks/closures in Swift.
>
> We can optimize this by eliminating copies and indirections: instead of
> aggressively copying values and local addresses into the block/lambda, we
> can just store the enclosing frame pointer.  If a capture is semantically
> "by reference", i.e. changes to it in the inner function are reflected in
> the outer and vice-versa, then we have to temporarily pin it into memory
> and treat its address as having escaped; we can then just access it
> relative to the captured frame pointer.  If a capture is semantically "by
> value", i.e. it copies the value of a variable at a specific point, then we
> just have to keep that current value live for a while in the outer frame;
> in many cases, we can prove that neither function will modify it, so this
> can safely degrade to the by-reference implementation.
>
> 3. We know exactly when the inner function will be called and it can be
> fit into the outer function's CFG.  This is true of some OpenMP features
> and some very specific Swift features.
>
> We can still do normal data-flow and control-flow analyses between the
> outer and inner functions; the only constraint is that we (probably) have
> to treat the beginning and end of the inner function as opaque barriers.
> So the unlowered function looks something like:
>
>   %initial_result = call %initial_result_t @begin_subfunction(%initial_args_t
> %initial_args)
>   // subfunction goes here
>   %final_result = call %final_result_t @end_subfunction(%final_args_t
> %final_args)
>
> and eventually that gets lowered into something like:
>
>   %final_result = call %final_result_t @do_with_subfunction(%initial_args,
> @subfunction)
>
>   define %final_args_t @subfunction(%initial_result_t %initial_result) {
>     // subfunction goes here
>     ret %final_args_t %final_args
>   }
>
> with data flow in and out handled basically like in case #2.  This is one
> class of thing that people want to be able to do with the coroutine
> proposal.
>
> John.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/cfe-dev/attachments/20170322/4cdb2e48/attachment.html>