<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On 21 March 2017 at 21:01, John McCall <span dir="ltr"><<a href="mailto:rjmccall@apple.com" target="_blank">rjmccall@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">> On Mar 21, 2017, at 7:24 PM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>
> On 03/21/2017 03:02 PM, John McCall wrote:<br>
>>> On Mar 21, 2017, at 9:03 AM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>
>>> Hi Richard, Chandler, John, et al.,<br>
>>><br>
>>> "Quick" question: What aspects, if any, of a C++ lambda (e.g. size, layout, alignment) leak into the ABI or are (potentially) semantically visible?<br>
>> C++ lambdas are anonymous local types that are constrained to appear in function bodies (including implicit "function" bodies, e.g. the initializers of global variables). The main semantic rule for local types like lambdas is that they're supposed to be the "same type" in all translation units that see the same function body, meaning (among many other things) that the ABI properties of the type must be the same. Now, most function bodies can only appear in only a single translation unit, so this rule is trivially satisfied and the compiler has total flexibility. But the body of an inline or template function can usually appear in multiple translation units, and so the ABI has to be consistent.<br>
>><br>
>>> Context...<br>
>>><br>
>>> We're investigating late lowering/outlining schemes to improve code generation for OpenMP and other parallel programming models. One important advantage of outlining late is that the IR-level optimizer still has access to the pointer-aliasing and loop information from the containing function. There are natural benefits to C++ lambdas as well. Lambdas that are used "in place" (i.e. only have one call site) should always be inlined, so the issues don't come up there, but for lambdas that have multiple call sites, or worse, escape (via std::function or some other type-erasure mechanism), we can get suboptimal optimization results for the body of the lambda. It would seem sad to fix this for OpenMP but not for lambdas.<br>
>>><br>
>>> However, optimizing before outlining could mean changes in what variables need to be captured (not semantically, but at the IR level), and so I'd like to think through what would constrain the optimizer's freedom to act in this regard.<br>
>> "Subfunction" representations are very useful for coroutine-esque situations where an inner function's control flow interactions with the outer function are essentially well-understood. That is, to be useful, you really want the blocks of the inner function to fit into the CFG of the outer function at a unique and appropriate place; otherwise you're just contorting the representation of the outer function in a way that will probably massively hamper optimization.<br>
>><br>
>> I can definitely see how there might be many situations in OpenMP that have this kind of CFG knowledge. Swift has some situations like this as well. Unfortunately, C++ lambdas and ObjC blocks are not a case where we have this knowledge, because the closure can be passed around and invoked arbitrarily many times and at arbitrary later points, which is not something that can ever fit cleanly into the CFG; pre-emptive "outlining" is just a superior representation for such cases.<br>
><br>
> This is a very good point. We only know that the lambda will execute, and perhaps execute multiple times, after some given point. The same is true (to some extent) for some OpenMP features. For example, we can generate OpenMP tasks in a function, and we know only that they'll run sometime before the end of the containing parallel region, taskwait directive, etc. This might be outside the scope of the function.<br>
><br>
> One interesting question is: To what extent does this affect our ability use analysis (AA, SCEV, etc.) of the variables at point of capture to optimize the body of the "subfunction." The aliasing properties of pointers, for example, and the bounds of captured induction variables, should be fine. However, if there's anything that might be rendered invalid by some later change in state (e.g. a write to memory), that wouldn't be usable. We also need to make sure that pointers appropriately appear to be captured (just as they would be if we used an outlined function representation). Luckily, I suppose, we need to do that anyway.<br>
<br>
</span>Right. I would say that there are three interesting levels of closure-capture optimization:<br>
<br>
1. We don't know when the inner function will be called; the outer function's frame may not still be intact. This is generally true of blocks and lambdas because they can be copied off the stack. If the block/lambda captures a reference to something in the outer function, U.B. might let us assume that the frame still exists, but I think only if the block/lambda actually uses that captured reference. Richard, let us know if there's some special rule that would help us here.<br></blockquote><div><br></div><div>Any use of a non-reference captured by reference would require the captured variable to still exist, and so we can in principle use a pointer to the outer function's stack frame for those captures (but not by-copy captures nor reference captures of references). But we can't assume that the outer function's stack frame will be active when such a lambda is called, only when such a capture is used.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
It's hard to see what optimizations would be possible here. Maybe, if we controlled the copying process, we could delay copying captured values into the block/lambda temporary, and instead just access them directly in the inner function? But the inner function would have to be compiled two ways, one for when the variables have been copied and one when they're still in-place. This is a lot of extra complexity.<br></blockquote><div><br></div><div>One possibly-interesting optimization would be constant propagation of captures: we statically know at the point where we emit the construction of the lambda and its body what all the uses of each capture are, so this isn't the normal somewhat-unbounded problem of constant propagation through class members.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
2. We don't know exactly when or how often the inner function will be called, but there's a definite point in the outer function past which the inner function will no longer be used. This is true of some OpenMP features, "noescape" blocks in ObjC, and noescape blocks/closures in Swift.<br>
<br>
We can optimize this by eliminating copies and indirections: instead of aggressively copying values and local addresses into the block/lambda, we can just store the enclosing frame pointer. If a capture is semantically "by reference", i.e. changes to it in the inner function are reflected in the outer and vice-versa, then we have to temporarily pin it into memory and treat its address as having escaped; we can then just access it relative to the captured frame pointer. If a capture is semantically "by value", i.e. it copies the value of a variable at a specific point, then we just have to keep that current value live for a while in the outer frame; in many cases, we can prove that neither function will modify it, so this can safely degrade to the by-reference implementation.<br>
<br>
3. We know exactly when the inner function will be called and it can be fit into the outer function's CFG. This is true of some OpenMP features and some very specific Swift features.<br>
<br>
We can still do normal data-flow and control-flow analyses between the outer and inner functions; the only constraint is that we (probably) have to treat the beginning and end of the inner function as opaque barriers. So the unlowered function looks something like:<br>
<br>
%initial_result = call %initial_result_t @begin_subfunction(%initial_<wbr>args_t %initial_args)<br>
// subfunction goes here<br>
%final_result = call %final_result_t @end_subfunction(%final_args_t %final_args)<br>
<br>
and eventually that gets lowered into something like:<br>
<br>
%final_result = call %final_result_t @do_with_subfunction(%initial_<wbr>args, @subfunction)<br>
<br>
define %final_args_t @subfunction(%initial_result_t %initial_result) {<br>
// subfunction goes here<br>
ret %final_args_t %final_args<br>
}<br>
<br>
with data flow in and out handled basically like in case #2. This is one class of thing that people want to be able to do with the coroutine proposal.<br>
<span class="HOEnZb"><font color="#888888"><br>
John.<br>
</font></span></blockquote></div><br></div></div>