[LLVMdev] RFC: Recursive inlining

Hal Finkel hfinkel at anl.gov
Thu Feb 5 18:34:47 PST 2015


----- Original Message -----
> From: "James Molloy" <james at jamesmolloy.co.uk>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at google.com>, "pablo barrio" <pablo.barrio at arm.com>
> Sent: Thursday, February 5, 2015 9:18:35 AM
> Subject: Re: RFC: Recursive inlining
> 
> 
> Hi Hal,
> 
> > As we had briefly mentioned on IRC, one way of forming this
> > 'stack', and its associated 'cnt' variable, is to use dynamic
> > stack allocation.
> 
> 
> 
> I hadn't really reached a decision on the mechanics yet. However,
> your suggestion while it can work within the current abilities of
> the IR, has the disadvantage that it is using an extra stack slot
> for the link pointer. I think if we are going to apply this
> optimization, we must make sure we aren't going to blow the stack.
> 
> 
> As users may already be just close enough to the max-stack boundary
> that their applications don't crash, I think we should aim as much
> as possible to keep the stack usage no higher than the recursive
> equivalent. Holding the "state" is the equivalent of a return
> address, so adding a stack link pointer would be the equivalent of a
> frame address I suppose... That could be acceptable?
> 

I agree. We should aim not to increase the amount of stack space used. This will require some target knowledge (because dynamic stack allocation has some target-dependent overhead because of alignment and other ABI requirements).

 -Hal

> 
> As I say, I haven't thought about the deep mechanics just yet.
> 
> 
> James
> 
> On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov >
> wrote:
> 
> 
> ----- Original Message -----
> > From: "James Molloy" < james at jamesmolloy.co.uk >
> > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" < chandlerc at google.com
> > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo barrio"
> > < pablo.barrio at arm.com >
> > Sent: Thursday, February 5, 2015 4:36:16 AM
> > Subject: RFC: Recursive inlining
> > 
> > 
> > Hi,
> > 
> > 
> > The other day on IRC myself, Chandler, Hal and Pablo discussed
> > recursive inlining. I've pondered a bit since then, and thought I'd
> > get some background on the list with the aim of stimulating
> > discussion.
> > 
> > 
> > Our goal is to inline recursive functions. We have several
> > workloads
> > with recursive calls, and inlining them to a certain degree can
> > improve performance by up to 30%.
> > 
> > 
> > Inlining recursive functions is a slightly different problem to
> > inlining non-recursive functions in that the inlining can never
> > terminate, so you need to choose a factor (number of times) to
> > inline. In this respect it is similar to loop unrolling, and
> > Chandler suggested we should model it as such.
> > 
> > 
> > We originally thought he meant by this that the mechanics of
> > inlining
> > would stay the same, but the heuristics would have to be changed.
> > This is the approach taken by Rugina & Rinard [1], which is almost
> > the sum of the literature on this topic I could find, which
> > surprised me. However, the actual suggestion was to remove the
> > recursion entirely, and replace it with a loop.
> > 
> > 
> > Before I get into the mechanics of this, I'd like to explain what
> > the
> > possible optimization opportunities are for recursive functions
> > turned into a loop.
> > 
> > 
> > 1. Unrolling
> > - This leads to bigger basic blocks (if they can be merged) and
> > exposes more pipeline-level parallelism.
> > - CSE between iterations.
> > - Reduces backbranch overhead (generally negligible impact).
> > 2. Commoning / LICM
> > - Recursion doesn't have a vehicle for moving common code out into
> > a
> > dominating block (compare with LICM for loops).
> > 
> > 
> > OK, first example:
> > 
> > 
> > void ex1(i) {
> > if (i == 0) return;
> > a[i] = i+1;
> > ex1(i-1);
> > }
> > 
> > 
> > gets converted to:
> > 
> > 
> > while (1) {
> > if (i == 0) break;
> > a[i] = i+1;
> > i = i-1;
> > }
> > 
> > 
> > Now, the above is the simplest case and is pure tail recursion. We
> > can eliminate this currently with the TailRecursionElimination
> > pass.
> > The more complex case is where we have non-tail recusion and live
> > variables.
> > 
> > 
> > void ex2(i) {
> > if (i == 0) return;
> > j = a[i];
> > ex2(i-1)
> > b[j] = i;
> > }
> > 
> > 
> > This requires a stack being modelled and could be converted to:
> > 
> > 
> > cnt = 0;
> > while (1) {
> > if (i == 0) break;
> > j = a[i];
> > stack[cnt++] = j;
> > stack[cnt++] = i;
> > ++cnt;
> > --i;
> > }
> > while (cnt) {
> > i = stack[--cnt];
> > j = stack[--cnt];
> > b[j] = i;
> > }
> > 
> > 
> > The above is as far as the discussion got on IRC. The following are
> > my half-baked thoughts.
> > 
> > 
> > The above transform works because the access pattern to the stack
> > is
> > strictly linear. Pushes are followed by pops. But I don't think
> > this
> > is the most interesting case for recursion in real programs. I see
> > (qualitatively, poorly) 3 reasons for recursion in code that should
> > go fast:
> > 
> > 
> > 1. A fibonnacci function in a microbenchmark or compiler shootout.
> > I
> > think we can safely ignore this usecase...
> > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
> > 3. Traversing a data structure, where the recursion is simply to
> > get
> > to the leaves which is where all the fun stuff happens.
> > 
> > 
> > Points 2 and 3 both share a trait that they're dividing/fanning out
> > at each step:
> > 
> > 
> > void ex3(i) {
> > if (is_base(i)) {
> > f();
> > return;
> > }
> > g(i);
> > ex3(i / 2);
> > ex3(i / 2 + 1);
> > }
> > 
> > 
> > The access pattern of such a function is not linear. It is a
> > pre-order walk of a (binary, in this case) tree. As such we can't
> > use the two-loop transform above, but would have to form one loop
> > and just manually implement the stack operations:
> > 
> > 
> > cnt = 0
> > stack[cnt++] = i;
> > while (1) {
> > i = stack[--cnt];
> > if (is_base(i)) {
> > f(); continue;
> > }
> > g(i);
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = i / 2;
> > }
> > 
> > 
> > OK, it's still a well-formed loop, we can still do useful
> > operations
> > on it like LICM or, if it's profitable, unrolling.
> > 
> > 
> > Now what about a more complex case, like in the FFT butterfly
> > algorithm:
> > 
> > 
> > void ex4(i) {
> > if (is_base(i)) {
> > f(i); return;
> > }
> > g(i);
> > ex4(i / 2);
> > ex4(i / 2 + 1);
> > h(i);
> > }
> > 
> > 
> > Here, we have to do work after the last recursive call. This
> > presents
> > a problem, because we don't have any of the caller's context when
> > dealing with a node. Is a node N the first recursive call? or the
> > second? Call-stack recursion can save two things: the state of
> > variables and the return address which serves as another kind of
> > state. The obvious answer is to turn this into a sort of state
> > machine:
> > 
> > 
> > cnt = 0
> > stack[cnt++] = i;
> > stack[cnt++] = STATE_1;
> > while (1) {
> > state = stack[--cnt];
> > i = stack[--cnt];
> > 
> > 
> > if (state == STATE_FINISH) {
> > h(i);
> > continue;
> > }
> > 
> > 
> > if (is_base(i)) {
> > f(); continue;
> > }
> > 
> > 
> > g(i);
> > stack[cnt++] = i; // Push ourself again, in the final state this
> > time
> > stack[cnt++] = STATE_FINISH;
> > stack[cnt++] = i / 2 + 1;
> > stack[cnt++] = STATE_1;
> > stack[cnt++] = i / 2;
> > stack[cnt++] = STATE_1;
> > }
> 
> As we had briefly mentioned on IRC, one way of forming this 'stack',
> and its associated 'cnt' variable, is to use dynamic stack
> allocation. You can do this in general at the IR level (using
> allocas that just happen not to be in the entry block). If each
> stack block holds a pointer to the previous one (as a linked list),
> then you just need to maintain a pointer to the current block, and
> you don't need a separate 'cnt' variable (you might also keep a
> pointer to the next block for allocation reuse). Is that what you
> had in mind?
> 
> -Hal
> 
> > 
> > 
> > This solution can be generalised for if there is code in between
> > the
> > recursive calls (just add more states). It's getting more complex
> > in
> > terms of control flow though - the question is, is it worth it?
> > 
> > 
> > Unrolling this loop is probably not going to usefully create larger
> > basic blocks due to how complex the control flow is. I suppose we
> > have the opportunity, if we unroll, of CSE and jump threading
> > values
> > through equivalent blocks that may gain us time. However, we can
> > still do LICM which is probably a major improvement, and there are
> > no calls/returns to speculate.
> > 
> > 
> > This approach would have to be extended if the recursive calls
> > didn't
> > return void (for example it was a recursive reduction), but it's
> > not
> > impossible.
> > 
> > 
> > Sorry for the long post. What do people think?
> > 
> > 
> > Cheers,
> > 
> > 
> > James
> > 
> > 
> > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory



More information about the llvm-dev mailing list