[LLVMdev] RFC: Recursive inlining

Jeremy Lakeman Jeremy.Lakeman at gmail.com
Thu Feb 19 22:39:32 PST 2015


The manual transformation of this form seems to work well enough for me;

unsigned fibonacci(unsigned n){
  if (n <= 1)
    return n;
  return fibonacci(n - 1) + fibonacci(n - 2);
}

time fib(40)
real    0m1.100s
user    0m1.096s
sys    0m0.004s

after;

unsigned fibonacci(unsigned argn){
  unsigned stack[1024];
  unsigned pos=0;
  unsigned n=argn;
  unsigned ret;
  unsigned tmp1;
  stack[pos++]=0;

start:
  if (n <= 1){
    ret=n;
    goto pop;
  }

  stack[pos++]=n; // push n as it's live
  stack[pos++]=1; // push state
  n= n - 1; // set call argument
  goto start;

pop:
  // pop state and jump
  switch(stack[--pos]){
    case 0:
      return ret;
    case 1:
      n=stack[--pos]; // pop n
      stack[pos++]=ret; // push ret as it's live
      stack[pos++]=2; // push state
      n= n - 2; // set call argument
      goto start;
    case 2:
      tmp1 = stack[--pos]; // pop temporary return value
      ret = tmp1 + ret;
      goto pop;
  }
}

time fib(40)
real    0m0.772s
user    0m0.772s
sys    0m0.003s


But with your 3 return example the stack based code is ending up slower
after optimisations;
...
  if (n == 0){
    ret=0;
    goto pop;
  }
  if (n == 1){
    ret=1;
    goto pop;
  }
...

time fib(40)
real    0m1.233s
user    0m1.233s
sys    0m0.000s


On Thu, Feb 19, 2015 at 7:27 AM, Pablo Barrio <Pablo.Barrio at arm.com> wrote:

> Hi,
>
> Apologies for the very late response.
>
> We have manually tried the idea with a very simple Fibonacci sequence
> code. While being very very simple, the recursion cannot be handled by TRE.
> Because there are two recursive callsites, it also needs to keep some sort
> of state across iterations of the "while(stack not empty)" loop.
>
> We get between 2.5 and 8x slowdowns depending on which processor, and the
> branch mispredictions are huge (specially in x86: around 100 times higher).
> We are replacing the function call mechanism by a stack + switch statement
> to mimic the control flow for callsites. My understanding is that the
> branches in the switch are hardly predictable: they depend on a value taken
> from the stack at each iteration. Things could get harder as we increase
> the number of recursive callsites in a function (we increase the size of
> the switch).
>
> Also, virtually all processors beyond a certain level of sophistication
> (let's say, meant to run an OS) feature a return-stack prediction
> mechanism. This results in highly-efficient branch prediction for
> calls+returns. Since we are transforming the "call jumps" into "switch
> jumps", we are probably misusing that piece of hardware.
>
> Of course, this is not to say that this is a dead end. I paste the code
> here (original and "unrolled" version) so that you see the problem, and
> maybe you can spot something that I missed. Put your favorite queue
> implementation in there.
>
> Cheers,
> Pablo
>
>
> unsigned fibonacci(unsigned n){
>
>   if (n == 0)
>     return 0;
>   else if (n == 1)
>     return 1;
>   else
>     return fibonacci(n - 1) + fibonacci(n - 2);
> }
>
> unsigned fibonacci_opt(unsigned n){
>
>   // S0 = before 1st call, S1 = after 1st call, S2 = after 2nd call
>   enum {S0, S1, S2};
>
>   Stack s;
>   int ret;
>   init(&s, 256);
>   push(&s, n);
>   push(&s, S0);
>
>   while (!empty(s)){
>
>     /* print(&s); */
>
>     int state = pop(&s);
>     n = pop(&s);
>
>     int a, b;
>
>     switch(state){
>     case S0:
>
>       if (n <= 0)
>         ret = 0;
>
>       else if (n == 1)
>         ret = 1;
>
>       else{
>         // On "fake return" we need to move to the end of the first
> "recursive call".
>         push(&s, n);
>         push(&s, S1);
>
>         // Go for the first "recursive call"
>         push(&s, n - 1);
>         push(&s, S0);
>       }
>       break;
>
>     case S1:
>       // On "fake return", we want to recover the return of the previous
> call.
>       push(&s, ret);
>       push(&s, S2);
>
>       // Go for the second "recursive call"
>       push(&s, n - 2);
>       push(&s, S0);
>       break;
>
>     case S2:
>       ret += n;
>     }
>   }
>
>   delete(&s);
>   return ret;
> }
>
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: 06 February 2015 02:35
> To: James Molloy
> Cc: llvmdev at cs.uiuc.edu; Chandler Carruth; Pablo Barrio
> Subject: Re: RFC: Recursive inlining
>
> ----- Original Message -----
> > From: "James Molloy" <james at jamesmolloy.co.uk>
> > To: "Hal Finkel" <hfinkel at anl.gov>
> > Cc: llvmdev at cs.uiuc.edu, "Chandler Carruth" <chandlerc at google.com>,
> > "pablo barrio" <pablo.barrio at arm.com>
> > Sent: Thursday, February 5, 2015 9:18:35 AM
> > Subject: Re: RFC: Recursive inlining
> >
> >
> > Hi Hal,
> >
> > > As we had briefly mentioned on IRC, one way of forming this 'stack',
> > > and its associated 'cnt' variable, is to use dynamic stack
> > > allocation.
> >
> >
> >
> > I hadn't really reached a decision on the mechanics yet. However, your
> > suggestion while it can work within the current abilities of the IR,
> > has the disadvantage that it is using an extra stack slot for the link
> > pointer. I think if we are going to apply this optimization, we must
> > make sure we aren't going to blow the stack.
> >
> >
> > As users may already be just close enough to the max-stack boundary
> > that their applications don't crash, I think we should aim as much as
> > possible to keep the stack usage no higher than the recursive
> > equivalent. Holding the "state" is the equivalent of a return address,
> > so adding a stack link pointer would be the equivalent of a frame
> > address I suppose... That could be acceptable?
> >
>
> I agree. We should aim not to increase the amount of stack space used.
> This will require some target knowledge (because dynamic stack allocation
> has some target-dependent overhead because of alignment and other ABI
> requirements).
>
>  -Hal
>
> >
> > As I say, I haven't thought about the deep mechanics just yet.
> >
> >
> > James
> >
> > On Thu Feb 05 2015 at 2:48:48 PM Hal Finkel < hfinkel at anl.gov >
> > wrote:
> >
> >
> > ----- Original Message -----
> > > From: "James Molloy" < james at jamesmolloy.co.uk >
> > > To: llvmdev at cs.uiuc.edu , "Chandler Carruth" < chandlerc at google.com
> > > >, "Hal Finkel" < hfinkel at anl.gov >, "pablo barrio"
> > > < pablo.barrio at arm.com >
> > > Sent: Thursday, February 5, 2015 4:36:16 AM
> > > Subject: RFC: Recursive inlining
> > >
> > >
> > > Hi,
> > >
> > >
> > > The other day on IRC myself, Chandler, Hal and Pablo discussed
> > > recursive inlining. I've pondered a bit since then, and thought I'd
> > > get some background on the list with the aim of stimulating
> > > discussion.
> > >
> > >
> > > Our goal is to inline recursive functions. We have several workloads
> > > with recursive calls, and inlining them to a certain degree can
> > > improve performance by up to 30%.
> > >
> > >
> > > Inlining recursive functions is a slightly different problem to
> > > inlining non-recursive functions in that the inlining can never
> > > terminate, so you need to choose a factor (number of times) to
> > > inline. In this respect it is similar to loop unrolling, and
> > > Chandler suggested we should model it as such.
> > >
> > >
> > > We originally thought he meant by this that the mechanics of
> > > inlining would stay the same, but the heuristics would have to be
> > > changed.
> > > This is the approach taken by Rugina & Rinard [1], which is almost
> > > the sum of the literature on this topic I could find, which
> > > surprised me. However, the actual suggestion was to remove the
> > > recursion entirely, and replace it with a loop.
> > >
> > >
> > > Before I get into the mechanics of this, I'd like to explain what
> > > the possible optimization opportunities are for recursive functions
> > > turned into a loop.
> > >
> > >
> > > 1. Unrolling
> > > - This leads to bigger basic blocks (if they can be merged) and
> > > exposes more pipeline-level parallelism.
> > > - CSE between iterations.
> > > - Reduces backbranch overhead (generally negligible impact).
> > > 2. Commoning / LICM
> > > - Recursion doesn't have a vehicle for moving common code out into a
> > > dominating block (compare with LICM for loops).
> > >
> > >
> > > OK, first example:
> > >
> > >
> > > void ex1(i) {
> > > if (i == 0) return;
> > > a[i] = i+1;
> > > ex1(i-1);
> > > }
> > >
> > >
> > > gets converted to:
> > >
> > >
> > > while (1) {
> > > if (i == 0) break;
> > > a[i] = i+1;
> > > i = i-1;
> > > }
> > >
> > >
> > > Now, the above is the simplest case and is pure tail recursion. We
> > > can eliminate this currently with the TailRecursionElimination pass.
> > > The more complex case is where we have non-tail recusion and live
> > > variables.
> > >
> > >
> > > void ex2(i) {
> > > if (i == 0) return;
> > > j = a[i];
> > > ex2(i-1)
> > > b[j] = i;
> > > }
> > >
> > >
> > > This requires a stack being modelled and could be converted to:
> > >
> > >
> > > cnt = 0;
> > > while (1) {
> > > if (i == 0) break;
> > > j = a[i];
> > > stack[cnt++] = j;
> > > stack[cnt++] = i;
> > > ++cnt;
> > > --i;
> > > }
> > > while (cnt) {
> > > i = stack[--cnt];
> > > j = stack[--cnt];
> > > b[j] = i;
> > > }
> > >
> > >
> > > The above is as far as the discussion got on IRC. The following are
> > > my half-baked thoughts.
> > >
> > >
> > > The above transform works because the access pattern to the stack is
> > > strictly linear. Pushes are followed by pops. But I don't think this
> > > is the most interesting case for recursion in real programs. I see
> > > (qualitatively, poorly) 3 reasons for recursion in code that should
> > > go fast:
> > >
> > >
> > > 1. A fibonnacci function in a microbenchmark or compiler shootout.
> > > I
> > > think we can safely ignore this usecase...
> > > 2. Divide and conquer algorithms. Quicksort, FFT butterflies, etc.
> > > 3. Traversing a data structure, where the recursion is simply to get
> > > to the leaves which is where all the fun stuff happens.
> > >
> > >
> > > Points 2 and 3 both share a trait that they're dividing/fanning out
> > > at each step:
> > >
> > >
> > > void ex3(i) {
> > > if (is_base(i)) {
> > > f();
> > > return;
> > > }
> > > g(i);
> > > ex3(i / 2);
> > > ex3(i / 2 + 1);
> > > }
> > >
> > >
> > > The access pattern of such a function is not linear. It is a
> > > pre-order walk of a (binary, in this case) tree. As such we can't
> > > use the two-loop transform above, but would have to form one loop
> > > and just manually implement the stack operations:
> > >
> > >
> > > cnt = 0
> > > stack[cnt++] = i;
> > > while (1) {
> > > i = stack[--cnt];
> > > if (is_base(i)) {
> > > f(); continue;
> > > }
> > > g(i);
> > > stack[cnt++] = i / 2 + 1;
> > > stack[cnt++] = i / 2;
> > > }
> > >
> > >
> > > OK, it's still a well-formed loop, we can still do useful operations
> > > on it like LICM or, if it's profitable, unrolling.
> > >
> > >
> > > Now what about a more complex case, like in the FFT butterfly
> > > algorithm:
> > >
> > >
> > > void ex4(i) {
> > > if (is_base(i)) {
> > > f(i); return;
> > > }
> > > g(i);
> > > ex4(i / 2);
> > > ex4(i / 2 + 1);
> > > h(i);
> > > }
> > >
> > >
> > > Here, we have to do work after the last recursive call. This
> > > presents a problem, because we don't have any of the caller's
> > > context when dealing with a node. Is a node N the first recursive
> > > call? or the second? Call-stack recursion can save two things: the
> > > state of variables and the return address which serves as another
> > > kind of state. The obvious answer is to turn this into a sort of
> > > state
> > > machine:
> > >
> > >
> > > cnt = 0
> > > stack[cnt++] = i;
> > > stack[cnt++] = STATE_1;
> > > while (1) {
> > > state = stack[--cnt];
> > > i = stack[--cnt];
> > >
> > >
> > > if (state == STATE_FINISH) {
> > > h(i);
> > > continue;
> > > }
> > >
> > >
> > > if (is_base(i)) {
> > > f(); continue;
> > > }
> > >
> > >
> > > g(i);
> > > stack[cnt++] = i; // Push ourself again, in the final state this
> > > time stack[cnt++] = STATE_FINISH; stack[cnt++] = i / 2 + 1;
> > > stack[cnt++] = STATE_1; stack[cnt++] = i / 2; stack[cnt++] =
> > > STATE_1; }
> >
> > As we had briefly mentioned on IRC, one way of forming this 'stack',
> > and its associated 'cnt' variable, is to use dynamic stack allocation.
> > You can do this in general at the IR level (using allocas that just
> > happen not to be in the entry block). If each stack block holds a
> > pointer to the previous one (as a linked list), then you just need to
> > maintain a pointer to the current block, and you don't need a separate
> > 'cnt' variable (you might also keep a pointer to the next block for
> > allocation reuse). Is that what you had in mind?
> >
> > -Hal
> >
> > >
> > >
> > > This solution can be generalised for if there is code in between the
> > > recursive calls (just add more states). It's getting more complex in
> > > terms of control flow though - the question is, is it worth it?
> > >
> > >
> > > Unrolling this loop is probably not going to usefully create larger
> > > basic blocks due to how complex the control flow is. I suppose we
> > > have the opportunity, if we unroll, of CSE and jump threading values
> > > through equivalent blocks that may gain us time. However, we can
> > > still do LICM which is probably a major improvement, and there are
> > > no calls/returns to speculate.
> > >
> > >
> > > This approach would have to be extended if the recursive calls
> > > didn't return void (for example it was a recursive reduction), but
> > > it's not impossible.
> > >
> > >
> > > Sorry for the long post. What do people think?
> > >
> > >
> > > Cheers,
> > >
> > >
> > > James
> > >
> > >
> > > [1]: http://link.springer.com/ chapter/10.1007/3-540-45574-4_ 3
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
> >
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
>
> -- IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium.  Thank you.
>
> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> Registered in England & Wales, Company No:  2557590
> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> Registered in England & Wales, Company No:  2548782
>
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150220/b11206e4/attachment.html>


More information about the llvm-dev mailing list