[LLVMdev] RFC: Recursive inlining

James Molloy james at jamesmolloy.co.uk
Thu Feb 5 02:36:16 PST 2015


Hi,

The other day on IRC, Chandler, Hal, Pablo and I discussed recursive
inlining. I've pondered a bit since then, and thought I'd put some
background on the list with the aim of stimulating discussion.

Our goal is to inline recursive functions. We have several workloads with
recursive calls, and inlining them to a certain degree can improve
performance by up to 30%.

Inlining recursive functions is a slightly different problem to inlining
non-recursive functions, in that the inlining process would never terminate
on its own, so you need to choose a factor (a number of times) to inline.
In this respect it is similar to loop unrolling, and Chandler suggested we
should model it as such.
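
To make the analogy concrete, here's a sketch (mine, not from the IRC
discussion) of what "inlining to a factor of 2" might look like on a toy
function, with a residual call left to cover the remaining depth:

    void work(int i);  /* hypothetical stand-in for the real body */

    void foo(int i) {
      if (i == 0) return;
      work(i);
      /* First inlined copy of foo(i - 1): */
      if (i - 1 == 0) return;
      work(i - 1);
      /* Second inlined copy of foo(i - 2): */
      if (i - 2 == 0) return;
      work(i - 2);
      /* Residual call covers whatever depth remains. */
      foo(i - 3);
    }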

We originally thought he meant that the mechanics of inlining would stay
the same and only the heuristics would change. That is the approach taken
by Rugina & Rinard [1], which is almost the entire literature I could find
on this topic (which surprised me). However, the actual suggestion was to
remove the recursion entirely, and replace it with a loop.

Before I get into the mechanics of this, I'd like to explain what the
possible optimization opportunities are for recursive functions turned into
a loop.

1. Unrolling
  - This leads to bigger basic blocks (if they can be merged) and exposes
more pipeline-level parallelism.
  - CSE between iterations.
  - Reduces backbranch overhead (generally negligible impact).
2. Commoning / LICM
  - Recursion doesn't have a vehicle for moving common code out into a
dominating block (compare with LICM for loops); see the sketch after this
list.
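
To illustrate the commoning point with a contrived sketch (scale and a[]
are hypothetical): in the recursive form the load of scale can't be
hoisted, because each activation re-executes it and the call may clobber
it; once the recursion is a loop, LICM has a preheader to hoist it into:

    int scale;
    int a[100];

    void rec(int i) {
      if (i == 0) return;
      a[i] *= scale;      /* reloaded on every call */
      rec(i - 1);
    }

    void as_loop(int i) {
      int s = scale;      /* hoisted once into the "preheader" */
      while (i != 0) {
        a[i] *= s;
        i = i - 1;
      }
    }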

OK, first example:

    void ex1(int i) {
      if (i == 0) return;
      a[i] = i+1;
      ex1(i-1);
    }

gets converted to:

    while (1) {
      if (i == 0) break;
      a[i] = i+1;
      i = i-1;
    }

Now, the above is the simplest case and is pure tail recursion. We can
already eliminate this with the TailRecursionElimination pass. The more
complex case is where we have non-tail recursion and live variables:

    void ex2(int i) {
      if (i == 0) return;
      int j = a[i];
      ex2(i-1);
      b[j] = i;
    }

This requires modelling a stack, and could be converted to:

    cnt = 0;
    while (1) {
      if (i == 0) break;
      j = a[i];
      stack[cnt++] = j;
      stack[cnt++] = i;
      --i;
    }
    while (cnt) {
      i = stack[--cnt];
      j = stack[--cnt];
      b[j] = i;
    }
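
Fleshed out into compilable C (the types, fixed-size stack and array sizes
are my additions; a real transform would need to size or grow the stack
safely):

    int a[100], b[100];

    void ex2_loop(int i) {
      int stack[200];       /* two slots per "frame"; sized for the sketch */
      int cnt = 0;
      while (1) {
        if (i == 0) break;
        int j = a[i];
        stack[cnt++] = j;   /* save this frame's live variables */
        stack[cnt++] = i;
        --i;
      }
      while (cnt) {         /* unwind: post-call work runs in LIFO order */
        i = stack[--cnt];
        int j = stack[--cnt];
        b[j] = i;
      }
    }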

The above is as far as the discussion got on IRC. The following are my
half-baked thoughts.

The above transform works because the access pattern to the stack is
strictly linear: pushes are followed by pops. But I don't think this is the
most interesting case for recursion in real programs. Qualitatively, I see
three main reasons for recursion in code that should go fast:

  1. A Fibonacci function in a microbenchmark or compiler shootout. I
think we can safely ignore this use case...
  2. Divide and conquer algorithms: quicksort, FFT butterflies, etc.
  3. Traversing a data structure, where the recursion is simply to get to
the leaves, which is where all the fun stuff happens.

Points 2 and 3 share the trait that they divide (fan out) at each step:

    void ex3(int i) {
      if (is_base(i)) {
        f();
        return;
      }
      g(i);
      ex3(i / 2);
      ex3(i / 2 + 1);
    }

The access pattern of such a function is not linear. It is a pre-order walk
of a (binary, in this case) tree. As such we can't use the two-loop
transform above, but would have to form one loop and just manually
implement the stack operations:

    cnt = 0;
    stack[cnt++] = i;
    while (cnt) {   // loop until the explicit stack is empty
      i = stack[--cnt];
      if (is_base(i)) {
        f(); continue;
      }
      g(i);
      stack[cnt++] = i / 2 + 1;   // second call, popped second
      stack[cnt++] = i / 2;       // first call, popped first
    }

OK, it's still a well-formed loop, and we can still do useful operations on
it like LICM or, if it's profitable, unrolling.

Now what about a more complex case, like in the FFT butterfly algorithm:

    void ex4(int i) {
      if (is_base(i)) {
        f(i); return;
      }
      g(i);
      ex4(i / 2);
      ex4(i / 2 + 1);
      h(i);
    }

Here, we have to do work after the last recursive call. This presents a
problem, because we don't have any of the caller's context when dealing
with a node: is a given node being visited for the first recursive call, or
the second? Call-stack recursion saves two things: the state of variables,
and the return address, which serves as another kind of state. The obvious
answer is to turn this into a sort of state machine:

    cnt = 0;
    stack[cnt++] = i;
    stack[cnt++] = STATE_1;
    while (cnt) {
      state = stack[--cnt];
      i = stack[--cnt];

      if (state == STATE_FINISH) {
        h(i);
        continue;
      }

      if (is_base(i)) {
        f(i); continue;
      }

      g(i);
      stack[cnt++] = i; // Push ourself again, in the final state this time
      stack[cnt++] = STATE_FINISH;
      stack[cnt++] = i / 2 + 1;
      stack[cnt++] = STATE_1;
      stack[cnt++] = i / 2;
      stack[cnt++] = STATE_1;
    }

This solution generalises to code in between the recursive calls: just add
more states, as in the sketch below. It's getting more complex in terms of
control flow, though; the question is, is it worth it?
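
Concretely, here's what "add more states" might look like (my sketch; ex5,
m() and the stubs are hypothetical). A call m(i) between the two recursive
calls gets its own resume state:

    void f(int i); void g(int i); void h(int i); void m(int i);
    int is_base(int i);

    enum { STATE_1, STATE_MID, STATE_FINISH };

    /* Original:  if (is_base(i)) { f(i); return; }
                  g(i); ex5(i / 2); m(i); ex5(i / 2 + 1); h(i);  */
    void ex5_loop(int i) {
      int stack[256];              /* generously sized for the sketch */
      int cnt = 0;
      stack[cnt++] = i;
      stack[cnt++] = STATE_1;
      while (cnt) {
        int state = stack[--cnt];
        i = stack[--cnt];

        if (state == STATE_FINISH) { h(i); continue; }

        if (state == STATE_MID) {
          m(i);                    /* the code in between the calls */
          stack[cnt++] = i;        /* come back for h(i) */
          stack[cnt++] = STATE_FINISH;
          stack[cnt++] = i / 2 + 1;
          stack[cnt++] = STATE_1;
          continue;
        }

        if (is_base(i)) { f(i); continue; }

        g(i);
        stack[cnt++] = i;          /* come back for m(i) */
        stack[cnt++] = STATE_MID;
        stack[cnt++] = i / 2;
        stack[cnt++] = STATE_1;
      }
    }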

Unrolling this loop is probably not going to usefully create larger basic
blocks, because the control flow is so complex. If we unroll, I suppose we
do get the opportunity for CSE and for jump-threading values through
equivalent blocks, which may gain us some time. However, we can still do
LICM, which is probably a major improvement, and there are no calls/returns
to speculate across.

This approach would have to be extended if the recursive calls didn't
return void (for example, if it were a recursive reduction), but that's not
impossible; see the sketch below.
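
For example (again my sketch), a reduction like sum(i) = leaf(i) for base
cases, and sum(i / 2) + sum(i / 2 + 1) otherwise, can carry a second
explicit stack of pending values:

    int is_base(int i);
    int leaf(int i);

    enum { STATE_1, STATE_COMBINE };

    int sum_loop(int i) {
      int stack[256], vals[64];    /* control stack and value stack */
      int cnt = 0, vcnt = 0;
      stack[cnt++] = i;
      stack[cnt++] = STATE_1;
      while (cnt) {
        int state = stack[--cnt];
        i = stack[--cnt];
        if (state == STATE_COMBINE) {
          /* Both children are done; combine their results. */
          int r = vals[--vcnt], l = vals[--vcnt];
          vals[vcnt++] = l + r;
        } else if (is_base(i)) {
          vals[vcnt++] = leaf(i);  /* the "return value" */
        } else {
          stack[cnt++] = i;              /* combine after both children */
          stack[cnt++] = STATE_COMBINE;
          stack[cnt++] = i / 2 + 1;      /* second child, popped second */
          stack[cnt++] = STATE_1;
          stack[cnt++] = i / 2;          /* first child, popped first */
          stack[cnt++] = STATE_1;
        }
      }
      return vals[0];              /* the lone remaining value */
    }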

Sorry for the long post. What do people think?

Cheers,

James

[1]: http://link.springer.com/chapter/10.1007/3-540-45574-4_3