[PATCH] D53706: [RecursionStackElimination]: Pass to eliminate recursions

Tue Nov 13 06:00:32 PST 2018

john.brawn added a comment.

A few general comments:

As for integrating this into the pass pipeline, I think it probably makes sense to put it just after inlining, as this is kind of like inlining: we're seeing some function calls and eliminating them by putting extra stuff into this function (I experimented with this and it seemed to work). Instead of having the target insert the pass, I think it makes more sense to have a target-dependent heuristic that decides when to do this transformation, with a default implementation of "don't do this" (see next point).

When we do this transformation we are:

- Saving the cost of the call instruction
- Saving the cost of any saving and restoring of registers in the recursed function
- Saving the cost of having to save the unchanging arguments across several calls
- Incurring the cost of managing the call list and free list
- Gaining the opportunity to CSE/LICM subexpressions using the unchanging arguments

so in terms of heuristics we want to do it if (saved_cost + expected_opportunity) > incurred_cost. The saved cost is dependent mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64), so the heuristic should involve some kind of calls to cost functions in the target.
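As a rough sketch of the shape such a heuristic could take (every name here is made up for illustration; none of this is an existing LLVM or target interface, and the cost units are arbitrary):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-target cost estimates. These fields mirror the
   bullet list above; they are not real LLVM cost hooks. */
struct recursion_costs {
  unsigned call_cost;            /* one call instruction */
  unsigned callee_save_cost;     /* saving/restoring callee-saved registers */
  unsigned invariant_arg_cost;   /* keeping unchanging arguments across calls */
  unsigned worklist_cost;        /* managing the call list and free list */
  unsigned expected_opportunity; /* estimated CSE/LICM gain */
};

/* Returns true when (saved_cost + expected_opportunity) > incurred_cost.
   A target that fills in nothing (all zeros) gets "don't do this". */
static bool should_eliminate_recursion(const struct recursion_costs *c) {
  assert(c != 0);
  unsigned saved_cost =
      c->call_cost + c->callee_save_cost + c->invariant_arg_cost;
  unsigned incurred_cost = c->worklist_cost;
  return saved_cost + c->expected_opportunity > incurred_cost;
}
```

With all costs left at zero the comparison is 0 > 0, so the default answer is "no", which matches the default behaviour suggested above.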

The current implementation builds the call list as one entry = one recursive call, so the cost of managing the call list is proportional to the number of recursive calls. We could instead add a single 'chunk' per invocation, sized to hold all of that invocation's recursive calls, e.g.

  #include <alloca.h>  /* alloca */
  #include <string.h>  /* memcpy */

  struct tree {
    int val;
    struct tree *children[4];
  };

  void function_to_optimise(struct tree *p, const int n) {
    if (!p)
      return;

    p->val += n;
    function_to_optimise(p->children[0], n);
    function_to_optimise(p->children[1], n);
    function_to_optimise(p->children[2], n);
    function_to_optimise(p->children[3], n);
  }

  struct chunk {
    struct chunk *next;
    long int idx;
    struct tree *vals[4];
  };
  void function_optimised(struct tree *p, const int n) {
    struct chunk *list = 0;
    struct chunk *freelist = 0;
    struct tree *current_p = p;
    goto first;

    while (list) {
      // move to next chunk if at end
      while (list->idx >= 4) {
        struct chunk *tmp = list;
        list = list->next;
        tmp->next = freelist;
        freelist = tmp;
        if (!list)
          return;
      }
      current_p = list->vals[list->idx++];

    first:
      // early exit
      if (!current_p)
        continue;

      // do the operation (same as in function_to_optimise)
      current_p->val += n;

      // add recursive calls to list
      struct chunk *tmp;
      if (freelist) {
        tmp = freelist;
        freelist = freelist->next;
      } else {
        tmp = alloca(sizeof(struct chunk));
      }
      tmp->idx = 0;
      memcpy(tmp->vals, current_p->children, 4 * sizeof(struct tree *));
      tmp->next = list;
      list = tmp;
    }
  }

We now only have one allocation / list manipulation instead of one per recursive call, though at the loop head we have some extra complexity.

Also I think the current freelist handling isn't quite right. Looking at the generated code, it only checks whether there's a freelist element it can use the _first_ time it adds to the worklist. So if, for example, we have 4 recursive calls to add and 2 freelist entries, it will use only the first freelist entry and do 3 allocations, when it should use both freelist entries and do 2 allocations. (Using the chunked approach would avoid this, as only one chunk is ever added at once.)
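For the one-entry-per-call scheme, the behaviour I'd expect is roughly the following. This is only a sketch: the struct and function names are made up, malloc stands in for whatever stack allocation the pass emits, and the allocations counter exists purely to make the point visible.

```c
#include <assert.h>
#include <stdlib.h>

struct entry {
  struct entry *next;
  void *arg; /* argument of the deferred recursive call */
};

struct worklist {
  struct entry *list;     /* pending calls */
  struct entry *freelist; /* retired entries available for reuse */
  unsigned allocations;   /* bookkeeping for illustration only */
};

/* Push n deferred calls, consulting the freelist for every entry
   rather than only for the first one. */
static void push_calls(struct worklist *w, void **args, unsigned n) {
  assert(w != 0);
  for (unsigned i = 0; i < n; ++i) {
    struct entry *e;
    if (w->freelist) {
      /* reuse a retired entry */
      e = w->freelist;
      w->freelist = e->next;
    } else {
      /* stand-in for the pass's stack allocation */
      e = malloc(sizeof *e);
      assert(e != 0);
      w->allocations++;
    }
    e->arg = args[i];
    e->next = w->list;
    w->list = e;
  }
}
```

Starting with 2 freelist entries and pushing 4 calls, this does 2 allocations and leaves the freelist empty. Since the list is LIFO, the entries come back off in reverse push order; a real implementation would presumably push in reverse if call order matters.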

Repository:
  rL LLVM

https://reviews.llvm.org/D53706