[PATCH] D53706: [RecursionStackElimination]: Pass to eliminate recursions

Wed Nov 14 03:31:50 PST 2018

marels added a comment.

Hi & Thank you for the input:

In https://reviews.llvm.org/D53706#1296831, @john.brawn wrote:

> A few general comments:
>
> [For how to integrate this into the pass pipeline ...]:

I will take a look into this.

> When we do this transformation we are:
> 
> - Saving the cost of the call instruction
> - Saving the cost of any saving and restoring of registers in the recursed function
> - Saving the cost of having to save the unchanging arguments across several calls
> - Incurring the cost of managing the call list and free list
> - Have the opportunity to CSE/LICM subexpressions using the unchanging arguments so in terms of heuristics we want to do it if (saved_cost + expected_opportunity) > incurred_cost, and the saved cost is dependant mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64). So it should involve some kind of calls to cost functions in the target.

This is an interesting idea I will take into account for my next update.

> Also I think the current freelist handling isn't quite right - looking at the generated code it's checking on the _first_ time it adds to the worklist if there's an element in the freelist it can use so e.g. if we have 4 recursive calls to add and 2 freelist entries it will only use the first freelist entry and do 3 allocations, but it should do 2 allocations and use the 2 freelist entries. (Using the chunked approach would avoid this as only one chunk is ever added at once.)

I am not sure what you mean with "4 recursive calls". Do you mean depth or width. The number of alloca's executed depends on the recursion depth and not on the width. Compared to you approach (see next point) currently the items are already alloced in chunks where the "next" pointers are assigned after allocation.

Could you please provide a more details on this. Sorry, I cannot follow your point.

> [chunked approach]

I head a similar idea while implementing this but I did not implement because I thought it required non uniform code generation for the first and the remaining recursive calls. Note in case of 4 recursive calls, like in your example, only three items are maintained in the worklist; the first recursion is never put in the worklist because it can be handled just like tail recursion. However I think you example can be adopted to do this as well.

Thinking of the reducing the overhead for freelist management, I think there is a simple way to adopt the current implementation as well:

Currently the exit path looks like this:

  loopentry:
    ...

  return_path:
    if (worklist == nullptr)
      return;

    mark = marker.isLastItem();
    if (mark) {
      // The values of FIRST_ITEM and  LAST_TEM is computed at compile time (it is a Constant)
      currentlist[LAST_ITEM].next = freelist;
      freelist = currentlist[FIRST_ITEM];
    }

    current_params = worklist.params;
    worklist = worklist.next;
    goto loopentry;

This implementation requires a check for adding to the freelist and a check for termination in each loop iteration. However it is an invariant that:

  (mark == true) -> (listitem->next != nullptr)

Using this information the exitpath can be changed to

  loopentry:
    ...

  return_path:
    mark = marker.isLastItem();
    if (mark) {
      currentlist[LAST_ITEM].next = freelist;
      freelist = currentlist[FIRST_ITEM];
      if (worklist == nullptr)
        return;
    }

    current_params = worklist.params;
    worklist = worklist.next;
    goto loopentry;

In this case it is also save that worklist is not null when fetching the parameters for the next iteration from it. Note that worklist always maintains the information for the **next** iteration.

Comparing this to the your proposal I think it should behave similarly regarding performance. (In compare for inner callse, 2 compares for the last call, one compare to put items into the worklist)
Note that currently the data is allocated in chunks of (n-1) items, where n is the witdh of the recursion. The main difference is the way the list item in the chunk is marked. You use an extra index per chunk, the current implementation some uses information stored in each item within a chunk (Bit in pointer (I can push a prototype implementation for this), extra field in item, using address and stack properties and comparing `worklist < worklist->next`).

Summary:

Alloca Overhead:

- Currently: On alloca the inner structure of the allocated chunk needs to be initialised and creates an overhead (item[0].next = &item[1], ...); This steps are not executed when using items within the freelist.
- Your Approach: On alloca no overhead is created

Insert To Worklist Overhead:

- Current: Pointer manipulation
- Your Approach: Pointer Manipulation and 1 store to initialise the index

Exit Path:

- Currently: Check Return, Handle Freelist, Pointer Manipulation
- Currently: [using the modifications above]: Check Freelist, Check Return [iff last item reached], Pointer Manipulation
- Your Approach: Check Index, Increment Index, Check Return [iff last item reached], Pointer Manipulation [iff last item reached]

Also, in your approach the index can be omitted if the recursion witdh is 2, but similar things are done in the current approach.

I think both approaches are similar regarding performance. I will try this approach and do some measurements, unless you (@john.brawn) already dis some performance tests and provide results.
If you proposal is beneficial or equal in performance I I will adopt it et least for the following reasons:

1. We can get rid of the marker code which complicates the pass and also may be target dependent (Growing, Shrinking Stack; or Stack alignment to tag pointer).
2. Slightly less memory used
3. By changing your increasing to a decreasing pointer, the pass becomes more flexible and can easily adopted to more rare cases like:

  void fn(....) {
    //execute body

    if (...) {
      // three recursions
      fn(...)
      fn(h(...), ...)
      fn(h(...), ...)
    }
    if (...) {
      // five recursions
      fn(...)
      fn(h(...), ...)
      fn(h(...), ...)
      fn(h(...), ...)
      fn(h(...), ...)
    }
  }

Repository:
  rL LLVM

https://reviews.llvm.org/D53706