[PATCH] D53706: [RecursionStackElimination]: Pass to eliminate recursions
John Brawn via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Nov 13 06:00:32 PST 2018
john.brawn added a comment.
A few general comments:
On integrating this into the pass pipeline: I think it makes sense to run it just after inlining, since the transformation is similar to inlining. We see some function calls and eliminate them by putting extra code into this function (I experimented with this placement and it seemed to work). Rather than having the target insert the pass, I think it makes more sense to have a target-dependent heuristic that decides when to do the transformation, with a default implementation of "don't do this" (see next point).
When we do this transformation we are:
- Saving the cost of the call instruction
- Saving the cost of any saving and restoring of registers in the recursed function
- Saving the cost of having to save the unchanging arguments across several calls
- Incurring the cost of managing the call list and free list
- Gaining the opportunity to CSE/LICM subexpressions that use the unchanging arguments
So in terms of heuristics, we want to do it if (saved_cost + expected_opportunity) > incurred_cost. The saved cost depends mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64), so the heuristic should involve calls to cost functions in the target.
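To make the inequality above concrete, here is a minimal sketch in C of what such a heuristic could look like. Everything in it is an assumption for illustration: the struct, its field names, and both functions are hypothetical and are not part of the patch or of any LLVM interface. It only shows the shape of the decision, with a default that never transforms and a cost-based variant that a target might opt into.

```c
#include <stdbool.h>

/* Hypothetical per-call cost estimates that a target would supply.
   None of these names exist in LLVM; they only mirror the list above. */
struct recursion_costs {
    unsigned call_cost;            /* cost of the call instruction */
    unsigned save_restore_cost;    /* saving/restoring callee-saved registers */
    unsigned arg_passing_cost;     /* carrying unchanging arguments across calls */
    unsigned list_mgmt_cost;       /* managing the call list and free list */
    unsigned expected_opportunity; /* estimated gain from CSE/LICM */
};

/* Default implementation: "don't do this". */
bool default_should_eliminate(const struct recursion_costs *c) {
    (void)c;
    return false;
}

/* Cost-based variant: transform only if
   (saved_cost + expected_opportunity) > incurred_cost. */
bool cost_based_should_eliminate(const struct recursion_costs *c) {
    unsigned saved_cost =
        c->call_cost + c->save_restore_cost + c->arg_passing_cost;
    unsigned incurred_cost = c->list_mgmt_cost;
    return saved_cost + c->expected_opportunity > incurred_cost;
}
```

With this shape, a target where register save/restore dominates would report a large save_restore_cost and get the transformation, while targets that provide nothing keep the default and are unaffected.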
The current implementation builds the call list with one entry per recursive call, so the cost of managing the call list is proportional to the number of recursive calls. We could instead add a single 'chunk' that holds as many entries as the function has recursive calls, e.g.
struct tree {
  int val;
  struct tree *children[4];
};
void function_to_optimise(struct tree *p, const int n) {
  if (!p)
    return;
  p->val += n;
  function_to_optimise(p->children[0], n);
  function_to_optimise(p->children[1], n);
  function_to_optimise(p->children[2], n);
  function_to_optimise(p->children[3], n);
}
#include <alloca.h>
#include <string.h>
struct chunk {
  struct chunk *next;
  long int idx;
  struct tree *vals[4];
};
void function_optimised(struct tree *p, const int n) {
  struct chunk *list = 0;
  struct chunk *freelist = 0;
  struct tree *current_p = p;
  // the initial call skips the list lookup and enters the loop body directly
  goto first;
  while (list) {
    // move to next chunk if at end, recycling finished chunks via the freelist
    while (list->idx >= 4) {
      struct chunk *tmp = list;
      list = list->next;
      tmp->next = freelist;
      freelist = tmp;
      if (!list)
        return;
    }
    current_p = list->vals[list->idx++];
  first:
    // early exit
    if (!current_p)
      continue;
    // do the operation
    current_p->val += n;
    // add recursive calls to list: one chunk holds all four
    struct chunk *tmp;
    if (freelist) {
      tmp = freelist;
      freelist = freelist->next;
    } else {
      tmp = alloca(sizeof(struct chunk));
    }
    tmp->idx = 0;
    memcpy(tmp->vals, current_p->children, 4 * sizeof(struct tree *));
    tmp->next = list;
    list = tmp;
  }
}
We now have only one allocation / list manipulation per group of four recursive calls instead of one per recursive call, at the price of some extra complexity at the loop head.
Also, I think the current freelist handling isn't quite right. Looking at the generated code, it checks whether there's a freelist element it can reuse only the _first_ time it adds to the worklist. So, for example, if we have 4 recursive calls to add and 2 freelist entries, it will use only the first freelist entry and do 3 allocations, when it should use both freelist entries and do 2 allocations. (The chunked approach would avoid this, as only one chunk is ever added at a time.)
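As an illustration of the expected behaviour for the one-entry-per-call scheme, the following is a rough sketch in C, not the code the pass generates. The names entry, push_call and push_calls are hypothetical, and malloc stands in for alloca, because memory from alloca in a helper function would be released when the helper returns. The point is only that the freelist is consulted on every push, so 4 calls with 2 freelist entries cost 2 allocations.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical worklist entry: one entry per deferred recursive call. */
struct entry {
    struct entry *next;
    void *arg; /* argument of the deferred call */
};

/* Push one deferred call, reusing a freelist entry when one is available.
   The freelist is checked on every push, not only the first.
   Returns 1 if a new entry had to be allocated, 0 otherwise. */
int push_call(struct entry **list, struct entry **freelist, void *arg) {
    struct entry *e;
    int allocated = 0;
    if (*freelist) {
        e = *freelist;
        *freelist = e->next;
    } else {
        e = malloc(sizeof *e); /* stand-in for alloca in the transformed code */
        if (!e)
            abort();
        allocated = 1;
    }
    e->arg = arg;
    e->next = *list;
    *list = e;
    return allocated;
}

/* Push n deferred calls; returns how many entries were newly allocated. */
int push_calls(struct entry **list, struct entry **freelist,
               void **args, int n) {
    int allocations = 0;
    for (int i = 0; i < n; i++)
        allocations += push_call(list, freelist, args[i]);
    return allocations;
}
```

Starting from a freelist of 2 entries and pushing 4 calls, push_calls returns 2 and leaves the freelist empty, which is the 2-allocation outcome described above.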
Repository:
rL LLVM
https://reviews.llvm.org/D53706