[LLVMdev] RFC: ARM Strategy for adding tail call optimization to thumb1 targets
Bjoern Haase
bjoern.m.haase at web.de
Sun Dec 28 13:49:00 PST 2014
Hello,
after a while of code-reading and analysis I have come up with a plan on
how to add tail call optimizations to thumb1 targets. As a beginner
within LLVM, I would very much appreciate feedback and advice before
proceeding further.
I identified two main complexities rendering tail call optimization more
difficult for thumb1 targets (this is also already stated in some very
old comment within ARMTargetLowering::IsEligibleForTailCallOptimization
that does not seem to mach current state of the code at some minor points):
1.) Unlike the BL instruction the B instruction only reaches targets in
up to ~2k byte distance. This means that for true tail calls the jump
target address needs to be placed in a register.
2.) We need a scratch register to restore LR contents in the epilogue if
LR has previously been pushed on the stack.
My plan is to implement the following:
1.) When lowering tail calls during DAG generation, arrange for the call
target adress always being loaded into a register. So that actually only
ARM::TCRETURNri pseudo opcodes will be generated for tail calls.
(Penalty of this is will be one additional mov r12,r3 in case of exactly
3 parameters being passed on the stack for constant addresses, see below).
2.) In case that r0 up to r3 are used for parameter passing. I noticed,
that the thumb1 register allocator settings will not be able to allocate
a suitable register. My plan is to somehow force the tail call jump
target into hard reg r12 in this case before register allocation. If at
least r3 is free, the register allocator will be able to find a register
for the TCRETURNri pseudo instruction and we don't have a problem until
epilogue generation.
QUESTION 1: How would be the "right" way to force the target address
into the hard reg r12 in case that r0 up to r3 are used for params?
Should this check and the "force to hard reg r12" code go to
ARMTargetLowering::LowerCall or should this be done within the "legalize
DAG" pass that seems to exist somewhere. Is there an example on how to
force a value into a hard reg?
3.) After Steps 1.) and 2.), Thumb1FrameLowering::emitEpilogue will find
TCRETURNri pseudo instructions for tail calls. We could detect this
easily and probably handle this case within a new private member
Thumb1FrameLowering::emitTailCallEpilogue ().
Thumb1FrameLowering::emitTailCallEpilogue () will have to arrange for
- updating the stack pointer,
- restoring GP registers pushed on the stack,
- restoring LR and
- jumping to the target address.
The problem is that restoring LR requires a scratch register because pop
{lr} is not available. In any way, the target address will be available
in a register.
Within Thumb1FrameLowering::emitTailCallEpilogue (), I plan to
distinguish only 3 cases:
a) target address is in r12. This means, that we would have to generate
*very* complicated code for restoring LR *before* poping the callee
save-registers r4 ... r7, poping r4...r7, re-adjusting SP for the unused
stack slot used for LR and then using BX r12. Instead of this very
complicated code, I'd rather pop all callee-save-registers r4...r7
except LR and use a BLX r12; pop {pc} sequence instead. I.e. the stack
frame would grow by 4 bytes for the LR not yet popped. Note that this
would prevent tail call optimizations as soon as one parameter is passed
on the stack. (see side-note below on this issue).
b) target address of TCRETURNri is in r3. I would use "mov r12,r3" for
freeing r3 as scratch register, pop all registers except LR and issue a
pop {r3}; mov LR,r3; BX r12 sequence.
c) target address of TCRETURNri is not in r3. Assert that r3 is not used
for parameters. Use r3 as scratch register as in case b) and proceed.
This code would be slightly sub-optimal in the case b) handling exactly
3 parameters passed in registers if the target address is constant. In
this case, loading the target address into r3 could have been done
*after* restoring LR contents. However, the penalty in this case would
be only the mov r12,r3 instruction and the complexity of handling
constant and variable adresses differently and loading the operands to
registers would not go into Thumb1FrameLowering::emitTailCallEpilogue ().
QUESTION 2: Is this a good idea or should I rather choose the *very*
complicated code version or should the case a) be split into aa) for no
parameters passed on the stack (using BLX r12; pop {pc} ) and ab) using
the *very* complicated code version
QUESTION 3:
I also would appreciate suggestions on the development enviroment to
use. I so far build LLC from the command shell and use Eclipse for
debugging and stepping through the code using the local gdb. I'm a bit
unhappy of my environment because doing so is extremely slow on my
machine. The Eclipse indexer is incredibly slow and frequently
references are not correctly resolved in the editor :-(. Also spawning
gdb with the llc executable from eclipse takes about 2 minutes for each
run ... Waiting for a full build being finished (with clang and the
other stuff) takes half a day on my machine. When running make opt and
llc seem to be built first by make, so I'm currently canceling the
build. But also when intterrupting make after linking llc each build of
llc takes about 5 minutes on my machine ... Is there a way for speeding
up builds for development? Is there an environment recommended for
working on LLVM? (I did not find any hint in the documentation).
Yours,
Björn Haase
Side-note:
It is my understanding that the code
extern void peter(int a,int b, int c, int d, int e);
void
hugo (int a,int b, int c, int d, int e)
{
peter (a,b,c,d,e);
return;
}
should in fact be eligible for tail call optimization. However, I
notice, that within ARMTargetLowering::IsEligibleForTailCallOptimization
the check for matching stack offsets for parameter #5 fails so that the
function is marked not to be eligible for tail call optimizations.
The relevant code portion within
ARMTargetLowering::IsEligibleForTailCallOptimization is:
} else if (!VA.isRegLoc()) {
if (!MatchingStackOffset(Arg, VA.getLocMemOffset(), Flags,
MFI, MRI, TII))
return false;
}
MatchingStackOffset returns false for the 5th parameter. In my eyes this
seems to be a bug and not intentional.? Should I file a bug ?
More information about the llvm-dev
mailing list