[LLVMdev] RFC: ARM Strategy for adding tail call optimization to thumb1 targets

Sun Dec 28 13:49:00 PST 2014

Hello,

after a while of code-reading and analysis I have come up with a plan on 
how to add tail call optimizations to thumb1 targets. As a beginner 
within LLVM, I would very much appreciate feedback and advice before 
proceeding further.

I identified two main complexities rendering tail call optimization more 
difficult for thumb1 targets (this is also already stated in some very 
old comment within ARMTargetLowering::IsEligibleForTailCallOptimization 
that does not seem to mach current state of the code at some minor points):

1.) Unlike the BL instruction the B instruction only reaches targets in 
up to ~2k byte distance. This means that for true tail calls the jump 
target address needs to be placed in a register.
2.) We need a scratch register to restore LR contents in the epilogue if 
LR has previously been pushed on the stack.

My plan is to implement the following:

1.) When lowering tail calls during DAG generation, arrange for the call 
target adress always being loaded into a register. So that actually only 
ARM::TCRETURNri pseudo opcodes will be generated for tail calls. 
(Penalty of this is will be one additional mov r12,r3 in case of exactly 
3 parameters being passed on the stack for constant addresses, see below).

2.) In case that r0 up to r3 are used for parameter passing. I noticed, 
that the thumb1 register allocator settings will not be able to allocate 
a suitable register. My plan is to somehow force the tail call jump 
target into hard reg r12 in this case before register allocation. If at 
least r3 is free, the register allocator will be able to find a register 
for the TCRETURNri pseudo instruction and we don't have a problem until 
epilogue generation.

QUESTION 1: How would be the "right" way to force the target address 
into the hard reg r12 in case that r0 up to r3 are used for params? 
Should this check and the "force to hard reg r12" code go to 
ARMTargetLowering::LowerCall or should this be done within the "legalize 
DAG" pass that seems to exist somewhere. Is there an example on how to 
force a value into a hard reg?

3.) After Steps 1.) and 2.), Thumb1FrameLowering::emitEpilogue will find 
TCRETURNri pseudo instructions for tail calls. We could detect this 
easily and probably handle this case within a new private member 
Thumb1FrameLowering::emitTailCallEpilogue ().

Thumb1FrameLowering::emitTailCallEpilogue () will have to arrange for
- updating the stack pointer,
- restoring GP registers pushed on the stack,
- restoring LR and
- jumping to the target address.

The problem is that restoring LR requires a scratch register because pop 
{lr} is not available. In any way, the target address will be available 
in a register.

Within Thumb1FrameLowering::emitTailCallEpilogue (), I plan to 
distinguish only 3 cases:

a) target address is in r12. This means, that we would have to generate 
*very* complicated code for restoring LR *before* poping the callee 
save-registers r4 ... r7, poping r4...r7, re-adjusting SP for the unused 
stack slot used for LR and then using BX r12. Instead of this very 
complicated code, I'd rather pop all callee-save-registers r4...r7 
except LR and use a BLX r12; pop {pc} sequence instead. I.e. the stack 
frame would grow by 4 bytes for the LR not yet popped. Note that this 
would prevent tail call optimizations as soon as one parameter is passed 
on the stack. (see side-note below on this issue).

b) target address of TCRETURNri is in r3. I would use "mov r12,r3" for 
freeing r3 as scratch register, pop all registers except LR and issue a 
pop {r3}; mov LR,r3; BX r12 sequence.

c) target address of TCRETURNri is not in r3. Assert that r3 is not used 
for parameters. Use r3 as scratch register as in case b) and proceed.

This code would be slightly sub-optimal in the case b) handling exactly 
3 parameters passed in registers if the target address is constant. In 
this case, loading the target address into r3 could have been done 
*after* restoring LR contents. However, the penalty in this case would 
be only the mov r12,r3 instruction and the complexity of handling 
constant and variable adresses differently and loading the operands to 
registers would not go into Thumb1FrameLowering::emitTailCallEpilogue ().

QUESTION 2: Is this a good idea or should I rather choose the *very* 
complicated code version or should the case a) be split into aa) for no 
parameters passed on the stack (using BLX r12; pop {pc} ) and ab) using 
the *very* complicated code version

QUESTION 3:

I also would appreciate suggestions on the development enviroment to 
use. I so far build LLC from the command shell and use Eclipse for 
debugging and stepping through the code using the local gdb. I'm a bit 
unhappy of my environment because doing so is extremely slow on my 
machine. The Eclipse indexer is incredibly slow and frequently 
references are not correctly resolved in the editor :-(. Also spawning 
gdb with the llc executable from eclipse takes about 2 minutes  for each 
run ... Waiting for a full build being finished (with clang and the 
other stuff) takes half a day on my machine. When running make opt and 
llc seem to be built first by make, so I'm currently canceling the 
build. But also when intterrupting make after linking llc each build of 
llc takes about 5 minutes on my machine ... Is there a way for speeding 
up builds for development? Is there an environment recommended for 
working on LLVM? (I did not find any hint in the documentation).

Yours,

Björn Haase

Side-note:

It is my understanding that the code

extern void peter(int a,int b, int c, int d, int e);

void
hugo (int a,int b, int c, int d, int e)
{
    peter (a,b,c,d,e);
    return;
}

should in fact be eligible for tail call optimization. However, I 
notice, that within ARMTargetLowering::IsEligibleForTailCallOptimization
the check for matching stack offsets for parameter #5 fails so that the 
function is marked not to be eligible for tail call optimizations.

The relevant code portion within 
ARMTargetLowering::IsEligibleForTailCallOptimization is:

     } else if (!VA.isRegLoc()) {
           if (!MatchingStackOffset(Arg, VA.getLocMemOffset(), Flags,
                                    MFI, MRI, TII))
             return false;
         }

MatchingStackOffset returns false for the 5th parameter. In my eyes this 
seems to be a bug and not intentional.? Should I file a bug ?