[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets

Bjoern Haase bjoern.m.haase at web.de
Mon Jan 12 14:43:58 PST 2015


Hello John,

thank you for your feedback.

On 12.01.2015 at 15:50, John Brawn wrote:
> Some comments on the general approach:
>
>> For Tail calls identified during DAG generation, the target address
>> will
>> be loaded into a register
>> by use of the constant pool. If R3 is used for argument passing, the
>> target address is forced
>> to hard reg R12 in order to overcome limitations thumb1 register
>> allocator with respect to
>> the upper registers.
> For the case when r3 is not used for argument passing doing this is
> definitely a good idea, but if we have to BX via r12 then we have
> to first do an LDR to some other register as there's no instruction
> to LDR directly to r12. With current trunk LLVM and your patch for
>
>    int calledfn(int,int,int,int);
>    int test() {
>      return calledfn(1,2,3,4);
>    }
>
> I get
>
> test:
>          push    {r4, lr}
>          movs    r0, #1
>          movs    r1, #2
>          movs    r2, #3
>          movs    r3, #4
>          ldr     r4, .LCPI0_0
>          mov     r12, r4
>          ldr     r4, [sp, #4]
>          mov     lr, r4
>          pop     {r4}
>          add     sp, #4
>          bx      r12
>
> r4 has been used to load the target address, which means we have to save
> and restore r4 which means we're no better off than not tailcalling.
Concerning speed: yes, I agree.
For this specific, extremely short function: yes, I agree.

The benefit to expect (in my opinion) is *not* speed or code size. It's 
stack usage. On a typical Cortex-M0 with, say, 64k flash, 8k RAM and 
functions with stack frames of around 128 bytes, it can easily make a 
difference whether or not the stack frame is released when a function 
ends by calling the next one. In fact, I am currently working on 
communication stack code that does exactly this very frequently, and 
unfortunately some of the tail calls occurring in the biggest stack 
consumers happen to use 4 parameters :-(.
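
To illustrate the kind of pattern I mean (a made-up sketch, not code 
from the actual project; the names are invented):

    int forward_frame(int a, int b, int c, int d);

    int process_frame(int a, int b, int c, int d) {
      char scratch[128];                 // large local frame, typical here
      // ... fill and use scratch ...
      return forward_frame(a, b, c, d);  // tail call with 4 parameters
    }

    // Without tail call optimization, the 128-byte frame of
    // process_frame() stays allocated for the whole duration of
    // forward_frame(); with a proper tail call it is released before
    // forward_frame() runs.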

And you can also see the next problem:
>> During epilog generation, spill register restoring will be done within
>> the emit epilogue function.
>> If LR happens to be spilled on the stack by the prologue, it's restored
>> by use of a scratch register
>> just before restoring the other registers.
> POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
> stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
> (ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.
>
> So I think tailcalling is only worthwhile if the function does not save
> lr and r3 is free to hold the target address. Also needing consideration
> is what happens if callee-saved registers other than lr need to be saved,
> but I haven't looked into this.
I am aware of that, and I fully agree with you that with respect to 
speed and code size the code will not improve and may even get worse.

Why I still believe tail call optimization to be beneficial on many 
occasions is my personal experience on real-world projects. The last 15 
years of work on projects on "v6m-scale" microcontroller systems in 
different companies taught me that the true issue is mostly not program 
memory and speed but RAM and stack usage. Given a typical RAM/flash 
ratio of 8k/64k, I'd say that 1 byte of RAM spared is roughly "worth" 
8 bytes of flash memory. (I know and admit that whether tail calls 
would actually save RAM strongly depends on whether the biggest stack 
frame shows up due to tail calls or due to inner calls.)

If I were asked, I'd strongly advocate an "optimize for RAM" policy for 
v6m, even if it means making some limited compromises on code size and 
speed (see also my other mail from yesterday). I don't think that there 
is a right or wrong "answer" to this question. It is risky to base the 
design choice on a single person's bias, so I'd suggest asking others 
for their opinion. Maybe there is also some useful benchmarking code 
around.

If I knew how to implement it, I'd suggest a heuristic at DAG 
generation that enables tail call optimization only if the expected 
stack frame size gets larger than, say, 32 bytes, or behind a specific 
compile switch.
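
As a rough sketch of the compile-switch half of that idea (the option 
name is invented, and the frame-size heuristic would still need a real 
way to estimate the frame that early):

    #include "llvm/Support/CommandLine.h"
    using namespace llvm;

    // Invented option: let users opt in to Thumb1 tail calls when RAM /
    // stack usage matters more than speed or code size.
    static cl::opt<bool> EnableThumb1TailCallsForRAM(
        "thumb1-tail-calls-for-ram", cl::Hidden, cl::init(false),
        cl::desc("Allow Thumb1 tail calls that may cost speed/size but "
                 "release the caller's stack frame earlier"));

    // Inside ARMTargetLowering::IsEligibleForTailCallOptimization(...),
    // something like:
    //   if (Subtarget->isThumb1Only() && !EnableThumb1TailCallsForRAM)
    //     return false;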

> A few comments on the patch itself (I've only given it a quick look over):
>
> +  bool IsLRPartOfCalleeSavedInfo = false;
> +
> +  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
>       if (CSI.getReg() == ARM::LR)
> -      IsV4PopReturn = true;
> +    	IsLRPartOfCalleeSavedInfo = true;
> +    }
> +
> +  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
>     IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();
>   
> +  if (IsTailCallReturn) {
> +    MBBI = MBB.getLastNonDebugInstr();
> +
> +    // First restore callee saved registers. Unlike for normal returns
> +    // this is *not* done in restoreCalleeSavedRegisters.
> +    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
> +
> +    bool IsR4IncludedInCSI = false;
> +    bool IsLRIncludedInCSI = false;
> +    for (unsigned i = CSI.size(); i != 0; --i) {
> +      unsigned Reg = CSI[i-1].getReg();
> +      if (Reg == ARM::R4)
> +        IsR4IncludedInCSI = true;
> +      if (Reg == ARM::LR)
> +        IsLRIncludedInCSI = true;
> +    }
>
> You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work
> in setting IsLRIncludedInCSI.
Thank you for pointing this out. You are right; that stems from copy 
and paste. I previously had this part not in the epilogue generation 
but in the register restoring function, where I did need the duplication.
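
In the next version of the patch I'll collect both flags in the single 
loop, roughly like this:

    bool IsLRPartOfCalleeSavedInfo = false;
    bool IsR4PartOfCalleeSavedInfo = false;
    for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
      if (CSI.getReg() == ARM::LR)
        IsLRPartOfCalleeSavedInfo = true;
      if (CSI.getReg() == ARM::R4)
        IsR4PartOfCalleeSavedInfo = true;
    }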

>
> +      // ARMv4T and tail call returns require BX, see emitEpilogue
> +      if ((STI.hasV4TOps() && !STI.hasV5TOps()))
>
> The comment says 'tail call returns require BX', but the code doesn't do that.
Yes. In fact, I used the tail jump instruction, which is ultimately 
implemented as BX, but I agree, the comment should rather be rephrased 
as "require branch address in register".

As an additional question: when doing some execution testing in the 
simulator, I stumbled over an issue with predicates in the 
instructions.

+        AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH))
+            .addReg(ARM::R4,RegState::Kill));

It seems that the predicate information needs to be placed at a very 
specific position within the instruction operand list in order to 
prevent internal compiler errors. For push/pop, the predicate seems to 
be required to come first, before the register list; for others, it 
seems to have to be placed directly after the last explicit operand. 
Is there some documentation on where to place the predicates?
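
In code, the two placements I mean look roughly like this (a sketch 
modelled on what the existing Thumb1 code appears to do; the second 
instruction is just an example):

    // Predicate before the register list (what tPUSH/tPOP seem to need):
    AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH)))
        .addReg(ARM::R4, RegState::Kill);

    // Predicate after the last explicit operand (e.g. for tMOVr):
    AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tMOVr), ARM::LR)
                       .addReg(ARM::R4, RegState::Kill));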

Yours,

Björn.
