[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets

Mon Jan 12 06:50:33 PST 2015

Some comments on the general approach:

> For Tail calls identified during DAG generation, the target address
> will
> be loaded into a register
> by use of the constant pool. If R3 is used for argument passing, the
> target address is forced
> to hard reg R12 in order to overcome limitations thumb1 register
> allocator with respect to
> the upper registers.

For the case when r3 is not used for argument passing doing this is
definitely a good idea, but if we have to BX via r12 then we have
to first do an LDR to some other register as there's no instruction
to LDR directly to r12. With current trunk LLVM and your patch for

  int calledfn(int,int,int,int);
  int test() {
    return calledfn(1,2,3,4);
  }

I get

test:
        push    {r4, lr}
        movs    r0, #1
        movs    r1, #2
        movs    r2, #3
        movs    r3, #4
        ldr     r4, .LCPI0_0
        mov     r12, r4
        ldr     r4, [sp, #4]
        mov     lr, r4
        pop     {r4}
        add     sp, #4
        bx      r12

r4 has been used to load the target address, which means we have to save
and restore r4 which means we're no better off than not tailcalling. And
you can also see the next problem:

> During epilog generation, spill register restoring will be done within
> the emit epilogue function.
> If LR happens to be spilled on the stack by the prologue, it's restored
> by use of a scratch register
> just before restoring the other registers.

POP is 1+N cycles whereas LDR is 2 cycles. If we need to LDR lr from the
stack then POP r4 then that's 2 (LDR) + 1+1 (POP) + 1 (MOV to lr) + 1
(ADD sp) = 6 cycles, but a POP {r4,lr} is just 3 cycles.

So I think tailcalling is only worthwhile if the function does not save
lr and r3 is free to hold the target address. Also needing consideration
is what happens if callee-saved registers other than lr need to be saved,
but I haven't looked into this.

A few comments on the patch itself (I've only given it a quick look over):

+  bool IsLRPartOfCalleeSavedInfo = false;
+
+  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
     if (CSI.getReg() == ARM::LR)
-      IsV4PopReturn = true;
+    	IsLRPartOfCalleeSavedInfo = true;
+    }
+
+  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
   IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();

+  if (IsTailCallReturn) {
+    MBBI = MBB.getLastNonDebugInstr();
+
+    // First restore callee saved registers. Unlike for normal returns
+    // this is *not* done in restoreCalleeSavedRegisters.
+    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
+
+    bool IsR4IncludedInCSI = false;
+    bool IsLRIncludedInCSI = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+      if (Reg == ARM::R4)
+        IsR4IncludedInCSI = true;
+      if (Reg == ARM::LR)
+        IsLRIncludedInCSI = true;
+    }

You set IsLRPartOfCalleeSavedInfo then right afterwards duplicate the work
in setting IsLRIncludedInCSI.

+      // ARMv4T and tail call returns require BX, see emitEpilogue
+      if ((STI.hasV4TOps() && !STI.hasV5TOps()))

The comment says 'tail call returns require BX', but the code doesn't do that.

John