[LLVMdev] [RFC] [PATCH] add tail call optimization to thumb1-only targets

Bjoern Haase bjoern.m.haase at web.de
Sat Jan 10 16:31:02 PST 2015


Hello,

find enclosed a first patch for adding tail call optimizations for 
thumb1 targets.
I assume that this list is the right place for publishing patches for 
review?

Since this is my first proposal for LLVM, I'd very much appreciate your 
feedback.

What the patch is meant to do:

For Tail calls identified during DAG generation, the target address will 
be loaded into a register
by use of the constant pool. If R3 is used for argument passing, the 
target address is forced
to hard reg R12 in order to overcome limitations thumb1 register 
allocator with respect to
the upper registers.

I decided to fetch the target address to a register by a constant pool 
lookup because when
analyzing the code I found out, that the mechanisms are prepared also 
for situations, where
parameters are both passed in regs and on the stack. This would not be 
possible when using
a BL // pop {pc} sequence within the epilogue since this would change 
the stack offsets.

During epilog generation, spill register restoring will be done within 
the emit epilogue function.
If LR happens to be spilled on the stack by the prologue, it's restored 
by use of a scratch register
just before restoring the other registers.

I have so far tested the code by hand with a number of tests by 
analyzing generated assembly.
In the lit testsuite I get 4 failures which I attribute at a first 
analysis to the fact that the generated code for tail calls
results in different output that no longer matches the expectation strings.

Yours,

Björn
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sampleTests.tar.gz
Type: application/x-gzip
Size: 1119 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150111/57000648/attachment.bin>
-------------- next part --------------
Index: Thumb1FrameLowering.cpp
===================================================================
--- Thumb1FrameLowering.cpp	(Revision 225589)
+++ Thumb1FrameLowering.cpp	(Arbeitskopie)
@@ -323,11 +323,18 @@
 }
 
 void Thumb1FrameLowering::emitEpilogue(MachineFunction &MF,
-                                   MachineBasicBlock &MBB) const {
+                                       MachineBasicBlock &MBB) const {
   MachineBasicBlock::iterator MBBI = MBB.getLastNonDebugInstr();
   assert((MBBI->getOpcode() == ARM::tBX_RET ||
-          MBBI->getOpcode() == ARM::tPOP_RET) &&
-         "Can only insert epilog into returning blocks");
+          MBBI->getOpcode() == ARM::tPOP_RET ||
+          MBBI->getOpcode() == ARM::TCRETURNri)
+          && "Can only insert epilog into returning blocks "
+             "and tail calls with address in regs.");
+
+  bool IsTailCallReturn = false;
+  if (MBBI->getOpcode() == ARM::TCRETURNri)
+	  IsTailCallReturn = true;
+
   DebugLoc dl = MBBI->getDebugLoc();
   MachineFrameInfo *MFI = MF.getFrameInfo();
   ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
@@ -351,8 +358,8 @@
     if (NumBytes - ArgRegsSaveSize != 0)
       emitSPUpdate(MBB, MBBI, TII, dl, *RegInfo, NumBytes - ArgRegsSaveSize);
   } else {
-    // Unwind MBBI to point to first LDR / VLDRD.
-    if (MBBI != MBB.begin()) {
+    // Unwind MBBI to point to first LDR / VLDRD. Not for tail call returns!
+    if ((MBBI != MBB.begin()) && (!IsTailCallReturn)) {
       do
         --MBBI;
       while (MBBI != MBB.begin() && isCSRestore(MBBI, CSRegs));
@@ -395,12 +402,119 @@
     }
   }
 
-  bool IsV4PopReturn = false;
-  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo())
+  bool IsLRPartOfCalleeSavedInfo = false;
+
+  for (const CalleeSavedInfo &CSI : MFI->getCalleeSavedInfo()) {
     if (CSI.getReg() == ARM::LR)
-      IsV4PopReturn = true;
+    	IsLRPartOfCalleeSavedInfo = true;
+    }
+
+  bool IsV4PopReturn = IsLRPartOfCalleeSavedInfo;
   IsV4PopReturn &= STI.hasV4TOps() && !STI.hasV5TOps();
 
+  if (IsTailCallReturn) {
+    MBBI = MBB.getLastNonDebugInstr();
+
+    // First restore callee saved registers. Unlike for normal returns
+    // this is *not* done in restoreCalleeSavedRegisters.
+    const std::vector<CalleeSavedInfo> &CSI(MFI->getCalleeSavedInfo());
+
+    bool IsR4IncludedInCSI = false;
+    bool IsLRIncludedInCSI = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+      if (Reg == ARM::R4)
+        IsR4IncludedInCSI = true;
+      if (Reg == ARM::LR)
+        IsLRIncludedInCSI = true;
+    }
+
+    MachineFunction &MF = *MBB.getParent();
+    const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
+
+    // We need to additionally push/pop R4 in case that LR reconstruction
+    // for tail calls requires R4 as scratch register.
+    bool IsR4ToBeAdditionallyAddedToPopIns = false;
+
+    if (IsLRIncludedInCSI) {
+      // We need to restore LR and need a scratch register for this purpose
+      int StackSlotForSavedLR = CSI.size() - 1;
+      assert (StackSlotForSavedLR >= 0 && "Wrong Stack slot for LR.");
+
+      // Make sure that R4 may be used as scratch. Add an additional tPUSH (R4)
+      // if necessary.
+      if (!IsR4IncludedInCSI) {
+        IsR4ToBeAdditionallyAddedToPopIns = true;
+
+        AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tPUSH))
+            .addReg(ARM::R4,RegState::Kill));
+
+        StackSlotForSavedLR ++;
+      }
+
+      AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tLDRspi))
+        .addReg(ARM::R4, RegState::Define)
+        .addReg(ARM::SP)
+        .addImm(StackSlotForSavedLR));
+
+      AddDefaultPred(BuildMI(MBB, MBBI, dl, TII.get(ARM::tMOVr))
+        .addReg(ARM::LR, RegState::Define)
+        .addReg(ARM::R4, RegState::Kill));
+    }
+
+    MachineInstrBuilder MIB = BuildMI(MF, dl, TII.get(ARM::tPOP));
+    AddDefaultPred(MIB);
+
+    bool NumRegs = false;
+    for (unsigned i = CSI.size(); i != 0; --i) {
+      unsigned Reg = CSI[i-1].getReg();
+
+      if (Reg == ARM::LR)
+        continue;
+
+      MIB.addReg(Reg, getDefRegState(true));
+      NumRegs = true;
+    }
+
+    if (IsR4ToBeAdditionallyAddedToPopIns) {
+      MIB.addReg(ARM::R4, getDefRegState(true));
+      NumRegs = true;
+    }
+
+    // It's illegal to emit pop instruction without operands.
+    if (NumRegs)
+      MBB.insert(MBBI, &*MIB);
+    else
+      MF.DeleteMachineInstr(MIB);
+
+    if (IsLRIncludedInCSI) {
+      const Thumb1RegisterInfo *RegInfo =
+          static_cast<const Thumb1RegisterInfo *>
+                      (MF.getSubtarget().getRegisterInfo());
+
+      // Re-adjust stack pointer for LR content still residing on the stack.
+      emitSPUpdate(MBB, MBBI, TII, dl, *RegInfo, 4);
+    }
+
+    MachineOperand &JumpTarget = MBBI->getOperand(0);
+
+    assert (MBBI->getOpcode() == ARM::TCRETURNri);
+    DebugLoc dl = MBBI->getDebugLoc();
+
+    BuildMI(MBB, MBBI, dl,
+            TII.get(ARM::tTAILJMPr))
+     .addReg(JumpTarget.getReg(), RegState::Kill);
+
+    MachineInstr *NewMI = std::prev(MBBI);
+    for (unsigned i = 1, e = MBBI->getNumOperands(); i != e; ++i)
+      NewMI->addOperand(MBBI->getOperand(i));
+
+    // Delete the pseudo instruction TCRETURN.
+    MBB.erase(MBBI);
+    MBBI = NewMI;
+    return;
+  }
+
   // Unlike T2 and ARM mode, the T1 pop instruction cannot restore
   // to LR, and we can't pop the value directly to the PC since
   // we need to update the SP after popping the value. So instead
@@ -501,15 +615,25 @@
                             MachineBasicBlock::iterator MI,
                             const std::vector<CalleeSavedInfo> &CSI,
                             const TargetRegisterInfo *TRI) const {
+
+  MachineBasicBlock::iterator MBBI = MBB.getLastNonDebugInstr();
+  bool IsTailCallReturn = false;
+  if(MBBI->getOpcode() == ARM::TCRETURNri)
+    IsTailCallReturn = true;
+
   if (CSI.empty())
     return false;
 
+  // We will handle callee saving in Epilogue generation and not here.
+  if (IsTailCallReturn)
+    return true;
+
   MachineFunction &MF = *MBB.getParent();
   ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
   const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
-
   bool isVarArg = AFI->getArgRegsSaveSize() > 0;
   DebugLoc DL = MI->getDebugLoc();
+
   MachineInstrBuilder MIB = BuildMI(MF, DL, TII.get(ARM::tPOP));
   AddDefaultPred(MIB);
 
@@ -517,12 +641,15 @@
   for (unsigned i = CSI.size(); i != 0; --i) {
     unsigned Reg = CSI[i-1].getReg();
     if (Reg == ARM::LR) {
+
       // Special epilogue for vararg functions. See emitEpilogue
       if (isVarArg)
         continue;
-      // ARMv4T requires BX, see emitEpilogue
-      if (STI.hasV4TOps() && !STI.hasV5TOps())
+
+      // ARMv4T and tail call returns require BX, see emitEpilogue
+      if ((STI.hasV4TOps() && !STI.hasV5TOps()))
         continue;
+
       Reg = ARM::PC;
       (*MIB).setDesc(TII.get(ARM::tPOP_RET));
       MIB.copyImplicitOps(&*MI);
Index: ARMSubtarget.cpp
===================================================================
--- ARMSubtarget.cpp	(Revision 225589)
+++ ARMSubtarget.cpp	(Arbeitskopie)
@@ -262,7 +262,7 @@
     SupportsTailCall = !isTargetIOS() || !getTargetTriple().isOSVersionLT(5, 0);
   } else {
     IsR9Reserved = ReserveR9;
-    SupportsTailCall = !isThumb1Only();
+    SupportsTailCall = true;
   }
 
   if (Align == DefaultAlign) {
Index: ARMISelLowering.cpp
===================================================================
--- ARMISelLowering.cpp	(Revision 225589)
+++ ARMISelLowering.cpp	(Arbeitskopie)
@@ -1671,6 +1671,26 @@
     InFlag = SDValue();
   }
 
+  // For thumb1 targets, if R3 is used for argument passing, we need
+  // to place the call target address in IP (i.e. R12).
+  bool IsR3UsedForArgumentPassing = false;
+  if (RegsToPass.size() >= 4) {
+    IsR3UsedForArgumentPassing = true;
+  }
+
+  bool IsCallAddressMoveToRegisterRequired = false;
+  bool CallAdressShallBeForcedToHardRegR12 = false;
+
+  if  (EnableARMLongCalls || (isTailCall && Subtarget->isThumb1Only() ))
+  {
+	  IsCallAddressMoveToRegisterRequired = true;
+
+	  if (isTailCall
+	      && IsR3UsedForArgumentPassing
+	      && Subtarget->isThumb1Only() )
+		  CallAdressShallBeForcedToHardRegR12 = true;
+  }
+
   // If the callee is a GlobalAddress/ExternalSymbol node (quite common, every
   // direct call is) turn it into a TargetGlobalAddress/TargetExternalSymbol
   // node so that legalize doesn't hack it.
@@ -1679,10 +1699,12 @@
   bool isLocalARMFunc = false;
   ARMFunctionInfo *AFI = MF.getInfo<ARMFunctionInfo>();
 
-  if (EnableARMLongCalls) {
+  if (IsCallAddressMoveToRegisterRequired) {
     assert((Subtarget->isTargetWindows() ||
+            (isTailCall && Subtarget->isThumb1Only()) ||
             getTargetMachine().getRelocationModel() == Reloc::Static) &&
            "long-calls with non-static relocation model!");
+
     // Handle a global address or an external symbol. If it's not one of
     // those, the target's already in a register, so we don't need to do
     // anything extra.
@@ -1695,11 +1717,14 @@
 
       // Get the address of the callee into a register
       SDValue CPAddr = DAG.getTargetConstantPool(CPV, getPointerTy(), 4);
+
       CPAddr = DAG.getNode(ARMISD::Wrapper, dl, MVT::i32, CPAddr);
+
       Callee = DAG.getLoad(getPointerTy(), dl,
                            DAG.getEntryNode(), CPAddr,
                            MachinePointerInfo::getConstantPool(),
                            false, false, false, 0);
+
     } else if (ExternalSymbolSDNode *S=dyn_cast<ExternalSymbolSDNode>(Callee)) {
       const char *Sym = S->getSymbol();
 
@@ -1785,6 +1810,12 @@
     }
   }
 
+  if (CallAdressShallBeForcedToHardRegR12) {
+	  Chain = DAG.getCopyToReg(Chain, dl, ARM::R12,
+                             Callee,Chain.getValue(1));
+    Callee = DAG.getRegister (ARM::R12,getPointerTy());
+  }
+
   // FIXME: handle tail calls differently.
   unsigned CallOpc;
   bool HasMinSizeAttr = MF.getFunction()->getAttributes().hasAttribute(
@@ -2000,26 +2031,6 @@
   if (isCalleeStructRet || isCallerStructRet)
     return false;
 
-  // FIXME: Completely disable sibcall for Thumb1 since Thumb1RegisterInfo::
-  // emitEpilogue is not ready for them. Thumb tail calls also use t2B, as
-  // the Thumb1 16-bit unconditional branch doesn't have sufficient relocation
-  // support in the assembler and linker to be used. This would need to be
-  // fixed to fully support tail calls in Thumb1.
-  //
-  // Doing this is tricky, since the LDM/POP instruction on Thumb doesn't take
-  // LR.  This means if we need to reload LR, it takes an extra instructions,
-  // which outweighs the value of the tail call; but here we don't know yet
-  // whether LR is going to be used.  Probably the right approach is to
-  // generate the tail call here and turn it back into CALL/RET in
-  // emitEpilogue if LR is used.
-
-  // Thumb1 PIC calls to external symbols use BX, so they can be tail calls,
-  // but we need to make sure there are enough registers; the only valid
-  // registers are the 4 used for parameters.  We don't currently do this
-  // case.
-  if (Subtarget->isThumb1Only())
-    return false;
-
   // Externally-defined functions with weak linkage should not be
   // tail-called on ARM when the OS does not support dynamic
   // pre-emption of symbols, as the AAELF spec requires normal calls
@@ -2365,7 +2376,7 @@
   if (!CI->isTailCall() || getTargetMachine().Options.DisableTailCalls)
     return false;
 
-  return !Subtarget->isThumb1Only();
+  return true;
 }
 
 // ConstantPool, JumpTable, GlobalAddress, and ExternalSymbol are lowered as


More information about the llvm-dev mailing list