[llvm-commits] patch: PUSH/POP optimization for THUMB

Tue Nov 6 16:40:07 PST 2012

Unfortunately I don't know how to implement this optimization simpler. My objection stands though, the cost of this optimization doesn't seem to be worth the gain. In addition to the code complexity, I'm also concerned about compile time impact. e.g. The function below does a linear search.

+/// This tells if there are live stack objects that will
+/// require SP update instruction in the function prolog.
+/// If so, stack-padding is preffered than pushing a dummy register
+/// to align the stack.
+bool
+ARMFrameLowering::hasLiveStackObjects(MachineFunction &MF,
+                                  RegScavenger *RS) const{
+  MachineFrameInfo *MFI = MF.getFrameInfo();
+  for (unsigned i = 0, e = MFI->getObjectIndexEnd(); i != e; ++i) {
+    if (MFI->isDeadObjectIndex(i))
+      continue;
+
+    if(MFI->getObjectSize(i)>0)
+      return true;
+  }
+
+  return false;
+}
+

Can you give us a better idea how much code size benefit this provides? Perhaps you can try it on the llvm test suite and show the numbers?

Evan

On Nov 5, 2012, at 5:24 PM, liangh at codeaurora.org wrote:

> Hi Evan and Gordon,
> 
> Would you please let me know your idea about this patch after my
> explainations?
> 
> Thanks.
> -Liang
> 
>> Hi Evan and Gordon,
>> 
>> Thank you very much for your feedback.
>> 
>> First of all, I agree with what Gordon said that ARMv7 and after have a
>> wider data bus to hide the load-multi latency. But still we can save the
>> power in this way. More important, as Evan has pointed out, this patch is
>> primarily a code size optimization, enabled by "-mthumb -O.." (as long as
>> not "-O0"). I didn't limited it to "-Os" because sometime people want
>> "-mthumb -O3/-O2" and this optimization won't hurt them.
>> 
>> Second, this case comes up fairly frequent. Although I don't have
>> statistics I do seem them a lot, and GCC uses the same stragegy as I
>> implemented, that's why I worked on this. BTW. no matter how frequently
>> this case comes up, the benefit comes for free. Plus, we enable this
>> optimization for codesize mainly.
>> 
>> Finally, let's talk about the complexity.
>> The function hasLiveStackObjects() contains small simple calculations only
>> and has been decleared as "inline". It looks cleaner in such orgnization,
>> but it can be flatten into the caller if you prefer.
>> 
>> I was trying to simplify it, but there is no easy way to do so because it
>> involves two passes -- ARMFrameLowering (to set part of the information)
>> and PEI (to finally insert the instructions). (PEI does have reasons to
>> have been so complicated and messy.)
>> 
>> Actrually, the main "complexity" in this patch comes from the fact that I
>> need to pre-calculate (via hasLiveStackObjects()) to tell if an
>> SP-adjustment instruction will be following, since the instruction itself
>> has not yet been inserted by this early. The code in hasLiveStackObjects()
>> is in fact partially redundant to some code in PEI.
>> 
>> An alternative way is to remove this condition (as well as
>> hasLiveStackObjects()), and do stack-padding unconditionally:
>> 
>> ##################
>> foo:
>> push  {r11, lr}
>> ...
>> pop {r11, pc}
>> ##################
>> 
>> would be changed to:
>> ##################
>> foo:
>> push  {lr}
>> sub sp, #4
>> ...
>> add sp, #4
>> pop {pc}
>> ##################
>> 
>> But, I don't know if people is going to like this.
>> 
>> In short, the patch is not very complicated as it seemed, and not easy to
>> simplify. But if you do have any good idea to significantly simplify it,
>> please let me know and I would happy to rethink it.
>> 
>> Thanks.
>> -Liang
>> 
>>> I see this as primarily a code size optimization. That's fine as it is.
>>> However, the implementation should be localized in one spot for a small
>>> optimization like this. The PEI code is already fairly messy and
>>> complicated and we should avoid making it worse.
>>> 
>>> Evan
>>> 
>>> On Oct 20, 2012, at 8:19 PM, Gordon Keiser <gkeiser at arxan.com> wrote:
>>> 
>>>> I was just looking at the instruction timings for load/store vs.
>>>> load/store multiple.    It looks like on modern chips both of these
>>>> will
>>>> eat the same number of cycles, but on ARMv6T2 and older the extra
>>>> register being pushed might introduce an extra cycle of register lock
>>>> delay in the rare case where LR is used immediately (by BL maybe).
>>>> 
>>>> At the end of the function, it seems like a POP of multiple items
>>>> (including PC) will always take at least a cycle longer than a single
>>>> pop of the PC.
>>>> 
>>>> ARM is somewhat sparse on their timing data for newer chips, though,
>>>> saying:   "The complexity of the Cortex-A9 processor makes it
>>>> impossible
>>>> to calculate precise timing information manually."
>>>> 
>>>> Information from
>>>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0338g/DDI0338G_arm1156t2s_r0p4_trm.pdf
>>>> and
>>>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
>>>> 
>>>> So maybe this optimization should be run at -Os builds and / or when
>>>> building for v6T2 and below.
>>>> 
>>>> Gordon Keiser
>>>> Software Development Engineer
>>>> Arxan Technologies
>>>> gkeiser at arxan.com  www.arxan.com
>>>> 
>>>>> -----Original Message-----
>>>>> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
>>>>> bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
>>>>> Sent: Saturday, October 20, 2012 2:14 PM
>>>>> To: liangh at codeaurora.org
>>>>> Cc: llvm-commits at cs.uiuc.edu
>>>>> Subject: Re: [llvm-commits] patch: PUSH/POP optimization for THUMB
>>>>> 
>>>>> Hi Liang,
>>>>> 
>>>>> It seems the cost is too high for the reward. I understand the benefit
>>>>> of using a
>>>>> 16-bit instruction here but does this comes up frequently? I would
>>>>> have
>>>>> been
>>>>> ok with this if it's a localized change. But your proposal includes a
>>>>> target hook
>>>>> hasLiveStackObjects() that seems to serve no purpose other than this
>>>>> minor
>>>>> optimization. Is it possible for you to significantly simplify the
>>>>> patch?
>>>>> 
>>>>> Evan
>>>>> 
>>>>> On Oct 19, 2012, at 4:04 PM, liangh at codeaurora.org wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> This patch implements a PUSH/POP optimization.
>>>>>> 
>>>>>> Currently, LLVM pushes/pops a dummy register to keep the stack
>>>>>> aligned
>>>>>> when the number of registers need to be pushed/popped is an odd.
>>>>>> For example: "r11" here is pushed as a pad to align the stack.
>>>>>> ##################
>>>>>> foo:
>>>>>> push  {r11, lr}
>>>>>> sub sp, #8
>>>>>> add r0, sp, #4
>>>>>> bl  goo
>>>>>> add sp, #8
>>>>>> pop {r11, pc}
>>>>>> ##################
>>>>>> 
>>>>>> Enabled by this patch, the stack can be padded to align if the PUSH
>>>>>> instruction is followed by an SP-adjusting instruction, so that the
>>>>>> useless efforts of storing and loading the dummy register can be
>>>>>> saved.
>>>>>> Also, a PUSH instruction can be encoded with 16 but not 32 bits if it
>>>>>> is pushing a single register.
>>>>>> For example, the above code will be changed to:
>>>>>> ##################
>>>>>> foo:
>>>>>> push  {lr}
>>>>>> sub sp, #12
>>>>>> add r0, sp, #4
>>>>>> bl  goo
>>>>>> add sp, #12
>>>>>> pop {pc}
>>>>>> ##################
>>>>>> 
>>>>>> This optimization only takes effect in thumb mode with a non-zero
>>>>>> optimization level.
>>>>>> 
>>>>>> Could you please review the attached patch?
>>>>>> Thanks.
>>>>>> 
>>>>>> -Liang<0001-PUSH-POP-
>>>>> Optimization.patch>______________________________
>>>>>> _________________
>>>>>> llvm-commits mailing list
>>>>>> llvm-commits at cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>> 
>>>>> _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>> 
>>> 
>> 
>> 
>> 
> 
>