[llvm-commits] patch: PUSH/POP optimization for THUMB

liangh at codeaurora.org liangh at codeaurora.org
Mon Nov 5 17:24:02 PST 2012


Hi Evan and Gordon,

Would you please let me know your idea about this patch after my
explainations?

Thanks.
-Liang

> Hi Evan and Gordon,
>
> Thank you very much for your feedback.
>
> First of all, I agree with what Gordon said that ARMv7 and after have a
> wider data bus to hide the load-multi latency. But still we can save the
> power in this way. More important, as Evan has pointed out, this patch is
> primarily a code size optimization, enabled by "-mthumb -O.." (as long as
> not "-O0"). I didn't limited it to "-Os" because sometime people want
> "-mthumb -O3/-O2" and this optimization won't hurt them.
>
> Second, this case comes up fairly frequent. Although I don't have
> statistics I do seem them a lot, and GCC uses the same stragegy as I
> implemented, that's why I worked on this. BTW. no matter how frequently
> this case comes up, the benefit comes for free. Plus, we enable this
> optimization for codesize mainly.
>
> Finally, let's talk about the complexity.
> The function hasLiveStackObjects() contains small simple calculations only
> and has been decleared as "inline". It looks cleaner in such orgnization,
> but it can be flatten into the caller if you prefer.
>
> I was trying to simplify it, but there is no easy way to do so because it
> involves two passes -- ARMFrameLowering (to set part of the information)
> and PEI (to finally insert the instructions). (PEI does have reasons to
> have been so complicated and messy.)
>
> Actrually, the main "complexity" in this patch comes from the fact that I
> need to pre-calculate (via hasLiveStackObjects()) to tell if an
> SP-adjustment instruction will be following, since the instruction itself
> has not yet been inserted by this early. The code in hasLiveStackObjects()
> is in fact partially redundant to some code in PEI.
>
> An alternative way is to remove this condition (as well as
> hasLiveStackObjects()), and do stack-padding unconditionally:
>
> ##################
> foo:
>  push  {r11, lr}
>  ...
>  pop {r11, pc}
> ##################
>
> would be changed to:
> ##################
> foo:
>  push  {lr}
>  sub sp, #4
>  ...
>  add sp, #4
>  pop {pc}
> ##################
>
> But, I don't know if people is going to like this.
>
> In short, the patch is not very complicated as it seemed, and not easy to
> simplify. But if you do have any good idea to significantly simplify it,
> please let me know and I would happy to rethink it.
>
> Thanks.
> -Liang
>
>> I see this as primarily a code size optimization. That's fine as it is.
>> However, the implementation should be localized in one spot for a small
>> optimization like this. The PEI code is already fairly messy and
>> complicated and we should avoid making it worse.
>>
>> Evan
>>
>> On Oct 20, 2012, at 8:19 PM, Gordon Keiser <gkeiser at arxan.com> wrote:
>>
>>> I was just looking at the instruction timings for load/store vs.
>>> load/store multiple.    It looks like on modern chips both of these
>>> will
>>> eat the same number of cycles, but on ARMv6T2 and older the extra
>>> register being pushed might introduce an extra cycle of register lock
>>> delay in the rare case where LR is used immediately (by BL maybe).
>>>
>>> At the end of the function, it seems like a POP of multiple items
>>> (including PC) will always take at least a cycle longer than a single
>>> pop of the PC.
>>>
>>> ARM is somewhat sparse on their timing data for newer chips, though,
>>> saying:   "The complexity of the Cortex-A9 processor makes it
>>> impossible
>>> to calculate precise timing information manually."
>>>
>>> Information from
>>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0338g/DDI0338G_arm1156t2s_r0p4_trm.pdf
>>> and
>>> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0388i/DDI0388I_cortex_a9_r4p1_trm.pdf
>>>
>>> So maybe this optimization should be run at -Os builds and / or when
>>> building for v6T2 and below.
>>>
>>> Gordon Keiser
>>> Software Development Engineer
>>> Arxan Technologies
>>> gkeiser at arxan.com  www.arxan.com
>>>
>>>> -----Original Message-----
>>>> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
>>>> bounces at cs.uiuc.edu] On Behalf Of Evan Cheng
>>>> Sent: Saturday, October 20, 2012 2:14 PM
>>>> To: liangh at codeaurora.org
>>>> Cc: llvm-commits at cs.uiuc.edu
>>>> Subject: Re: [llvm-commits] patch: PUSH/POP optimization for THUMB
>>>>
>>>> Hi Liang,
>>>>
>>>> It seems the cost is too high for the reward. I understand the benefit
>>>> of using a
>>>> 16-bit instruction here but does this comes up frequently? I would
>>>> have
>>>> been
>>>> ok with this if it's a localized change. But your proposal includes a
>>>> target hook
>>>> hasLiveStackObjects() that seems to serve no purpose other than this
>>>> minor
>>>> optimization. Is it possible for you to significantly simplify the
>>>> patch?
>>>>
>>>> Evan
>>>>
>>>> On Oct 19, 2012, at 4:04 PM, liangh at codeaurora.org wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> This patch implements a PUSH/POP optimization.
>>>>>
>>>>> Currently, LLVM pushes/pops a dummy register to keep the stack
>>>>> aligned
>>>>> when the number of registers need to be pushed/popped is an odd.
>>>>> For example: "r11" here is pushed as a pad to align the stack.
>>>>> ##################
>>>>> foo:
>>>>> push  {r11, lr}
>>>>> sub sp, #8
>>>>> add r0, sp, #4
>>>>> bl  goo
>>>>> add sp, #8
>>>>> pop {r11, pc}
>>>>> ##################
>>>>>
>>>>> Enabled by this patch, the stack can be padded to align if the PUSH
>>>>> instruction is followed by an SP-adjusting instruction, so that the
>>>>> useless efforts of storing and loading the dummy register can be
>>>>> saved.
>>>>> Also, a PUSH instruction can be encoded with 16 but not 32 bits if it
>>>>> is pushing a single register.
>>>>> For example, the above code will be changed to:
>>>>> ##################
>>>>> foo:
>>>>> push  {lr}
>>>>> sub sp, #12
>>>>> add r0, sp, #4
>>>>> bl  goo
>>>>> add sp, #12
>>>>> pop {pc}
>>>>> ##################
>>>>>
>>>>> This optimization only takes effect in thumb mode with a non-zero
>>>>> optimization level.
>>>>>
>>>>> Could you please review the attached patch?
>>>>> Thanks.
>>>>>
>>>>> -Liang<0001-PUSH-POP-
>>>> Optimization.patch>______________________________
>>>>> _________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>>
>
>
>





More information about the llvm-commits mailing list