[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences

Herbie Robinson HerbieRobinson at verizon.net
Sun Dec 21 14:11:41 PST 2014


On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
>
> Which performance guidelines are you referring to?
>
Table C-21 in "Intel® 64 and IA-32 Architectures Optimization Reference 
Manual", September 2014.

It hasn't changed.  It still lists push and pop instructions as 2-3 
times more expensive than mov.  And that's not taking into account any 
optimizations that might get undone by the stack pointer changing. I'm 
just speculating, but I suspect that mov being faster has to do with 
not having to modify a register every time.

With that as a basis, the fastest entry/exit sequences just use 
subl/addl on the stack pointer and don't use push at all.  For most C 
functions, you don't even have to materialize a frame pointer (if the 
unwind mechanisms are set up to handle that).  Not that I am 
recommending changing the x86_32 code generation to do that.
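For concreteness, a frameless entry/exit sequence of that kind would look roughly like the following (an illustrative sketch only, not generated compiler output; the label and frame size are made up):

```asm
_f:                         # no pushl %ebp / movl %esp, %ebp
        subl    $24, %esp   # allocate locals + outgoing args in one shot
        # ... body addresses everything relative to %esp ...
        addl    $24, %esp   # single adjustment on exit
        retl
```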
>
> I’m not that familiar with decade-old CPUs, but to the best of my 
> knowledge, this is not true on current hardware.
>
> There is one specific circumstance where PUSHes should be avoided – 
> for Atom/Silvermont processors, the memory form of PUSH is 
> inefficient, so the register-freeing optimization below may not be 
> profitable (see 14.3.3.6 and 15.3.1.2 in the Intel optimization 
> reference manual).
>
> Having said that, one distinct possibility is to have the heuristic 
> make different decisions depending on the optimization flags, that is, 
> be much more aggressive for optsize functions.
>
> *From:*Herbie Robinson [mailto:HerbieRobinson at verizon.net]
> *Sent:* Sunday, December 21, 2014 10:58
> *To:* Kuperstein, Michael M; LLVMdev at cs.uiuc.edu
> *Subject:* Re: [LLVMdev] [RFC] [X86] Mov to push transformation in 
> x86-32 call sequences
>
> According to the Intel performance guidelines, pushes are 
> significantly slower than moves, to the extent that they should be 
> avoided as much as possible.  It's been a decade since I was dealing 
> with this, so I don't remember the numbers, but I'm pretty sure the 
> changes you are proposing will slow the code down.
>
> People who care about speed more than code size are probably not going 
> to like this very much...
>
>
> On 12/21/14 3:17 AM, Kuperstein, Michael M wrote:
>
>     Hello all,
>
>     In r223757 I’ve committed a patch that performs, for the 32-bit
>     x86 calling convention, the transformation of MOV instructions
>     that push function arguments onto the stack into actual PUSH
>     instructions.
>
>     For example, it will transform this:
>
>     subl    $16, %esp
>     movl    $4, 12(%esp)
>     movl    $3, 8(%esp)
>     movl    $2, 4(%esp)
>     movl    $1, (%esp)
>     calll   _func
>     addl    $16, %esp
>
>     Into this:
>
>     pushl   $4
>     pushl   $3
>     pushl   $2
>     pushl   $1
>     calll   _func
>     addl    $16, %esp
>
>     The main motivation for this is code size (a “pushl $4” is 2
>     bytes, a “movl $4, 12(%esp)” is 8 bytes), but there are some other
>     advantages, as shown below.
>
>     The way this works in r223757 is by intercepting call frame
>     simplification in the Prolog/Epilog Inserter, and replacing the
>     mov sequence with pushes. Right now it only handles functions
>     which do not have a reserved call frame (a small minority of
>     cases), and I'd like to extend it to cover other cases where it is
>     profitable.
>
>     The currently implemented approach has a couple of drawbacks:
>
>     1) Push vs. having a reserved call frame:
>     This transformation is always profitable when we do not have a
>     reserved call frame. When a reserved frame can be used, however,
>     there is a trade-off. For example, in a function that contains
>     only one call site, and no other stack allocations, pushes are a
>     clear win, since having a reserved call frame wouldn't save any
>     instructions. On the other hand, if a function contains 10 call
>     sites, and only one of them can use pushes, then it is most
>     probably a loss – not reserving a call frame will cost us 10 add
>     instructions, and the pushes gain very little. I’d like to be able
>     to make the decision on whether we want to have a reserved frame
>     or pushes by considering the entire function. I don't think this
>     can be done in the context of PEI.
>
>     Note that in theory we could have both a reserved call frame and
>     have some specific call sites in the function use pushes, but this
>     is fairly tricky because it requires much more precise tracking of
>     the stack pointer state. That is something I’m not planning to
>     implement at this point.
>
>     2) Register allocation inefficiency:
>     Ideally, pushes can be used to make direct memory-to-memory movs,
>     freeing up registers, and saving quite a lot of code.
>
>     For example, for this (this is obviously a constructed example,
>     but code of this kind does exist in the wild):
>
>     void foo(int a, int b, int c, int d, int e, int f, int g, int h);
>
>     struct st { int arr[8]; };
>
>     void bar(struct st* p)
>     {
>       foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4],
>           p->arr[5], p->arr[6], p->arr[7]);
>     }
>
>     We currently generate (with -m32 -O2) this:
>
>             pushl   %ebp
>             movl    %esp, %ebp
>             pushl   %ebx
>             pushl   %edi
>             pushl   %esi
>             subl    $44, %esp
>             movl    8(%ebp), %eax
>             movl    28(%eax), %ecx
>             movl    %ecx, -20(%ebp)         # 4-byte Spill
>             movl    24(%eax), %edx
>             movl    20(%eax), %esi
>             movl    16(%eax), %edi
>             movl    12(%eax), %ebx
>             movl    8(%eax), %ecx
>             movl    %ecx, -24(%ebp)         # 4-byte Spill
>             movl    (%eax), %ecx
>             movl    %ecx, -16(%ebp)         # 4-byte Spill
>             movl    4(%eax), %eax
>             movl    -20(%ebp), %ecx         # 4-byte Reload
>             movl    %ecx, 28(%esp)
>             movl    %edx, 24(%esp)
>             movl    %esi, 20(%esp)
>             movl    %edi, 16(%esp)
>             movl    %ebx, 12(%esp)
>             movl    -24(%ebp), %ecx         # 4-byte Reload
>             movl    %ecx, 8(%esp)
>             movl    %eax, 4(%esp)
>             movl    -16(%ebp), %eax         # 4-byte Reload
>             movl    %eax, (%esp)
>             calll   _foo
>             addl    $44, %esp
>             popl    %esi
>             popl    %edi
>             popl    %ebx
>             popl    %ebp
>             retl
>
>     Which is fairly horrible.
>
>     Some parameters get mov-ed up to four times - a mov from the
>     struct into a register, a register spill, a reload, and finally a
>     mov onto the stack.
>
>     What we’d like to generate is something like:
>
>             pushl   %ebp
>             movl    %esp, %ebp
>             movl    8(%ebp), %eax
>             pushl   28(%eax)
>             pushl   24(%eax)
>             pushl   20(%eax)
>             pushl   16(%eax)
>             pushl   12(%eax)
>             pushl   8(%eax)
>             pushl   4(%eax)
>             pushl   (%eax)
>             calll   _foo
>             addl    $32, %esp
>             popl    %ebp
>             retl
>
>     To produce the code above, the transformation has to run
>     pre-reg-alloc. Otherwise, even if we fold loads into the push,
>     it's too late to recover from spills.
>
>     The direction I'd like to take with this is:
>
>     1) Add an X86-specific MachineFunctionPass that does the mov ->
>     push transformation and runs pre-reg-alloc.
>
>     It will:
>
>     * Make a decision on whether promoting some (or all) of the call
>     sites to use pushes is worth giving up on the reserved call frame.
>
>     * If it is, perform the mov ->push transformation for the selected
>     call sites.
>
>     * Fold loads into the pushes while doing the transformation.
>
>     As an alternative, I could try to teach the peephole optimizer to
>     do it (right now it won't even try to do this folding because
>     PUSHes store to memory), but getting it right in the general case
>     seems complex.
>
>     I think I'd rather do folding of the simple (but common) cases on
>     the fly.
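For the simple case, that on-the-fly folding amounts to something like the following (MIR shown schematically; implicit-operand lists are elided, and only PUSH32rmm is taken from the proposal above):

```
; before: load the argument into a vreg, then push the vreg
%vreg1 = MOV32rm %vreg0, 1, %noreg, 28, %noreg
PUSH32r %vreg1

; after: push straight from memory, freeing %vreg1
PUSH32rmm %vreg0, 1, %noreg, 28, %noreg
```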
>
>     2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly.
>
>     Doing the mov->push transformation before PEI means I'd have to
>     leave the ADJCALLSTACKDOWN/UP pair unbalanced.
>
>     E.g. something like:
>
>     ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
>     %vreg9<def,dead> = COPY %ESP; GR32:%vreg9
>     PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>     CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def>
>     ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
>
>     This, rightly, gets flagged by the verifier.
>
>     My proposal is to add an additional parameter to ADJCALLSTACKDOWN
>     to express the amount of adjustment the call sequence itself does.
>     This is somewhat similar to the second parameter of ADJCALLSTACKUP,
>     which allows adjustment for callee stack clean-up.
>
>     So, in this case, we will get a "ADJCALLSTACKDOWN32 32, 32"
>     instead of the “ADJCALLSTACKDOWN32 0”. The verifier will be happy,
>     and PEI will know it doesn't need to do any stack pointer adjustment.
>
>     Does this sound like the right approach?
>
>     Any suggestions, as well as warnings of potential pitfalls, are
>     welcome. :-)
>
>     Thanks,
>
>        Michael
>
>     ---------------------------------------------------------------------
>     Intel Israel (74) Limited
>
>     This e-mail and any attachments may contain confidential material for
>     the sole use of the intended recipient(s). Any review or distribution
>     by others is strictly prohibited. If you are not the intended
>     recipient, please contact the sender and delete all copies.
>
>
>
>
>     _______________________________________________
>
>     LLVM Developers mailing list
>
>     LLVMdev at cs.uiuc.edu          http://llvm.cs.uiuc.edu
>
>     http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>

