[LLVMdev] [RFC] [X86] Mov to push transformation in x86-32 call sequences
Herbie Robinson
HerbieRobinson at verizon.net
Sun Dec 21 14:11:41 PST 2014
On 12/21/14 4:27 AM, Kuperstein, Michael M wrote:
>
> Which performance guidelines are you referring to?
>
Table C-21 in "Intel® 64 and IA-32 Architectures Optimization Reference
Manual", September 2014.
It hasn't changed. It still lists push and pop instructions as 2-3
times as expensive as a mov. And that's not taking into account any
optimizations that might get undone by the stack pointer changing. I'm
just speculating, but I suspect mov being faster has to do with not
having to modify a register (the stack pointer) every time.
With that as a basis, the fastest entry/exit sequences just use
subl/addl on the stack pointer and don't use push at all. For most C
functions, you don't even have to materialize a frame pointer (if the
unwind mechanisms are set up to handle that). Not that I am
recommending changing the x86_32 code generation to do that.
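As an illustration (a hand-written sketch, not compiler output; _g and
the frame size are invented), such an entry/exit sequence looks like:

    subl  $24, %esp          # one adjustment covers locals + outgoing args
    movl  28(%esp), %eax     # incoming argument, addressed off %esp
    movl  %eax, (%esp)       # outgoing argument stored with mov, not push
    calll _g
    addl  $24, %esp          # single matching release; no frame pointer
    retl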
>
> I’m not that familiar with decade-old CPUs, but to the best of my
> knowledge, this is not true on current hardware.
>
> There is one specific circumstance where PUSHes should be avoided –
> for Atom/Silvermont processors, the memory form of PUSH is
> inefficient, so the register-freeing optimization below may not be
> profitable (see 14.3.3.6 and 15.3.1.2 in the Intel optimization
> reference manual).
>
> Having said that, one distinct possibility is to have the heuristic
> make different decisions depending on the optimization flags, that is,
> be much more aggressive for optsize functions.
>
> *From:* Herbie Robinson [mailto:HerbieRobinson at verizon.net]
> *Sent:* Sunday, December 21, 2014 10:58
> *To:* Kuperstein, Michael M; LLVMdev at cs.uiuc.edu
> *Subject:* Re: [LLVMdev] [RFC] [X86] Mov to push transformation in
> x86-32 call sequences
>
> According to the Intel performance guidelines, pushes are
> significantly slower than moves, to the extent that they should be
> avoided as much as possible. It's been a decade since I dealt with
> this, so I don't remember the numbers, but I'm pretty sure the
> changes you are proposing will slow the code down.
>
> People who care about speed more than code size are probably not going
> to like this very much...
>
>
> On 12/21/14 3:17 AM, Kuperstein, Michael M wrote:
>
> Hello all,
>
> In r223757 I’ve committed a patch that performs, for the 32-bit
> x86 calling convention, the transformation of MOV instructions
> that push function arguments onto the stack into actual PUSH
> instructions.
>
> For example, it will transform this:
>
> subl $16, %esp
> movl $4, 12(%esp)
> movl $3, 8(%esp)
> movl $2, 4(%esp)
> movl $1, (%esp)
> calll _func
> addl $16, %esp
>
> Into this:
>
> pushl $4
> pushl $3
> pushl $2
> pushl $1
> calll _func
> addl $16, %esp
>
> The main motivation for this is code size (a “pushl $4” is 2
> bytes, a “movl $4, 12(%esp)” is 8 bytes), but there are some other
> advantages, as shown below.
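>
> To see where the byte counts come from, compare the encodings (shown
> here purely for illustration):
>
> 6a 04                      pushl $4            # 2 bytes
> c7 44 24 0c 04 00 00 00    movl $4, 12(%esp)   # 8 bytes: opcode, ModRM,
>                                                # SIB, disp8, imm32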
>
> The way this works in r223757 is by intercepting call frame
> simplification in the Prolog/Epilog Inserter, and replacing the
> mov sequence with pushes. Right now it only handles functions
> which do not have a reserved call frame (a small minority of
> cases), and I'd like to extend it to cover other cases where it is
> profitable.
>
> The currently implemented approach has a couple of drawbacks:
>
> 1) Push vs. having a reserved call frame:
> This transformation is always profitable when we do not have a
> reserved call frame. When a reserved frame can be used, however,
> there is a trade-off. For example, in a function that contains
> only one call site, and no other stack allocations, pushes are a
> clear win, since having a reserved call frame wouldn't save any
> instructions. On the other hand, if a function contains 10 call
> sites, and only one of them can use pushes, then it is most
> probably a loss – not reserving a call frame will cost us 10 add
> instructions, and the pushes gain very little. I’d like to be able
> to make the decision on whether we want to have a reserved frame
> or pushes by considering the entire function. I don't think this
> can be done in the context of PEI.
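>
> To put the trade-off in instruction terms, a schematic sketch (not
> compiler output; _f and the sizes are made up):
>
> # pushes: every call site pays its own cleanup
> pushl $1
> calll _f
> addl $4, %esp        # repeated at each of the 10 call sites
>
> # reserved frame: one adjustment, amortized over all call sites
> subl $16, %esp       # once, in the prologue
> movl $1, (%esp)
> calll _f             # no per-site cleanup needed
> addl $16, %esp       # once, in the epilogue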
>
> Note that in theory we could have both a reserved call frame and
> have some specific call sites in the function use pushes, but this
> is fairly tricky because it requires much more precise tracking of
> the stack pointer state. That is something I’m not planning to
> implement at this point.
>
> 2) Register allocation inefficiency:
> Ideally, pushes can be used to make direct memory-to-memory movs,
> freeing up registers, and saving quite a lot of code.
>
> For example, for this (this is obviously a constructed example,
> but code of this kind does exist in the wild):
>
> void foo(int a, int b, int c, int d, int e, int f, int g, int h);
>
> struct st { int arr[8]; };
>
> void bar(struct st* p)
> {
>   foo(p->arr[0], p->arr[1], p->arr[2], p->arr[3], p->arr[4],
>       p->arr[5], p->arr[6], p->arr[7]);
> }
>
> We currently generate (with -m32 -O2) this:
>
> pushl %ebp
> movl %esp, %ebp
> pushl %ebx
> pushl %edi
> pushl %esi
> subl $44, %esp
> movl 8(%ebp), %eax
> movl 28(%eax), %ecx
> movl %ecx, -20(%ebp) # 4-byte Spill
> movl 24(%eax), %edx
> movl 20(%eax), %esi
> movl 16(%eax), %edi
> movl 12(%eax), %ebx
> movl 8(%eax), %ecx
> movl %ecx, -24(%ebp) # 4-byte Spill
> movl (%eax), %ecx
> movl %ecx, -16(%ebp) # 4-byte Spill
> movl 4(%eax), %eax
> movl -20(%ebp), %ecx # 4-byte Reload
> movl %ecx, 28(%esp)
> movl %edx, 24(%esp)
> movl %esi, 20(%esp)
> movl %edi, 16(%esp)
> movl %ebx, 12(%esp)
> movl -24(%ebp), %ecx # 4-byte Reload
> movl %ecx, 8(%esp)
> movl %eax, 4(%esp)
> movl -16(%ebp), %eax # 4-byte Reload
> movl %eax, (%esp)
> calll _foo
> addl $44, %esp
> popl %esi
> popl %edi
> popl %ebx
> popl %ebp
> retl
>
> Which is fairly horrible.
>
> Some parameters get mov-ed up to four times - a mov from the
> struct into a register, a register spill, a reload, and finally a
> mov onto the stack.
>
> What we’d like to generate is something like:
>
> pushl %ebp
> movl %esp, %ebp
> movl 8(%ebp), %eax
> pushl 28(%eax)
> pushl 24(%eax)
> pushl 20(%eax)
> pushl 16(%eax)
> pushl 12(%eax)
> pushl 8(%eax)
> pushl 4(%eax)
> pushl (%eax)
> calll _foo
> addl $32, %esp
> popl %ebp
> retl
>
> To produce the code above, the transformation has to run
> pre-reg-alloc. Otherwise, even if we fold loads into the push,
> it's too late to recover from spills.
>
> The direction I'd like to take with this is:
>
> 1) Add an X86-specific MachineFunctionPass that does the mov ->
> push transformation and runs pre-reg-alloc.
>
> It will:
>
> * Make a decision on whether promoting some (or all) of the call
> sites to use pushes is worth giving up on the reserved call frame.
>
> * If it is, perform the mov ->push transformation for the selected
> call sites.
>
> * Fold loads into the pushes while doing the transformation.
>
> As an alternative, I could try to teach the peephole optimizer to
> do it (right now it won't even try to do this folding because
> PUSHes store to memory), but getting it right in the general case
> seems complex.
>
> I think I'd rather do folding of the simple (but common) cases on
> the fly.
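>
> (For the record, the folding itself is straightforward at the
> instruction level, e.g. turning
>
> movl 28(%eax), %ecx
> pushl %ecx
>
> into
>
> pushl 28(%eax)
>
> whenever neither %eax nor the loaded memory can change in between;
> the complexity is in proving that safely in the general case.)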
>
> 2) Alter the semantics of ADJCALLSTACKDOWN/ADJCALLSTACKUP slightly.
>
> Doing the mov->push transformation before PEI means I'd have to
> leave the ADJCALLSTACKDOWN/UP pair unbalanced.
>
> E.g. something like:
>
> ADJCALLSTACKDOWN32 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
>
> %vreg9<def,dead> = COPY %ESP; GR32:%vreg9
>
> PUSH32rmm %vreg0, 1, %noreg, 28, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0, 1, %noreg, 24, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0, 1, %noreg, 20, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0, 1, %noreg, 16, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0, 1, %noreg, 12, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0, 1, %noreg, 8, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0, 1, %noreg, 4, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> PUSH32rmm %vreg0<kill>, 1, %noreg, 0, %noreg, %ESP<imp-def>, %ESP<imp-use>; GR32:%vreg0
>
> CALLpcrel32 <ga:@foo>, <regmask>, %ESP<imp-use>, %ESP<imp-def>
>
> ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
>
> This, rightly, gets flagged by the verifier.
>
> My proposal is to add an additional parameter to ADJCALLSTACKDOWN
> to express the amount of adjustment the call sequence itself does.
> This is somewhat similar to the second parameter of ADJCALLSTACKUP,
> which allows adjustment for callee stack-clean-up.
>
> So, in this case, we will get an “ADJCALLSTACKDOWN32 32, 32”
> instead of the “ADJCALLSTACKDOWN32 0”. The verifier will be happy,
> and PEI will know it doesn't need to do any stack pointer adjustment.
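>
> In the example above, the bracketing would then look like this
> (a sketch of the proposed form):
>
> ADJCALLSTACKDOWN32 32, 32, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
> ... the eight PUSH32rmm instructions and the CALLpcrel32, as above ...
> ADJCALLSTACKUP32 32, 0, %ESP<imp-def>, %EFLAGS<imp-def,dead>, %ESP<imp-use>
>
> with the 32 bytes pushed by the call sequence itself now accounted
> for by the DOWN/UP pair.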
>
> Does this sound like the right approach?
>
> Any suggestions, as well as warnings of potential pitfalls, are
> welcome. :-)
>
> Thanks,
>
> Michael