[PATCH] [X86] Convert esp-relative movs of function arguments to pushes, step 2

Michael Kuperstein michael.m.kuperstein at intel.com
Tue Jan 13 13:38:24 PST 2015


Thanks, Reid!

Waiting for the second part, you didn't get to the really horrible stuff yet...

Regarding the high level, two reasons:

1. It seemed like it was going to be simpler. I'm not so sure anymore, but I still think it is. (Note that we'll still need to fix all of the code that tracks SP adjustment, that's not going away in either case).
2. The main problem is that next step after this is going to be a function-scope heuristic. To use this transformation for even one call-site, I have to disable the reserved frame for the whole function. So, I need to try to approximate the impact on the whole function (which contains some calls that will be converted to use pushes, and some calls that won't be). I don't see how this can be done on the DAG level.


================
Comment at: lib/CodeGen/PrologEpilogInserter.cpp:855-856
@@ +854,4 @@
+    // Note that this must come after eliminateFrameIndex, because 
+    // if I itself referred to a frame index, we shouldn't count its own
+    // adjustment.
+    if (MI && InsideCallSequence)
----------------
rnk wrote:
> This seems like an x86-specific quirk, right? Given "push [esp + 8]", x86 chips will load [esp + 8] before adjusting esp, and I think this code motion accomplishes that.
> 
> I'm OK with that motion so long as there are no other upstream LLVM backends with CISC-y instructions like "push [SP-mem]". :)
This call to SPAdjust() always returns 0 right now (barring the code in this patch), it was added as part of my refactoring in D6863, and I added it in the wrong place.

The motivation here wasn't a push, actually, since I try to never generate push [esp + 8], that's filtered out by the code in the optimization. Although I can probably start generating them - I was trying to filter them out precisely because I didn't want all of this complexity at the first stage, but apparently it's necessary.

The problem is that once we don't have a reserved call frame (regardless of the push transformation), you can have things like
CALL32r <fi#1>, where the call is callee-pop.
So you need to resolve the indirect call using the stack-pointer from before the call.

================
Comment at: lib/Target/X86/X86ConvertMovsToPushes.cpp:82-83
@@ +81,4 @@
+
+  // We currently only support call sequences where *all* parameters.
+  // are passed on the stack.
+  // No point in running this in 64-bit mode, since some arguments are
----------------
rnk wrote:
> I think it's important to at least support __thiscall eventually, since that's a very common convention with one regparm.
Yes, and maybe even for _fastcall (It looks like gcc will do this for fastcall, icc won't). 
But I am still trying to do this gradually, to the extent that I can. :-)

================
Comment at: lib/Target/X86/X86ConvertMovsToPushes.cpp:84-86
@@ +83,5 @@
+  // are passed on the stack.
+  // No point in running this in 64-bit mode, since some arguments are
+  // passed in-register in all common calling conventions, so the pattern
+  // we're looking for will never match.
+  const X86Subtarget &STI = MF.getTarget().getSubtarget<X86Subtarget>();
----------------
rnk wrote:
> I guess I would justify this more in terms of reducing the extra CFI that we would have to emit to describe the SP adjustments. Converting a few movs to pushes isn't worth the complexity.
You're right, that too.

================
Comment at: lib/Target/X86/X86ConvertMovsToPushes.cpp:143
@@ +142,3 @@
+
+  // Stack re-alignment can make this unprofitable even in terms of size.
+  // As mentioned above, a better heuristic is needed. For now, don't do this
----------------
rnk wrote:
> Can you explain why this is unprofitable? I guess if we get here we are in dyanamic alloca plus stack realignment land, i.e. the worst thing that could possibly happen. Is this about extra code for preserving the outgoing stack alignment then? Like on Linux, where we provide 16 byte stack alignment?
If we get here, we're in opt-for-size + stack-realignment land.

And, yes, that's exactly what it is is about.
If you are passing only one parameter, the original code would be:

```
mov %eax, 128(%esp)
call $foo
```

Without re-alignment, you have
```
push %eax
call $foo
add $4, %esp
```
which is still a win in terms of code-size

With re-alignment, you get:
```
sub $16, %esp
push %eax
call $foo
add $12, %esp
```
Which is... questionable. The code size for the sequence is the same (in this case, 7 bytes for both, not including the call), but if you have other call sites which you didn't convert, you may actually lose. And, of course, you lose performance (3 instructions instead of 1) without anything to show for it.

Once there is a heuristic that tries to estimate the overhead, we can address this on a case-by-case basis (e.g. if we have 16-byte stack re-alignment, but most call-sites have a lot of parameters, then it's still worth it.)

http://reviews.llvm.org/D6789

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/






More information about the llvm-commits mailing list