[PATCH] D20003: X86CallFrameOpt: a first step towards optimizing inalloca calls (PR27076)

Mon May 9 12:03:28 PDT 2016

rnk requested changes to this revision.
rnk added a comment.
This revision now requires changes to proceed.

I think special casing inalloca allocas in X86TargetLowering::LowerDYNAMIC_STACKALLOC will be a better start for this.

---

In http://reviews.llvm.org/D20003#423533, @DavidKreitzer wrote:

> I think X86CallFrameOptimization is the wrong place to be trying to eliminate the _chkstk calls for inalloca. The ability to do the store-to-push optimization has no bearing on whether the chkstk call is needed. Your comment about the pushes naturally probing the stack is true, but it is also true of the original stores.


I agree, we should avoid the chkstk at a higher level. Eventually, though, we wanted to do general conversion of inalloca to a sequence of subs, leas, and pushes. I figured that would live here.

> I think David had the right idea in r262370. But I assume the problem he ran into is that clang is nesting these inalloca calls, so it isn't easy to tell how much stack space is ultimately going to be allocated. Consider a case like this:

>  ... snip

>  See how there are two inalloca calls at the top? They effectively grow the stack by 6000 bytes (which is big enough to require _chkstk even though the separate 3000-byte allocations are not). Neither MSVC nor ICC do it like this. They both allocate space for the call to f2, call f2, and cleanup the stack from calling f2 before allocating space for the call to f1. I think we need to fix clang to do something similar and then David's solution ought to work.


We don't want to try to follow ICC or MSVC here. We want to do the copy elision instead, since it's actually required in C++17. IMO we should detect these large argument allocation cases and probe the stack.

> For the example in pr27076, another thing clang could do is just avoid inalloca altogether. We shouldn't need it for passing objects that use the default copy constructor.


inalloca is used whenever a type is not trivially copyable, meaning it has a non-trivial copy constructor or destructor. My understanding is that we aren't allowed to introduce extra copies of non-trivially copyable objects, so we have to use inalloca here, at least in C++17.


================
Comment at: lib/Target/X86/X86CallFrameOptimization.cpp:56
@@ +55,3 @@
+  // Information about the setup for an inalloca call.
+  struct InAllocaInfo {
+    // Frame setup for the _chkstk call.
----------------
This makes me feel like we should model argument allocation as a single operation instead of five or so. What do you think about changing X86TargetLowering::LowerDYNAMIC_STACKALLOC to emit a new DAG node which selects to a single MI?

================
Comment at: lib/Target/X86/X86CallFrameOptimization.cpp:349
@@ -316,1 +348,3 @@
 
+bool X86CallFrameOptimization::matchInAlloca(MachineBasicBlock::iterator &I,
+                                             InAllocaInfo &Info,
----------------
Yeah, this feels like too much pattern matching.


http://reviews.llvm.org/D20003