[PATCH] D20263: X86: Avoid using _chkstk when lowering WIN_ALLOCA instructions

Mon May 16 16:39:22 PDT 2016

hans marked 18 inline comments as done.
hans added a comment.

Thanks for the great comments everyone!

David, I agree this only addresses a small part of the problem and leaves a lot of performance on the table. The way I think about it is as an incremental improvement, easier than tackling the full inalloca problem all at once, and also helpful for dynamic allocas in general.

Having the WIN_ALLOCA pseudos expanded late might also help getting better code for these calls eventually, as it should remove the need for some of the pattern matching I tried in http://reviews.llvm.org/D20003 (unless we manage to fix them with some higher-level approach).


================
Comment at: include/llvm/CodeGen/MachineFrameInfo.h:282
@@ +281,3 @@
+  /// Whether the function has WIN_ALLOCA instructions.
+  bool HasWinAlloca = false;
+
----------------
rnk wrote:
> I think WIN_ALLOCA is x86-specific, so a better place for this would be lib/Target/X86/X86MachineFunctionInfo.h
Done.

================
Comment at: lib/Target/X86/X86.h:62
@@ -61,1 +61,3 @@
 
+/// XXX
+FunctionPass *createX86WinAllocaExpander();
----------------
rnk wrote:
> mkuper wrote:
> > I think this needs a description. :-)
> write the comment :)
Done.

================
Comment at: lib/Target/X86/X86ISelLowering.cpp:16606
@@ -16609,1 +16605,3 @@
+    unsigned SizeReg = MRI.createVirtualRegister(getRegClassFor(SPTy));
+    Chain = DAG.getCopyToReg(Chain, dl, SizeReg, Size);
     SDVTList NodeTys = DAG.getVTList(MVT::Other, MVT::Glue);
----------------
rnk wrote:
> mkuper wrote:
> > Do we still need a CopyToReg? The previous code had it because it needed specifically a copy to EAX glued to the future-chkstk. Can you use Size directly?
> I wonder if we could do something clever to avoid creating a copy of an immediate. Might be too much work.
We can avoid creating the copy here, as Michael pointed out, but since the MI instruction expects a register (because the amount might not be a constant), there will spill be a copy at the MI level.

We could create a variant of the pseudo-instruction that takes an immediate, but I'm not sure that would end up being a simplification.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:50
@@ +49,3 @@
+  /// Compute which lowering to use for each WinAlloca instruction.
+  LoweringMap computeLowerings(MachineFunction &MF);
+
----------------
mkuper wrote:
> DavidKreitzer wrote:
> > Do you really want to return a LoweringMap? This will cause the entire container to be copied. Maybe pass in a LoweringMap& and fill it in instead?
> Won't this be ok with RVO?
I figured the call would be inlined anyway, and the compiler could avoid the copy.

Anyway, I've changed the call to return via a reference parameter instead, as that seems more idiomatic.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:99
@@ +98,3 @@
+X86WinAllocaExpander::getLowering(int64_t CurrentOffset,
+                                   int64_t AllocaAmount) {
+  // For a non-constant amount or a large amount, we have to probe.
----------------
DavidKreitzer wrote:
> The indentation looks off here.
Done.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:101
@@ +100,3 @@
+  // For a non-constant amount or a large amount, we have to probe.
+  if (AllocaAmount == -1 || AllocaAmount > StackProbeSize)
+    return Probe;
----------------
DavidKreitzer wrote:
> To be extra safe, you could assert that AllocaAmount is >= -1.
Hmm, I'll just change it to check AllocaAmount < 0 in the condition instead.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:109
@@ +108,3 @@
+  // Otherwise, touch the current tip of the stack, then subtract.
+  assert(AllocaAmount <= StackProbeSize);
+  return TouchAndSub;
----------------
DavidKreitzer wrote:
> This assertion seems unnecessary given line 101.
Removed.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:116
@@ +115,3 @@
+  // Do a one-pass reverse post-order walk of the CFG to conservatively estimate
+  // the offset between the stack pointer and the lowerst touched part of the
+  // stack, and use that to decide how to lower each WinAlloca instruction.
----------------
rnk wrote:
> s/lowerst/lowest/
Done.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:119
@@ +118,3 @@
+
+  // Compute the reverse post-order.
+  SmallVector<MachineBasicBlock*, 16> RPO;
----------------
rnk wrote:
> This can be:
>   ReversePostOrderTraversal<MachineFunction *> RPOT(&MF);
Done.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:126
@@ +125,3 @@
+  // Initialize Out[B], the stack offset at exit from B, to something big.
+  DenseMap<MachineBasicBlock *, int64_t> Out;
+  for (MachineBasicBlock *MBB : RPO)
----------------
DavidKreitzer wrote:
> "Out" isn't very descriptive. Can you name this differently, e.g. OutOffset?
Done.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:152
@@ +151,3 @@
+        Lowerings[&MI] = L;
+        switch (L) {
+        case Sub:
----------------
DavidKreitzer wrote:
> You should probably add a default: notreached().
I left it out intentionally, relying on -Wswitch to detect any missing cases: http://llvm.org/docs/CodingStandards.html#don-t-use-default-labels-in-fully-covered-switches-over-enumerations

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:166
@@ +165,3 @@
+      // A stackrestore makes the offset unknown.
+      if (MI.isCopy() && MI.getOperand(0).isReg() &&
+          MI.getOperand(0).getReg() == StackPtr)
----------------
mkuper wrote:
> Are you sure this is conservative enough?
> Perhaps blacklist anything that defs the stack pointer, except for a specific whitelist (calls, pushes, pops...) that touches the tip? Or would we get a huge whitelist?
I couldn't come up with any example code that would break this in practice. But you're right, a white-list would be better, and I don't think it will be big.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:168
@@ +167,3 @@
+          MI.getOperand(0).getReg() == StackPtr)
+        Offset = -1;
+    }
----------------
rnk wrote:
> Should this be INT32_MAX instead? It seems like we could do alloca, save+restore, alloca and run over the guard page this way. Maybe a good test case?
Oops, -1 was a leftover from an earlier version of the patch. I've added a test.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:197
@@ +196,3 @@
+  if (Amount != -1) {
+    assert((Amount % (Is64Bit ? 8 : 4) == 0) && "Stack would not be aligned!");
+  }
----------------
mkuper wrote:
> rnk wrote:
> > Maybe cache `Is64Bit ? 8 : 4` as a `SlotSize` member variable. Also RAX as... some variable. RegA?
> We can probably just query MRI for getSlotSize instead.
> 
Done.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:197
@@ +196,3 @@
+  if (Amount != -1) {
+    assert((Amount % (Is64Bit ? 8 : 4) == 0) && "Stack would not be aligned!");
+  }
----------------
hans wrote:
> mkuper wrote:
> > rnk wrote:
> > > Maybe cache `Is64Bit ? 8 : 4` as a `SlotSize` member variable. Also RAX as... some variable. RegA?
> > We can probably just query MRI for getSlotSize instead.
> > 
> Done.
Done.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:263
@@ +262,3 @@
+  if (MF.getFunction()->hasFnAttribute("stack-probe-size")) {
+    MF.getFunction()
+        ->getFnAttribute("stack-probe-size")
----------------
mkuper wrote:
> StackProbeSize = MF.... ?
> (Perhaps add a test for this?)
It's passed as an argument to .getAsInteger below. Or are you saying there's some shorter way to get at it?

Added a test.


http://reviews.llvm.org/D20263