[PATCH] D20263: X86: Avoid using _chkstk when lowering WIN_ALLOCA instructions

Mon May 16 15:39:18 PDT 2016

DavidKreitzer added a comment.

Hi Hans,

I made a few detailed comments, but I have higher level concerns about the approach. This new pass will eliminate most of the _chkstk calls, which is great. But we'll still be leaving some performance on the table. We'd really like calls that take inalloca arguments to be optimized in all the same ways as other calls. In particular, we'd like to be able to reserve the outgoing argument block & do the store-to-push optimization for outgoing arguments.

I think the "inalloca" name is unfortunate, because these "WIN_ALLOCA" stack allocations have more in common with ADJCALLSTACKDOWN than they do with alloca. For one thing, they always allocate a fixed amount of space. Additionally, there is an implicit requirement that the stack pointer value immediately following the WIN_ALLOCA matches the stack pointer value at the time of the call to the inalloca function. This requirement is not currently enforced correctly in some cases. You'll find that this program doesn't work with clang on Win32, because the stack space for the _alloca call is allocated *after* the stack space for the "WIN_ALLOCA":

  #include <stdio.h>
  #include <malloc.h>
  struct X
  {
      int x;
      X() {x = 42;}
      X(const X&v) { x = v.x; }
  };

  X x;

  void f2(X v, void*p) { printf("v.x = %d (should be 42)\n", v.x); }
  void f(int n)
  {
      f2(x, _alloca(n));
  }

  int main()
  {
      f(1000);
      return 0;
  }

Fixing that bug may be orthogonal to how we model & implement inalloca calls in the back end. But that illustrates why I think a more accurate modeling would be to move the ADJCALLSTACKDOWN instruction that precedes the inalloca call (i.e. the ADJCALLSTACKDOWN instruction that is hacked to always use an allocation size of 0) to the current location of the WIN_ALLOCA, use an accurate allocation size rather than 0, and get rid of the WIN_ALLOCA altogether. That would have the effect of introducing nested ADJCALLSTACKDOWN/ADJCALLSTACKUP structures, which I'm sure is going to cause some implementation problems. But I think it will allow us to more naturally handle & optimize these calls during FrameLowering. It's kind of mess, but I think it is just a natural consequence of Microsoft's choice to construct inalloca arguments in the outgoing argument block rather than in an arbitrary location on the stack and passing a pointer to that location as is done on Linux.

I'll try to put together a prototype. That might make it easier to see what I have in mind.

Meanwhile, I have no objection to using this new pass as a shorter term performance fix. :)

Thanks,
-Dave

BTW, any thoughts on how to fix the above bug? This is another thing that I think ought to be handled by clang.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:50
@@ +49,3 @@
+  /// Compute which lowering to use for each WinAlloca instruction.
+  LoweringMap computeLowerings(MachineFunction &MF);
+
----------------
Do you really want to return a LoweringMap? This will cause the entire container to be copied. Maybe pass in a LoweringMap& and fill it in instead?

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:99
@@ +98,3 @@
+X86WinAllocaExpander::getLowering(int64_t CurrentOffset,
+                                   int64_t AllocaAmount) {
+  // For a non-constant amount or a large amount, we have to probe.
----------------
The indentation looks off here.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:101
@@ +100,3 @@
+  // For a non-constant amount or a large amount, we have to probe.
+  if (AllocaAmount == -1 || AllocaAmount > StackProbeSize)
+    return Probe;
----------------
To be extra safe, you could assert that AllocaAmount is >= -1.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:109
@@ +108,3 @@
+  // Otherwise, touch the current tip of the stack, then subtract.
+  assert(AllocaAmount <= StackProbeSize);
+  return TouchAndSub;
----------------
This assertion seems unnecessary given line 101.

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:126
@@ +125,3 @@
+  // Initialize Out[B], the stack offset at exit from B, to something big.
+  DenseMap<MachineBasicBlock *, int64_t> Out;
+  for (MachineBasicBlock *MBB : RPO)
----------------
"Out" isn't very descriptive. Can you name this differently, e.g. OutOffset?

================
Comment at: lib/Target/X86/X86WinAllocaExpander.cpp:152
@@ +151,3 @@
+        Lowerings[&MI] = L;
+        switch (L) {
+        case Sub:
----------------
You should probably add a default: notreached().

http://reviews.llvm.org/D20263