[PATCH] D80951: [GlobalOpt] Remove preallocated calls when possible

Mon Jun 22 13:59:44 PDT 2020

aeubanks marked an inline comment as done.
aeubanks added inline comments.

================
Comment at: llvm/lib/Transforms/IPO/GlobalOpt.cpp:2334
+
+    // FIXME: This doesn't handle invoke
+    Builder.SetInsertPoint(NewCB->getNextNonDebugInstruction());
----------------
efriedma wrote:
> aeubanks wrote:
> > aeubanks wrote:
> > > efriedma wrote:
> > > > aeubanks wrote:
> > > > > efriedma wrote:
> > > > > > aeubanks wrote:
> > > > > > > efriedma wrote:
> > > > > > > > "This doesn't handle invoke" is rough...
> > > > > > > > 
> > > > > > > > In the normal destination, you can stick a stackrestore as the first instruction (breaking the critical edge if necessary).  In the unwind destination, I'm not sure.  For Itanium-style unwinding, you could just stick it immediately after the landingpad, but I'm not sure what you can do on Windows; you can't stackrestore in a funclet.
> > > > > > > > 
> > > > > > > > I guess ultimately, we really want to use a static alloca (in the entry block) in most cases, but it's not safe in all cases, as we've discussed before.
> > > > > > > > 
> > > > > > > > At the very least, this probably should bail out somehow, not crash.
> > > > > > > Bailed out for any invokes.
> > > > > > > 
> > > > > > > I've been trying to understand how inalloca handles this and still don't quite understand.
> > > > > > > 
> > > > > > > 
> > > > > > > ```
> > > > > > > void bar2() {
> > > > > > >  for (int i = 0; i < 1000000; ++i) {
> > > > > > >   try {
> > > > > > >    foo(A(), 5);
> > > > > > >   } catch (int) {}
> > > > > > >  }
> > > > > > > }
> > > > > > > ```
> > > > > > > 
> > > > > > > becomes
> > > > > > > 
> > > > > > > ```
> > > > > > > define dso_local void @"?bar2@@YAXXZ"() local_unnamed_addr #0 personality i8* bitcast (i32 (...)* @__CxxFrameHandler3 to i8*) {
> > > > > > > entry:
> > > > > > >   %agg.tmp.ensured = alloca %struct.A, align 8
> > > > > > >   br label %for.body
> > > > > > > 
> > > > > > > for.cond.cleanup:                                 ; preds = %for.inc
> > > > > > >   ret void
> > > > > > > 
> > > > > > > for.body:                                         ; preds = %for.inc, %entry
> > > > > > >   %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.inc ]
> > > > > > >   %inalloca.save = call i8* @llvm.stacksave()
> > > > > > >   %argmem = alloca inalloca <{ %struct.A*, %struct.A, i32 }>, align 4
> > > > > > >   %0 = getelementptr inbounds <{ %struct.A*, %struct.A, i32 }>, <{ %struct.A*, %struct.A, i32 }>* %argmem, i32 0, i32 1
> > > > > > >   %call = call x86_thiscallcc %struct.A* @"??0A@@QAE at XZ"(%struct.A* nonnull %0) #2
> > > > > > >   %1 = getelementptr inbounds <{ %struct.A*, %struct.A, i32 }>, <{ %struct.A*, %struct.A, i32 }>* %argmem, i32 0, i32 0
> > > > > > >   store %struct.A* %agg.tmp.ensured, %struct.A** %1, align 4
> > > > > > >   %2 = getelementptr inbounds <{ %struct.A*, %struct.A, i32 }>, <{ %struct.A*, %struct.A, i32 }>* %argmem, i32 0, i32 2
> > > > > > >   store i32 5, i32* %2, align 4, !tbaa !3
> > > > > > >   %call1 = invoke %struct.A* @"?foo@@YA?AUA@@U1 at H@Z"(<{ %struct.A*, %struct.A, i32 }>* inalloca nonnull %argmem)
> > > > > > >           to label %invoke.cont unwind label %catch.dispatch
> > > > > > > 
> > > > > > > catch.dispatch:                                   ; preds = %for.body
> > > > > > >   %3 = catchswitch within none [label %catch] unwind to caller
> > > > > > > 
> > > > > > > catch:                                            ; preds = %catch.dispatch
> > > > > > >   %4 = catchpad within %3 [%rtti.TypeDescriptor2* @"??_R0H at 8", i32 0, i8* null]
> > > > > > >   catchret from %4 to label %for.inc
> > > > > > > 
> > > > > > > for.inc:                                          ; preds = %invoke.cont, %catch
> > > > > > >   %inc = add nuw nsw i32 %i.05, 1
> > > > > > >   %exitcond = icmp eq i32 %inc, 1000000
> > > > > > >   br i1 %exitcond, label %for.cond.cleanup, label %for.body
> > > > > > > 
> > > > > > > invoke.cont:                                      ; preds = %for.body
> > > > > > >   call void @llvm.stackrestore(i8* %inalloca.save)
> > > > > > >   call x86_thiscallcc void @"??1A@@QAE at XZ"(%struct.A* nonnull %agg.tmp.ensured) #2
> > > > > > >   br label %for.inc
> > > > > > > }
> > > > > > > ```
> > > > > > > 
> > > > > > > The `@llvm.stackrestore()` only gets called in the non-exceptional case, but I verified that this doesn't leak stack memory when `foo()` always throws. I'm not sure how that happens.
> > > > > > You're not testing what you want to test; the problem isn't when the function with the inalloca argument itself throws. The behavior is obvious in that case: the alloca gets consumed by the call.  The issue is what happens if an exception gets thrown before that.  Expanding out your example:
> > > > > > 
> > > > > > ```
> > > > > > struct A { A() { throw 1; } ~A(); };
> > > > > > void foo(A a, int z); // Never gets called
> > > > > > void bar() {
> > > > > >  for (int i = 0; i < 1000000; ++i) {
> > > > > >   try {
> > > > > >    foo(A(), 5);
> > > > > >   } catch (int) {}
> > > > > >  }
> > > > > > }
> > > > > > ```
> > > > > > 
> > > > > > I'm pretty sure with inalloca, the memory just leaks.  I guess preallocated has the same problem,
> > > > > > 
> > > > > > Not sure how we solve it with preallocated.  I guess in the backend, we need to apply some sort of adjustment after a catchret.  Not sure how we compute the size of that adjustment, though.  I guess we need to note on each catchret/cleanupret which allocations die at that point.
> > > > > I realized that at some point and I've been trying out different variations of `A()` and `foo()` both randomly throwing, plus nested calls to `foo()`, but it seems like there's never a stack leak with inalloca with `@llvm.stackrestore`s removed. The `@llvm.stackrestore` only appears in the non-exceptional path, so it can't possibly help prevent stack leaks due to exceptions? And in the non-exceptional case the stack will get cleaned up as expected. So I'm not sure that the stack restores are necessary for inalloca?
> > > > > 
> > > > > I am still trying to understand how stack cleanups work in an exception handler though.
> > > > > 
> > > > > ```
> > > > > void bar() {
> > > > >         for (int i = 0; i < 1000000; ++i) {
> > > > >                 try {
> > > > >                         foo(foo(A(), A(), 5), A(), 6);
> > > > >                 } catch (int) {
> > > > >                 }
> > > > >         }
> > > > > }
> > > > > ```
> > > > Try the following for some interesting results at -O0.
> > > > 
> > > > ```
> > > > #include <cstdio>
> > > > #define INLINE __attribute((always_inline))
> > > > //#define INLINE __attribute((noinline))
> > > > struct A {
> > > > __attribute((optnone)) A(int) {z[1]=10;}
> > > > __attribute((optnone)) A() { z[1]=5; throw 1; }
> > > > __attribute((optnone)) ~A() { printf("%d\n", z[1]); }
> > > > int z[10000];
> > > > };
> > > > __attribute((optnone)) void foo(int z, A a) { a.z[0] = 100; }
> > > > INLINE static int inner() { try {foo(1, A());}catch(int){} return 3; }
> > > > void bar() {
> > > >  for (int i = 0; i < 10; ++i) {
> > > >    foo(inner(), A(1));
> > > >  }
> > > > }
> > > > int main() { bar(); }
> > > > ```
> > > Interesting...
> > > noinline is fine (with and without removing the stack restores), but always_inline produces incorrect results. Keeping the stack restores, the printfs print a random value (consistent within a run), and removing the outer stack restore makes it crash after the first iteration.
> > > 
> > > So this is a bug with the current implementation of inalloca? I can't tell if it's specifically an inalloca/exception/inlining issue (the LLVM IR looks fine at a quick glance) or a deeper exception handling issue.
> > > 
> > > This reminds me of https://gcc.gnu.org/onlinedocs/gcc/Statement-Exprs.html which Reid has brought up to me before, where you can apparently jump out while in the middle of constructing arguments.
> > I tried putting the outer stackrestore in the exception path instead of the non-exceptional path and that makes it produce the right output without stack leaks.
> > 
> > Filed https://bugs.llvm.org/show_bug.cgi?id=46386.
> One, it looks like llvm::CloneBasicBlock doesn't recognize inalloca allocations as dynamic in some cases.  That's pretty clearly a bug, and I think fixing that might be enough to deal with my earlier testcase.  But really, it shouldn't matter: the frontend should wrap inalloca allocations with appropriate stacksave/stackrestore operations anyway.
> 
> Two, it looks like clang has some general issues with the interaction between dynamic allocation and exceptions.  Take the following testcase:
> 
> ```
> int z = 10000;
> __attribute((optnone)) int bar(int*z) { int a = *z; throw a;}
> int main() {
>     for (int i = 0; i < 1000; ++i) {
>     try {
>         int a[z];
>         bar(a);
>         __builtin_trap();
>     } catch (...) {
>     }
>     }
> }
> ```
> 
> This currently overflows the stack on Linux because there isn't an appropriate stackrestore anywhere.
> 
> Three, I suspect that unwinding on 32-bit Windows is freeing dynamic allocations in cases where it shouldn't, and this is hiding some inalloca issues.  Don't have a good testcase for this; this is just what I suspect from not seeing stack overflows in cases where the IR indicates there should be stack overflows.
Fixing CloneBasicBlock doesn't seem to fix the example you gave.

Your example makes more sense as to why the stackrestore is necessary, thanks.

(also I really do appreciate you taking the time to look into things and explain things to me)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80951/new/

https://reviews.llvm.org/D80951