[PATCH] D136285: Bad optimization with alloca and intrinsic function stackrestore

Fri Nov 4 07:21:30 PDT 2022

jamieschmeiser abandoned this revision.
jamieschmeiser added inline comments.

================
Comment at: llvm/test/Transforms/MemCpyOpt/stackrestore.ll:97
+  %SS = tail call ptr @llvm.stacksave()
+  %A2 = alloca [56 x i8], align 4
+  store i8 1, ptr %A2, align 4
----------------
efriedma wrote:
> jamieschmeiser wrote:
> > rnk wrote:
> > > jamieschmeiser wrote:
> > > > rnk wrote:
> > > > > My expectation is that all the allocas here are static: They are part of the entry block, they will be part of the initial stack allocation, they will not be affected by stacksave/restore. This does mean that simplifycfg can extend the lifetime of a stack allocation, but to my knowledge, that's valid, the program should have the same observable behavior.
> > > > simplifycfg extending the lifetime of a stack allocation does have observable behaviour changes.  As pointed out in previous comments, it seems that stackrestore should not have tail on it, so consider the test IR without tail call on stackrestore and the code not in the entry block.
> > > > 
> > > > ```
> > > > define dso_local signext i32 @test() {
> > > > associate006_entry:
> > > >   br label %b1
> > > > 
> > > > b1:
> > > >   %A1 = alloca [56 x i8], align 8
> > > >   %SS = tail call ptr @llvm.stacksave()
> > > >   %A2 = alloca [56 x i8], align 4
> > > >   store i8 1, ptr %A2, align 4
> > > >   %GEP1 = getelementptr inbounds i8, ptr %A2, i32 8
> > > >   store i8 1, ptr %GEP1, align 4
> > > >   %GEP2 = getelementptr inbounds i8, ptr %A2, i32 12
> > > >   store i8 1, ptr %GEP2, align 4
> > > >   call void @llvm.memcpy.p0.p0.i32(ptr noundef nonnull align 8 dereferenceable(56) %A1, ptr noundef nonnull align 4 dereferenceable(56) %A2, i32 56, i1 false)
> > > >   call void @llvm.stackrestore(ptr %SS)
> > > >   %A3 = alloca [56 x i8], align 4
> > > >   %uglygep123 = getelementptr i8, ptr %A3, i32 0
> > > >   call void @llvm.memcpy.p0.p0.i32(ptr noundef nonnull align 1 dereferenceable(56) %uglygep123, ptr noundef nonnull align 8 dereferenceable(56) %A1, i32 56, i1 false)
> > > >   ret i32 0
> > > > }
> > > > ```
> > > > This will not pass the isStaticAlloca query and (due to the removed tail call) on stackrestore, no optimization will be performed by memcpyopt.  If simplifycfg is performed first, the two blocks are collapsed, this becomes the entry block and memcpyopt will do the optimization, changing the observable behaviour.
> > > As long as the lifetimes of A1, A2, and A3 all last for the entire stack, I don't see any problems with the transform. From what I can tell using llc, they are all static allocas and the result of the program is the same.
> > > 
> > > I think you are right that stackrestore should not have a tail call marker on it, but that doesn't have any bearing on whether these are static or dynamic allocas.
> > > 
> > > Maybe this test is overreduced. Can you say more about why this transform is causing problems?
> > It isn't clear from the test exactly what is going wrong, so I've added a bit more to illustrate, as well as comments tracing the values.
> > You can see the optimization performed using  opt 1028.ll -passes=memcpyopt -S -print-changed=diff-quiet > /dev/
> > The contents of A3 will be different before and after the optimization.
> > 
> > ```
> > ---- 1028.ll ----
> > ; Function Attrs: nobuiltin norecurse
> > define dso_local signext i32 @test() {
> > associate006_entry:
> >   %A1 = alloca [2 x i8], align 8
> >   store i8 0, ptr %A1, align 4
> >   ; A1[0] == 0
> >   %G1 = getelementptr inbounds i8, ptr %A1, i32 1
> >   store i8 0, ptr %G1, align 4
> >   ; A1[1] == 0
> >   %SS = tail call ptr @llvm.stacksave()
> >   %A2 = alloca [2 x i8], align 4
> >   store i8 1, ptr %A2, align 4
> >   ; A2[0] == 1
> >   %GEP1 = getelementptr inbounds i8, ptr %A2, i32 1
> >   store i8 1, ptr %GEP1, align 4
> >   ; A2[1] == 1
> >   call void @llvm.memcpy.p0.p0.i32(ptr noundef nonnull align 8 dereferenceable(2) %A1, ptr noundef nonnull align 4 dereferenceable(2) %A2, i32 2, i1 false)
> >   ; A1[0] == 1, A1[1] == 1
> >   tail call void @llvm.stackrestore(ptr %SS)
> >   ; A1[0] == 0, A1[1] == 0
> >   %A3 = alloca [2 x i8], align 4
> >   %uglygep123 = getelementptr i8, ptr %A3, i32 0
> >   call void @llvm.memcpy.p0.p0.i32(ptr noundef nonnull align 1 dereferenceable(2) %uglygep123, ptr noundef nonnull align 8 dereferenceable(2) %A1, i32 2, i1 false)
> >   ; A3[0] == A1[0] == 0, A3[1] == A1[1] == 0
> >   ; memcpyopt changes IR to
> >   ;   call void @llvm.memcpy.p0.p0.i32(ptr align 1 %uglygep123, ptr align 4 %A2, i32 2, i1 false)
> >   ; A3[0] == A2[0] == 1, A3[1] == A2[1] == 1
> > 
> >   ret i32 0
> > }
> > 
> > ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn
> > declare ptr @llvm.stacksave()
> > 
> > ; Function Attrs: mustprogress nocallback nofree nosync nounwind willreturn
> > declare void @llvm.stackrestore(ptr)
> > 
> > ; Function Attrs: argmemonly mustprogress nocallback nofree nounwind willreturn
> > declare void @llvm.memcpy.p0.p0.i32(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i32, i1 immarg)
> > 
> > ```
> >   ; A1[0] == 1, A1[1] == 1
> >   tail call void @llvm.stackrestore(ptr %SS)
> >   ; A1[0] == 0, A1[1] == 0
> 
> If you compile the given example with "llc -O0", you'll see that `@llvm.stacksave` corresponds to `movq %rsp, %rax`, and `@llvm.stackrestore` corresponds to `movq %rax, %rsp`.  Neither RAX nor RSP are modified between these two instructions, so `@llvm.stackrestore` has no effect in this context.  I'm not sure why you think the value of A1 changes after the stackrestore.
> 
> -----
> 
> In general, the best evidence of a miscompile is to present a program that actually produces different results based on optimizations.  Your example is reduced too far to show anything useful; the array A3 is dead after the memcpy, so its contents don't actually affect the output of the program.
I went back to the original source (which I am not at liberty to share) to get a complete example and discovered, on further testing, that the code coming out of memcpyopt is correct and it is actually the next opt pass to make a change (dsepass) that causes the mis-optimization.  I tested this using llc on the IR before and after dse.  Thanks everyone for the help and I will continue looking into the problem.  I am abandoning this revision.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D136285/new/

https://reviews.llvm.org/D136285