[PATCH] D85614: [TRE] Reland: allow TRE for non-capturing calls.

Thu Oct 8 15:07:30 PDT 2020

efriedma added inline comments.

================
Comment at: llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp:600
+                       CI->getArgOperand(OpndIdx),
+                       /*SrcAlign*/ Alignment, Size);
+
----------------
avl wrote:
> efriedma wrote:
> > avl wrote:
> > > efriedma wrote:
> > > > avl wrote:
> > > > > efriedma wrote:
> > > > > > avl wrote:
> > > > > > > efriedma wrote:
> > > > > > > > avl wrote:
> > > > > > > > > efriedma wrote:
> > > > > > > > > > This is the right direction, but I'm not sure it does the right thing in general.
> > > > > > > > > > 
> > > > > > > > > > In particular, it's possible that `CI->getArgOperand(OpndIdx)` points to one of the allocas created by createTempForByValOperand(); if it does, then this copy might be clobbering data you need for subsequent copies.  Again, consider the `void dostuff(A arg, Arg b) { dostuff(b, a); }` case.
> > > > > > > > > > 
> > > > > > > > > > This is why the general procedure I outlined has an extra step: we copy the temporary allocas to the original byval arguments, and use the byval arguments as the operands to the PHI.  That way, `CI->getArgOperand(OpndIdx)` never points to one of the allocas allocated by createTempForByValOperand.
> > > > > > > > > > This is the right direction, but I'm not sure it does the right thing in general.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > It looks like "void dostuff(A arg, Arg b) { dostuff(b, a); }" test case is working correctly:
> > > > > > > > > 
> > > > > > > > > $ cat test.cpp
> > > > > > > > > #include <stdio.h>
> > > > > > > > > typedef struct A { long long x[10] = {0}; } A;
> > > > > > > > > A global;
> > > > > > > > > void dostuff(A a, A b, int i) { if (i==10) return; a.x[5]++; printf("%lld %lld\n", a.x[5], b.x[5]); dostuff(b, a, i+1); }
> > > > > > > > > __attribute((optnone)) int main() { dostuff(global, global, 0); }
> > > > > > > > > $ clang++ test.cpp -O3 
> > > > > > > > > $ ./a.out
> > > > > > > > > 1 0
> > > > > > > > > 1 1
> > > > > > > > > 2 1
> > > > > > > > > 2 2
> > > > > > > > > 3 2
> > > > > > > > > 3 3
> > > > > > > > > 4 3
> > > > > > > > > 4 4
> > > > > > > > > 5 4
> > > > > > > > > 5 5
> > > > > > > > > 
> > > > > > > > > > In particular, it's possible that CI->getArgOperand(OpndIdx) points to one of the allocas created by createTempForByValOperand(); if it does, then this copy might be clobbering data you need for subsequent copies. Again, consider the void dostuff(A arg, Arg b) { dostuff(b, a); } case.
> > > > > > > > > > 
> > > > > > > > > > This is why the general procedure I outlined has an extra step: we copy the temporary allocas to the original byval arguments, and use the byval arguments as the operands to the PHI. That way, CI->getArgOperand(OpndIdx) never points to one of the allocas allocated by createTempForByValOperand.
> > > > > > > > > 
> > > > > > > > > If I am not mistaken the CI->getArgOperand(OpndIdx) could not point to one of the allocas created by createTempForByValOperand(). There is not setArgOperand() which set operand to newly created temps by createTempForByValOperand(). Thus CI->getArgOperand(OpndIdx) always point to the alloca created for recursive call to dostuff(). 
> > > > > > > > > 
> > > > > > > > > For the ""void dostuff(A arg, Arg b) { dostuff(b, a); }" test case:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > before tre:
> > > > > > > > > 
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > >   %2 = bitcast %struct.A* %agg.tmp to i8*
> > > > > > > > >   %3 = bitcast %struct.A* %b to i8*
> > > > > > > > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %2, i8* nonnull align 8 dereferenceable(80) %3, i64 80, i1 false), !tbaa.struct !6
> > > > > > > > >   ^^^^^^^^^^^^^^^^ copy to %agg.tmp which is first byval argument
> > > > > > > > >   %4 = bitcast %struct.A* %agg.tmp5 to i8*
> > > > > > > > >   %5 = bitcast %struct.A* %a to i8*
> > > > > > > > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %4, i8* nonnull align 8 dereferenceable(80) %5, i64 80, i1 false), !tbaa.struct !6
> > > > > > > > >   ^^^^^^^^^^^^^^^^ copy to %agg.tmp5 which is second byval argument
> > > > > > > > >   %add = add nsw i32 %i, 1
> > > > > > > > >   call void @_Z7dostuff1AS_i(%struct.A* nonnull byval(%struct.A) align 8 %agg.tmp, %struct.A* nonnull byval(%struct.A) align 8 %agg.tmp5, i32 %add)
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > after tre:
> > > > > > > > > 
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > tailrecurse:                                      ; preds = %if.end, %entry
> > > > > > > > >   %a.tr = phi %struct.A* [ %a, %entry ], [ %agg.tmp7, %if.end ]
> > > > > > > > >   %b.tr = phi %struct.A* [ %b, %entry ], [ %agg.tmp58, %if.end ]
> > > > > > > > > 
> > > > > > > > > if.end:
> > > > > > > > > ...
> > > > > > > > >   %2 = bitcast %struct.A* %agg.tmp to i8*
> > > > > > > > >   %3 = bitcast %struct.A* %b.tr to i8*
> > > > > > > > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %2, i8* nonnull align 8 dereferenceable(80) %3, i64 80, i1 false), !tbaa.struct !6
> > > > > > > > >   ^^^^^^^^^^^^^^^^ copy to %agg.tmp which is first byval argument
> > > > > > > > >   %4 = bitcast %struct.A* %agg.tmp5 to i8*
> > > > > > > > >   %5 = bitcast %struct.A* %a.tr to i8*
> > > > > > > > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %4, i8* nonnull align 8 dereferenceable(80) %5, i64 80, i1 false), !tbaa.struct !6
> > > > > > > > >   ^^^^^^^^^^^^^^^^ copy to %agg.tmp5 which is second byval argument
> > > > > > > > >   %add = add nsw i32 %i.tr, 1
> > > > > > > > >   %6 = bitcast %struct.A* %agg.tmp7 to i8*
> > > > > > > > >   %7 = bitcast %struct.A* %agg.tmp to i8*
> > > > > > > > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %6, i8* align 8 %7, i64 80, i1 false)
> > > > > > > > >   ^^^^^^^^^^^^^^^^ copy to the agg.tmp7 which is used as input for next iteration
> > > > > > > > >   %8 = bitcast %struct.A* %agg.tmp58 to i8*
> > > > > > > > >   %9 = bitcast %struct.A* %agg.tmp5 to i8*
> > > > > > > > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %8, i8* align 8 %9, i64 80, i1 false)
> > > > > > > > >   ^^^^^^^^^^^^^^^^ copy to the agg.tmp58 which is used as input for next iteration
> > > > > > > > >   br label %tailrecurse
> > > > > > > > > 
> > > > > > > > > ```
> > > > > > > > > 
> > > > > > > > > Probably the only difference with your description is that this patch creates additional temporarily variables(agg.tmp7/agg.tmp58) while it could use locations of a/b parameters instead of newly created temps. 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > I guess if you're looking at code emitted directly by clang, you'll end up with "extra" aggregates due to the way clang does call lowering.  That said, the temporary aggregates aren't actually required; optimizations like memcpyopt can remove them.  For example, on master, "clang -O2 -emit-llvm" produces a call like `tail call void @_Z7dostuff1AS_i(%struct.A* nonnull byval(%struct.A) align 4 %b, %struct.A* nonnull byval(%struct.A) align 4 %a, i32 %add)`.
> > > > > > > right. But for the case "void dostuff(A arg, Arg b) { dostuff(b, a); }" these temps and memcpy are not useless and then they would not be removed by memcpyopt.
> > > > > > > 
> > > > > > > Anyway, minimizing number of temps is also useful thing. Will update a patch to use function arguments locations instead of created temps.
> > > > > > Take a look at the output of `clang -O2 -emit-llvm` for the following: `typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }`.  No memcpy.
> > > > > For above test case that is probably correct(since foo is a last call). But for that test case :
> > > > > 
> > > > > typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { foo(b, a); foo(b,a); }
> > > > > 
> > > > > define void @_Z3bar1AS_(%struct.A* byval nocapture readonly align 8, %struct.A* byval nocapture readonly align 8) local_unnamed_addr #0 {
> > > > >   call void @_Z3foo1AS_(%struct.A* byval nonnull align 8 %1, %struct.A* byval nonnull align 8 %0)
> > > > >   call void @_Z3foo1AS_(%struct.A* byval nonnull align 8 %1, %struct.A* byval nonnull align 8 %0)
> > > > >   ret void
> > > > > }
> > > > > 
> > > > > What if first foo writes at byvalue location ? In this case second call to foo() would receive incorrect arguments.
> > > > > Is it correct behavior ?
> > > > byval implicitly makes a copy of the underlying memory. If foo modifies the memory, it's modifying its own copy, not the version in bar.
> > > looks like I lost a bit. That : "byval implicitly makes a copy of the underlying memory" is understood. So for that test case "typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }" -mllvm -print-after-all shows : 
> > > 
> > > 
> > > ```
> > >   %agg.tmp = alloca %struct.A, align 8
> > >   %agg.tmp1 = alloca %struct.A, align 8
> > >   %0 = bitcast %struct.A* %agg.tmp to i8*
> > >   %1 = bitcast %struct.A* %b to i8*
> > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %0, i8* align 8 %1, i64 400, i1 false), !tbaa.struct !2
> > >   %2 = bitcast %struct.A* %agg.tmp1 to i8*
> > >   %3 = bitcast %struct.A* %a to i8*
> > >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %2, i8* align 8 %3, i64 400, i1 false), !tbaa.struct !2
> > >   call void @_Z3foo1AS_(%struct.A* byval(%struct.A) align 8 %agg.tmp, %struct.A* byval(%struct.A) align 8 %agg.tmp
> > > 
> > > ```
> > > 
> > > Why it is important that output of "clang -O2 -emit-llvm" for the same test case does not contain calls to memcpy ?
> > I'm not completely sure what you're asking.
> > 
> > In general, we expect that the TRE pass should be able to handle arbitrary LLVM IR, even if clang with the standard pass pipeline can't generate IR with that pattern.  So TRE needs to be able to handle the form where the byval argument is passed directly to a tail call.
> >In general, we expect that the TRE pass should be able to handle arbitrary LLVM IR, even if clang with the standard pass pipeline can't generate IR with that pattern. So TRE needs to be able to handle the form where the byval argument is passed directly to a tail call.
> 
> The IR for the testcase "typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }"  before TRE loos like this currently :
> 
> 
> ```
>   %0 = bitcast %struct.A* %agg.tmp to i8*
>   %1 = bitcast %struct.A* %b to i8*
>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(400) %0, i8* nonnull align 8 dereferenceable(400) %1, i64 400, i1 false), !tbaa.struct !2
>   %2 = bitcast %struct.A* %agg.tmp1 to i8*
>   %3 = bitcast %struct.A* %a to i8*
>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(400) %2, i8* nonnull align 8 dereferenceable(400) %3, i64 400, i1 false), !tbaa.struct !2
>   tail call void @_Z3foo1AS_(%struct.A* nonnull byval(%struct.A) align 8 %agg.tmp, %struct.A* nonnull byval(%struct.A) align 8 %agg.tmp1)
> 
> ```
> Did I correctly understand that TRE should also correctly work if above memcpy calls are not generated ?
Right, it should still work if the memcpy calls are not generated, or optimized out.  For example, we shouldn't miscompile if someone writes `clang foo.c -o - -S -emit-llvm -O2 -mllvm -disable-llvm-optzns | opt -S -sroa -memcpyopt -instcombine -tailcallelim`.

Usually, memcpyopt doesn't run before tailcallelim, but allowing users to specify arbitrary optimizations in any order is part of what makes LLVM IR transforms flexible.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D85614/new/

https://reviews.llvm.org/D85614