[PATCH] D85614: [TRE] Reland: allow TRE for non-capturing calls.
Eli Friedman via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Oct 7 14:49:42 PDT 2020
efriedma added inline comments.
================
Comment at: llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp:600
+ CI->getArgOperand(OpndIdx),
+ /*SrcAlign*/ Alignment, Size);
+
----------------
avl wrote:
> efriedma wrote:
> > avl wrote:
> > > efriedma wrote:
> > > > avl wrote:
> > > > > efriedma wrote:
> > > > > > This is the right direction, but I'm not sure it does the right thing in general.
> > > > > >
> > > > > > In particular, it's possible that `CI->getArgOperand(OpndIdx)` points to one of the allocas created by createTempForByValOperand(); if it does, then this copy might be clobbering data you need for subsequent copies. Again, consider the `void dostuff(A arg, Arg b) { dostuff(b, a); }` case.
> > > > > >
> > > > > > This is why the general procedure I outlined has an extra step: we copy the temporary allocas to the original byval arguments, and use the byval arguments as the operands to the PHI. That way, `CI->getArgOperand(OpndIdx)` never points to one of the allocas allocated by createTempForByValOperand.
> > > > > > This is the right direction, but I'm not sure it does the right thing in general.
> > > > > >
> > > > >
> > > > > It looks like "void dostuff(A arg, Arg b) { dostuff(b, a); }" test case is working correctly:
> > > > >
> > > > > $ cat test.cpp
> > > > > #include <stdio.h>
> > > > > typedef struct A { long long x[10] = {0}; } A;
> > > > > A global;
> > > > > void dostuff(A a, A b, int i) { if (i==10) return; a.x[5]++; printf("%lld %lld\n", a.x[5], b.x[5]); dostuff(b, a, i+1); }
> > > > > __attribute((optnone)) int main() { dostuff(global, global, 0); }
> > > > > $ clang++ test.cpp -O3
> > > > > $ ./a.out
> > > > > 1 0
> > > > > 1 1
> > > > > 2 1
> > > > > 2 2
> > > > > 3 2
> > > > > 3 3
> > > > > 4 3
> > > > > 4 4
> > > > > 5 4
> > > > > 5 5
> > > > >
> > > > > > In particular, it's possible that CI->getArgOperand(OpndIdx) points to one of the allocas created by createTempForByValOperand(); if it does, then this copy might be clobbering data you need for subsequent copies. Again, consider the void dostuff(A arg, Arg b) { dostuff(b, a); } case.
> > > > > >
> > > > > > This is why the general procedure I outlined has an extra step: we copy the temporary allocas to the original byval arguments, and use the byval arguments as the operands to the PHI. That way, CI->getArgOperand(OpndIdx) never points to one of the allocas allocated by createTempForByValOperand.
> > > > >
> > > > > If I am not mistaken the CI->getArgOperand(OpndIdx) could not point to one of the allocas created by createTempForByValOperand(). There is not setArgOperand() which set operand to newly created temps by createTempForByValOperand(). Thus CI->getArgOperand(OpndIdx) always point to the alloca created for recursive call to dostuff().
> > > > >
> > > > > For the ""void dostuff(A arg, Arg b) { dostuff(b, a); }" test case:
> > > > >
> > > > >
> > > > > before tre:
> > > > >
> > > > > ```
> > > > >
> > > > > %2 = bitcast %struct.A* %agg.tmp to i8*
> > > > > %3 = bitcast %struct.A* %b to i8*
> > > > > call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %2, i8* nonnull align 8 dereferenceable(80) %3, i64 80, i1 false), !tbaa.struct !6
> > > > > ^^^^^^^^^^^^^^^^ copy to %agg.tmp which is first byval argument
> > > > > %4 = bitcast %struct.A* %agg.tmp5 to i8*
> > > > > %5 = bitcast %struct.A* %a to i8*
> > > > > call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %4, i8* nonnull align 8 dereferenceable(80) %5, i64 80, i1 false), !tbaa.struct !6
> > > > > ^^^^^^^^^^^^^^^^ copy to %agg.tmp5 which is second byval argument
> > > > > %add = add nsw i32 %i, 1
> > > > > call void @_Z7dostuff1AS_i(%struct.A* nonnull byval(%struct.A) align 8 %agg.tmp, %struct.A* nonnull byval(%struct.A) align 8 %agg.tmp5, i32 %add)
> > > > > ```
> > > > >
> > > > > after tre:
> > > > >
> > > > > ```
> > > > >
> > > > > tailrecurse: ; preds = %if.end, %entry
> > > > > %a.tr = phi %struct.A* [ %a, %entry ], [ %agg.tmp7, %if.end ]
> > > > > %b.tr = phi %struct.A* [ %b, %entry ], [ %agg.tmp58, %if.end ]
> > > > >
> > > > > if.end:
> > > > > ...
> > > > > %2 = bitcast %struct.A* %agg.tmp to i8*
> > > > > %3 = bitcast %struct.A* %b.tr to i8*
> > > > > call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %2, i8* nonnull align 8 dereferenceable(80) %3, i64 80, i1 false), !tbaa.struct !6
> > > > > ^^^^^^^^^^^^^^^^ copy to %agg.tmp which is first byval argument
> > > > > %4 = bitcast %struct.A* %agg.tmp5 to i8*
> > > > > %5 = bitcast %struct.A* %a.tr to i8*
> > > > > call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %4, i8* nonnull align 8 dereferenceable(80) %5, i64 80, i1 false), !tbaa.struct !6
> > > > > ^^^^^^^^^^^^^^^^ copy to %agg.tmp5 which is second byval argument
> > > > > %add = add nsw i32 %i.tr, 1
> > > > > %6 = bitcast %struct.A* %agg.tmp7 to i8*
> > > > > %7 = bitcast %struct.A* %agg.tmp to i8*
> > > > > call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %6, i8* align 8 %7, i64 80, i1 false)
> > > > > ^^^^^^^^^^^^^^^^ copy to the agg.tmp7 which is used as input for next iteration
> > > > > %8 = bitcast %struct.A* %agg.tmp58 to i8*
> > > > > %9 = bitcast %struct.A* %agg.tmp5 to i8*
> > > > > call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %8, i8* align 8 %9, i64 80, i1 false)
> > > > > ^^^^^^^^^^^^^^^^ copy to the agg.tmp58 which is used as input for next iteration
> > > > > br label %tailrecurse
> > > > >
> > > > > ```
> > > > >
> > > > > Probably the only difference with your description is that this patch creates additional temporarily variables(agg.tmp7/agg.tmp58) while it could use locations of a/b parameters instead of newly created temps.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > I guess if you're looking at code emitted directly by clang, you'll end up with "extra" aggregates due to the way clang does call lowering. That said, the temporary aggregates aren't actually required; optimizations like memcpyopt can remove them. For example, on master, "clang -O2 -emit-llvm" produces a call like `tail call void @_Z7dostuff1AS_i(%struct.A* nonnull byval(%struct.A) align 4 %b, %struct.A* nonnull byval(%struct.A) align 4 %a, i32 %add)`.
> > > right. But for the case "void dostuff(A arg, Arg b) { dostuff(b, a); }" these temps and memcpy are not useless and then they would not be removed by memcpyopt.
> > >
> > > Anyway, minimizing number of temps is also useful thing. Will update a patch to use function arguments locations instead of created temps.
> > Take a look at the output of `clang -O2 -emit-llvm` for the following: `typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }`. No memcpy.
> For above test case that is probably correct(since foo is a last call). But for that test case :
>
> typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { foo(b, a); foo(b,a); }
>
> define void @_Z3bar1AS_(%struct.A* byval nocapture readonly align 8, %struct.A* byval nocapture readonly align 8) local_unnamed_addr #0 {
> call void @_Z3foo1AS_(%struct.A* byval nonnull align 8 %1, %struct.A* byval nonnull align 8 %0)
> call void @_Z3foo1AS_(%struct.A* byval nonnull align 8 %1, %struct.A* byval nonnull align 8 %0)
> ret void
> }
>
> What if first foo writes at byvalue location ? In this case second call to foo() would receive incorrect arguments.
> Is it correct behavior ?
byval implicitly makes a copy of the underlying memory. If foo modifies the memory, it's modifying its own copy, not the version in bar.
Repository:
rG LLVM Github Monorepo
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D85614/new/
https://reviews.llvm.org/D85614
More information about the llvm-commits
mailing list