[PATCH] D142620: [Coroutines] Improve rematerialization stage

Chuanqi Xu via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Feb 7 17:59:53 PST 2023


ChuanqiXu accepted this revision.
ChuanqiXu added a comment.
This revision is now accepted and ready to land.

LGTM. Thanks.



================
Comment at: llvm/lib/Transforms/Coroutines/CoroFrame.cpp:2248
+    RematGraph *RG = E.second.get();
+    ReversePostOrderTraversal<RematGraph *> RPOT(RG);
+    using rpo_iterator = ReversePostOrderTraversal<RematGraph *>::rpo_iterator;
----------------
dstuttard wrote:
> ChuanqiXu wrote:
> > dstuttard wrote:
> > > ChuanqiXu wrote:
> > > > jsilvanus wrote:
> > > > > ChuanqiXu wrote:
> > > > > > It is expensive to create a `ReversePostOrderTraversal`, so it doesn't look good to construct one in a loop.
> > > > > Pedantically speaking, I'm not sure constructing the `ReversePostOrderTraversal` in a loop here is an issue: It being "expensive" just means it does the graph traversal in the constructor, so its run time is linear in the size of the graph.
> > > > > But here we are using it to traverse *different* graphs, all of which have been constructed before, so the runtime can be amortized into the construction of those graphs, or also into the traversal that is done later.
> > > > > 
> > > > > What we should not do is re-creating `ReversePostOrderTraversal` iterator objects for the same graph in a loop, because that wastes runtime.
> > > > > 
> > > > > Still, one might argue that constructing all those graphs with overlapping nodes, i.e. possibly multiple graphs having a node for the same `Instruction*`, is a fundamental runtime issue. Not sure if that really can become an issue?
> > > > Yeah, the key point here is how many overlapping nodes there are. Have you measured the compile-time and run-time performance, or the memory usage? Then we can get a better feel for it. For example, we could then decide whether we want to limit the depth of the graph.
> > > This work was done to speed up rematerialization. We needed a lot of rematerialization to happen, and initially just increased the number of iterations from 4 to a larger number.
> > > That didn't work very well and was extremely slow - hence this rework.
> > > 
> > > I haven't done timings for smaller amounts of remat, but I can do that if you think it is useful.
> > > 
> > > I did wonder though if limiting the depth with an option might be useful - we want as much as possible, but that's probably not true for all applications.
> > > 
> > > I'm not sure about the overlapping nodes actually being an issue here - I did attempt to create a test case that demonstrated this, but I'm not sure I was entirely successful (all the tests ended up with the minimum set for the instructions being rematerialized).
> > > I haven't done timings for smaller amounts of remat, but I can do that if you think it is useful.
> > >
> > > I did wonder though if limiting the depth with an option might be useful - we want as much as possible, but that's probably not true for all applications.
> > 
> > I am not sure it is a good idea to limit the depth either. I mean, we need more data to make the decision, since this change is not theoretically a pure win. There may be edge cases, or everything may be fine. I am not blocking this; I am just saying we need more evidence to convince ourselves that this is generally a good change. Specifically, folly may be a good choice for a test, or any other library that uses a lot of coroutines.
> > 
> I tried using folly to test this - not sure my methodology is sound though:
> 
> Compile clang and clang++ with/without these changes
> Set up to build folly by setting:
> CC=/just/built/clang
> CXX=/just/built/clang++
> CXXFLAGS="-std=c++20"
> CCACHE_DISABLE=1 (to allow for multiple build runs)
> 
> I think that should be sufficient to enable it?
> 
> Also ran folly tests
> (Verified from build output that the correct compiler is being used).
> 
> Results:
> Build time WITH changes (tried it a couple of times):
> Run 1
> real    1m58.627s
> user    39m49.991s
> sys     1m27.272s
> 
> Run 2
> real    1m54.844s
> user    39m55.198s
> sys     1m28.330s
> 
> Test time WITH changes (multiple runs):
> 32.28 secs
> 41.39 secs
> 34.91 secs
> 36.18 secs
> 35.94 secs
> 
> Build time WITHOUT changes (multiple runs):
> Run 1
> real    1m55.352s
> user    39m33.938s
> sys     1m25.716s
> 
> Run 2
> real    1m58.287s
> user    39m30.488s
> sys     1m26.247s
> 
> Run 3
> real    1m54.915s
> user    39m31.783s
> sys     1m25.420s
> 
> Test time WITHOUT changes (multiple runs):
> 42.23 secs
> 36.24 secs
> 41.64 secs
> 40.92 secs
> 
> If this is valid, then it doesn't seem to make much difference. Arguably the test time is slightly better with the changes, but there seems to be more run-to-run variation than there is between the with/without configurations.
Good enough to know that this won't make things worse.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142620/new/

https://reviews.llvm.org/D142620
