[PATCH] D142620: [Coroutines] Improve rematerialization stage

Tue Feb 7 08:45:32 PST 2023

dstuttard added inline comments.

================
Comment at: llvm/lib/Transforms/Coroutines/CoroFrame.cpp:324
+namespace {
+struct RematNode {
+  Instruction *Node;
----------------
ChuanqiXu wrote:
> Since `RematNode` is only used in `RematGraph`. I feel slightly better to make `RematNode ` a private class definition for `RematGraph`.
Defining the GraphTraits is harder if RematNode is a private class definition for RematGraph - but I agree there's no reason that RematNode shouldn't be a class definition in RematGraph, so I've done that.

================
Comment at: llvm/lib/Transforms/Coroutines/CoroFrame.cpp:326
+  Instruction *Node;
+  SmallVector<RematNode *> Children;
+  RematNode() = default;
----------------
ChuanqiXu wrote:
> dstuttard wrote:
> > ChuanqiXu wrote:
> > > What does `Children` mean here?
> > I just needed a reasonable name for the next nodes - they are defined as being one edge further away from the root of the graph, so seemed like a reasonable name to use.
> > Do you think something else would be better?
> I guess `operands` or `reversed_successor` or something similar to that may be better. What's more important here is that we lack a lot of comments here for `RematNode` and `RematGraph`. Otherwise the code readers can't understand what they mean.  
I think I prefer Operands, so I've changed it to that.
I've also added some more comments explaining RematGraph and RematNode.

================
Comment at: llvm/lib/Transforms/Coroutines/CoroFrame.cpp:366-372
+              for (auto &I : WorkList) {
+                if (I->Node == D) {
+                  NoMatch = false;
+                  N->Children.push_back(I.get());
+                  break;
+                }
+              }
----------------
ChuanqiXu wrote:
> dstuttard wrote:
> > sebastian-ne wrote:
> > > Maybe it makes sense to use a Set for the Worklist?
> > Maybe - are you thinking that a set would remove the need to check for duplicates? I'm not sure it makes things much better - maybe it removes the needs to iterate the worklist, I can't remember if there's a requirement to do this in order though.
> Given the graph should be a DAG, the order should not be important here. So a set here may be better.
I don't think there's much of an advantage to doing this, so I've left it as a deque for now. I'm open to being convinced that a set is better, but I'd rather not rework this if possible.

================
Comment at: llvm/lib/Transforms/Coroutines/CoroFrame.cpp:2225-2226

 // For every use of the value that is across suspend point, recreate that value
 // after a suspend point.
+static void rewriteMaterializableInstructions(
----------------
ChuanqiXu wrote:
> dstuttard wrote:
> > ChuanqiXu wrote:
> > > The comment looks not precise after we land this patch.
> > I'm not sure that the result of this patch is any different from what happened before - other than you might get more than 4 dependent instructions rematerialized.
> > What do you think needs changing here?
> The comment is `For every use of "the value"`. However, rewriteMaterializableInstructions don't have `the value` from the signature. Also the reader can't know what is `RematGraph`. So it may be hard for users to understand.
I've reworded this. Hopefully it's clearer now.

================
Comment at: llvm/lib/Transforms/Coroutines/CoroFrame.cpp:2248
+    RematGraph *RG = E.second.get();
+    ReversePostOrderTraversal<RematGraph *> RPOT(RG);
+    using rpo_iterator = ReversePostOrderTraversal<RematGraph *>::rpo_iterator;
----------------
ChuanqiXu wrote:
> dstuttard wrote:
> > ChuanqiXu wrote:
> > > jsilvanus wrote:
> > > > ChuanqiXu wrote:
> > > > > It is expensive to create ReversePostOrderTraversal. So it looks not good to construct it in a loop.
> > > > Pedantically speaking, I'm not sure constructing the `ReversePostOrderTraversal` in a loop here is an issue: It being "expensive" just means it does the graph traversal in the constructor, so its run time is linear in the size of the graph.
> > > > But here we are using it to traverse *different* graphs, all of which have been constructed before, so the runtime can be amortized into the construction of those graphs, or also into the traversal that is done later.
> > > > 
> > > > What we should not do is re-creating `ReversePostOrderTraversal` iterator objects for the same graph in a loop, because that wastes runtime.
> > > > 
> > > > Still, one might argue that constructing all those graphs with overlapping nodes, i.e. possibly multiple graphs having a node for the same `Instruction*`, is a fundamental runtime issue. Not sure if that really can become an issue?
> > > Yeah, the key point here is that how many overlapping nodes there is. Have you measured the compile-time, run time performance or memory usages? Then we can have a better feeling. For example, we can decide if we want to limit the depth of the graph then.
> > This work was done to speed up materialization. We needed a lot of rematerialization to happen, and initially just increased the number of iterations from 4 to a larger number.
> > This didn't work very well and was extremely slow - hence this re-work.
> > 
> > I haven't done timings for smaller amounts of remat, but I can do that if you think it is useful.
> > 
> > I did wonder though if limiting the depth with an option might be useful - we want as much as possible, but that's probably not true for all applications.
> > 
> > I'm not sure about the overlapping nodes actually being an issue here - I did attempt to create a test case that demonstrated this, but I'm not sure I was entirely successful (all the tests ended up with the minimum set for the instructions being rematerialized).
> > I haven't done timings for smaller amounts of remat, but I can do that if you think it is useful.
> >
> > I did wonder though if limiting the depth with an option might be useful - we want as much as possible, but that's probably not true for all applications.
> 
> I am not sure if it is a good idea to limit the depth too. I mean we need more data to make the decision. Since this change is not a pure win theoretically. There may be edge cases or everything would be fine. I am not blocking this. I just say we need more things to convince ourselves that this is a good change generally. Specially, folly may be a good choice to have a test. Or any other libraries that have a lot of  coroutines.
> 
I tried using folly to test this - not sure my methodology is sound though:

Compile clang and clang++ with/without these changes
Set up to build folly by setting:
CC=/just/built/clang
CXX=/just/built/clang++
CXXFLAGS="-std=c++20"
CCACHE_DISABLE=1 (to allow for multiple build runs)

I think that should be sufficient to enable it?

Also ran folly tests
(Verified from build output that the correct compiler is being used).

Results:
Build time WITH changes (tried it a couple of times):
Run 1
real    1m58.627s
user    39m49.991s
sys     1m27.272s

Run 2
real    1m54.844s
user    39m55.198s
sys     1m28.330s

Test time WITH changes (multiple runs):
32.28 secs
41.39 secs
34.91 secs
36.18 secs
35.94 secs

Build time WITHOUT changes (multiple runs):
Run 1
real    1m55.352s
user    39m33.938s
sys     1m25.716s

Run 2
real    1m58.287s
user    39m30.488s
sys     1m26.247s

Run 3
real    1m54.915s
user    39m31.783s
sys     1m25.420s

Test time WITHOUT changes (multiple runs):
42.23 secs
36.24 secs
41.64 secs
40.92 secs

If this is valid - then it doesn't seem to make much difference. Arguably the test time is slightly better with the changes, but there seems to be more variation run-to-run than there is with/without.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D142620/new/

https://reviews.llvm.org/D142620