[PATCH] D64088: [DDG] DirectedGraph as a base class for various dependence graphs such as DDG and PDG.

Michael Kruse via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue Jul 9 15:40:10 PDT 2019


Meinersbur added a comment.

In D64088#1577050 <https://reviews.llvm.org/D64088#1577050>, @bmahjour wrote:

> In D64088#1574555 <https://reviews.llvm.org/D64088#1574555>, @Meinersbur wrote:
>
> > To reduce the number of allocations, have you thought about making EdgeList contain the edges objects instead of pointers? The edges would have to be copied/moved into the list and edges could not be compared by identity. Is this semantic needed/are edge objects large?
> >
> > Since the edge class does not contain the source node, the same edge object could be put into multiple outgoing edges lists. Is this supported?
>
>
> I think we would be better off using pointers in this case, because of the following reasons:
>
> 1. Using pointers gives the clients freedom to use polymorphic behavior.


I don't see why you would go the route of implementing compile-time template polymorphism, but then introduce vtables in the derived classes.
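For illustration, something along these lines (all names here are made up, not the patch's API) shows how a derived edge can expose its dependence kind without any virtual dispatch, since the graph is already parameterized on the concrete edge type:

    #include <cstdint>

    // Hypothetical base, parameterized on the concrete node and edge types;
    // the graph code only ever manipulates EdgeT, so no virtual hook is needed.
    template <class NodeT, class EdgeT> class EdgeBaseSketch {
    public:
      explicit EdgeBaseSketch(NodeT &Target) : Target(&Target) {}
      NodeT &getTargetNode() const { return *Target; }

    private:
      NodeT *Target;
    };

    struct NodeSketch {};

    // A derived edge can carry its kind in a plain enum and be queried
    // without a vtable.
    class EdgeSketch : public EdgeBaseSketch<NodeSketch, EdgeSketch> {
    public:
      enum class Kind : uint8_t { RegisterDefUse, MemoryDependence };
      EdgeSketch(NodeSketch &Target, Kind K) : EdgeBaseSketch(Target), K(K) {}
      Kind getKind() const { return K; }

    private:
      Kind K;
    };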

> 2. Using pointers avoids the copy and moves you mentioned. Given that this class is intended to be extended by client code, it's probably better not to assume that the copy/moves will always be cheap.

Depends on the size of the edge objects. I'd expect them to be small, maybe just an enum with the dependence kind. Those are cheap to copy compared to a heap allocation for each edge (cf. pass-by-value vs. pass-by-const-reference). Even if an edge object is large, it can still implement move ctors/assignment operators and a pimpl idiom.
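To make the cost argument concrete, a rough sketch (assuming, which may not hold for the real DDG edge, that an edge is little more than a target pointer plus a kind tag):

    #include "llvm/ADT/SmallVector.h"

    struct Node; // only the address is needed here

    // Assumed small edge: one pointer plus one byte of payload.
    struct Edge {
      Node *Target;
      enum class Kind : unsigned char { Flow, Anti, Output } K;
    };

    // Edges stored by value: no per-edge heap allocation, appending is a
    // trivial copy, and iteration touches contiguous memory.
    using EdgeListByValue = llvm::SmallVector<Edge, 4>;

    // Edges stored by pointer: every new edge pays a separate allocation and
    // every traversal pays a pointer chase.
    using EdgeListByPointer = llvm::SmallVector<Edge *, 4>;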

> 3. Using pointers allow us to do the edge optimization you mentioned. This is currently not being done, it's something we can look into if memory usage becomes an issue.

I do not recommend this, as it raises the complexity significantly: changing one edge's property might change edges coming from other nodes as well. Only edges with the same target node could even benefit from it.

There might be reasons to prefer allocated edges (such as having an identity), but it depends on how they are used.

This graph is only walkable in one direction. Are you sure you don't need the other direction as well to answer queries such as "which statements must be executed before the current statement" in addition to "which statements can only execute after the current statement"?
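If the reverse direction turns out to be needed, one possibility (hypothetical names, just to illustrate the shape) is to let each node also record its predecessors, since the edge class under review stores only its target:

    #include "llvm/ADT/ArrayRef.h"
    #include "llvm/ADT/SmallVector.h"

    // Hypothetical node layout supporting queries in both directions.
    template <class NodeT, class EdgeT> struct BidirNodeSketch {
      llvm::SmallVector<EdgeT *, 4> Edges;        // outgoing, as in the patch
      llvm::SmallVector<NodeT *, 4> Predecessors; // kept in sync when connecting

      // "Which statements must be executed before the current statement?"
      llvm::ArrayRef<NodeT *> predecessors() const { return Predecessors; }
    };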



================
Comment at: llvm/include/llvm/ADT/DirectedGraph.h:118-120
+    if (llvm::find_if(Edges, [&E](const EdgeType *Edge) {
+          return *Edge == E;
+        }) != Edges.end())
----------------
fhahn wrote:
> bmahjour wrote:
> > Meinersbur wrote:
> > > This makes me worry about scaling. Is this method used often? How large is the worst practical edge list? Did you think about using `SetVector`?
> > > 
> > > For the sake of avoiding premature optimization, we might worry about it when it becomes a problem.
> > This is a good question. The method is used fairly frequently when building the graph. I do not have comprehensive stats on the number of nodes and edges and their ratios when building large applications. The space complexity of the DDG depends on a number of factors such as:
> > 
> > 1. The number of instructions being analyzed. For example, if DDG is run as a function pass it is more likely to result in a larger number of nodes (and consequently edges) than if it is run as a loop pass. Of course this also depends on the size of the functions and loop bodies.
> > 2. The quality of the dependence analysis. If dependence analysis gives us pessimistic results we end up creating more edges between nodes.
> > 3. How well we are able to simplify the graph. Sometimes it's possible to collapse an edge and merge two nodes together. The more we can simplify, the fewer nodes and edges we will have.
> > 
> > Using SetVector will likely help with the compile-time performance, but it comes with a memory trade-off. My preliminary tests suggest that time-complexity may be more of an issue than memory consumption, so it may be a good trade-off.
> > 
> > I agree it's better not to do premature optimizations at this point, and instead consider such improvements when more comprehensive stats become available.
> >This is a good question. The method is used fairly frequently when building the graph. I do not have comprehensive stats on the number of nodes and edges and their ratios when building large applications. The space complexity of the DDG depends on a number of factors such as:
> > 
> > The number of instructions being analyzed. For example, if DDG is run as a function pass it is more likely to result in a larger number of nodes (and consequently edges) than if it is run as a loop pass. Of course this also depends on the size of the functions and loop bodies.
> 
> From experience, people pass all kinds of crazy code to LLVM, and functions as well as loop nests can be huge.
> 
> AFAIU you are planning to use this to also represent def-use dependencies? Have you considered reusing the information already present in the IR and having the DDG just integrate it as an overlay?
> 
> > The quality of the dependence analysis. If dependence analysis gives us pessimistic results we end up creating more edges between nodes.
> > How well we are able to simplify the graph. Sometimes it's possible to collapse an edge and merge two nodes together. The more we can simplify, the fewer nodes and edges we will have.
> > Using SetVector will likely help with the compile-time performance, but it comes with a memory trade-off. My preliminary tests suggest that time-complexity may be more of an issue than memory consumption, so it may be a good trade-off.
> 
> Intuitively I agree that compile-time will be a bigger issue than memory usage, especially as we only need to keep the DDG of one loop nest/function around at a time. IIRC SetVector 'just' uses roughly twice as much memory.
> 
> > 
> > I agree it's better not to do premature optimizations at this point, and instead consider such improvements when more comprehensive stats become available.
> 
> I think it is definitely worth discussing/thinking about what the right data structure is here to start with. Compile-time problems, especially the edge cases, tend to appear a while after the patches land in master, once they have made it into a range of production compilers and are used to compile very large code bases. At that point it is hard to quickly fix the issue.
It depends on how the graph is built: if transitive dependencies are added explicitly, there can be a lot of edges in the graph.

Could we delegate the responsibility of adding an edge only once to the caller? Very often, by construction (a newly created edge), a duplicate is not even possible. Instead, check it in an assert.
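Roughly what I have in mind (hypothetical names, not the patch's interface): the caller guarantees uniqueness, and the node only verifies it when assertions are enabled, so release builds do not pay for a linear scan on every insertion:

    #include "llvm/ADT/STLExtras.h"
    #include "llvm/ADT/SmallVector.h"
    #include <cassert>

    template <class EdgeType> class NodeSketch {
    public:
      // Caller promises that E (or an equal edge) has not been added before.
      void addEdge(EdgeType &E) {
        assert(llvm::find_if(Edges, [&E](const EdgeType *Edge) {
                 return *Edge == E;
               }) == Edges.end() &&
               "Equal edge was already added to this node");
        Edges.push_back(&E);
      }

    private:
      llvm::SmallVector<EdgeType *, 4> Edges;
    };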


================
Comment at: llvm/include/llvm/ADT/DirectedGraph.h:287
+    for (NodeType *N : Nodes)
+      N->clear();
+    Nodes.clear();
----------------
bmahjour wrote:
> Meinersbur wrote:
> > Is clearing each individual node necessary? I wouldn't expect them to be re-used.
> Not sure if I understand your question... we need to clear each node so that its outgoing edges are removed.
But there are no nodes in the graph after `clear()`; it is irrelevant what the previous nodes store, since they are not part of the graph anymore.
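In other words (hypothetical names), something like the following would be sufficient, since the graph does not own the nodes and never looks at them again:

    #include "llvm/ADT/SmallVector.h"

    template <class NodeType> class GraphSketch {
    public:
      // Dropping the node list is enough; resetting each node's edge list
      // would only matter if the node objects were reused afterwards.
      void clear() { Nodes.clear(); }

    private:
      llvm::SmallVector<NodeType *, 10> Nodes;
    };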


================
Comment at: llvm/include/llvm/ADT/DirectedGraph.h:173
+public:
+  using NodeListTy = SmallVector<NodeType *, 10>;
+  using EdgeListTy = typename NodeType::EdgeListTy;
----------------
The inline capacity of the node list (and, even better, the container implementation itself) is an implementation detail and should not be public.
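One way to do that (sketch only, names are made up): keep the typedef private and expose just iterators or a range, so the inline capacity and the container choice can change without touching clients:

    #include "llvm/ADT/SmallVector.h"
    #include "llvm/ADT/iterator_range.h"

    template <class NodeType> class GraphSketch {
      using NodeListTy = llvm::SmallVector<NodeType *, 10>; // implementation detail

    public:
      using iterator = typename NodeListTy::iterator;
      using const_iterator = typename NodeListTy::const_iterator;

      iterator begin() { return Nodes.begin(); }
      iterator end() { return Nodes.end(); }
      llvm::iterator_range<const_iterator> nodes() const {
        return llvm::make_range(Nodes.begin(), Nodes.end());
      }

    private:
      NodeListTy Nodes;
    };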


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64088/new/

https://reviews.llvm.org/D64088




