[llvm-dev] Intended behavior of CGSCC pass manager.

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Wed Jun 8 04:19:03 PDT 2016


Hi Chandler, Philip, Mehdi, (and llvm-dev,)

(this is partially a summary of some discussions that happened at the last
LLVM bay area social, and partially a discussion about the direction of the
CGSCC pass manager)


At the last LLVM social we discussed the progress on the CGSCC pass manager.
It seems like Chandler has a CGSCC pass manager working, but it is still
unresolved exactly which semantics we want (more on this below) and which of
those are reasonably implementable.

AFAICT, there has been no public discussion about what exact semantics we
ultimately want to have. We should figure that out.

The main difficulty Chandler described is the apparently quite complex logic
around running function passes nested within an SCC pass manager, while
providing some guarantees about exactly what order the function passes are
run in. The existing CGSCC pass manager just punts on
some of the problems that arise (look in CGPassManager::runOnModule,
CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
Chandler has been trying to solve.

(
Why is this "function passes inside CGSCC passes" stuff interesting?
Because LLVM can do inlining on an SCC (often just a single function) and
then run function passes to simplify the function(s) in the SCC before it
tries to inline into a parent SCC. (the SCC visitation order is post-order)
For example, we may inline a bunch of code, but after inlining we can
tremendously simplify the function, and we want to do so before considering
this function for inlining into its callers so that we get an accurate
evaluation of the inline cost.
Based on what Chandler said, it seems that LLVM is fairly unique in this
regard and other compilers don't do this (which is why we can't just look
at how other compilers solve this problem; they don't have this problem
(maybe they should? or maybe we shouldn't?)). For example, he described
that GCC uses different inlining "phases"; e.g. it does early inlining on
the entire module, then does simplifications on the entire module, then
does late inlining on the entire module; so it is not able to incrementally
simplify as it inlines like LLVM does.
)
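
To make the inline-then-simplify point above concrete, here is a tiny made-up
C++ example (names are invented for illustration):

int expensive_slow_path(int x); // hypothetical helper, defined elsewhere

static int classify(int x) {
  if (x < 0)
    return expensive_slow_path(x);
  return 0;
}

int user(int y) {
  // After inlining `classify` here, the x < 0 branch folds away and `user`
  // collapses to `return y;`. Its inline cost as seen by *its* callers is
  // now tiny, which is what the post-order inline-then-simplify scheme is
  // trying to capture before those callers are visited.
  return classify(42) + y;
}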

As background for what is below, the LazyCallGraph tracks two graphs: the
"call graph" and the "ref graph".
Conceptually, the call graph is the graph of direct calls, where indirect
calls and calls to external functions do not appear (or are connected to
dummy nodes). The ref graph is basically the graph of all functions
transitively accessible based on the globals/constants/etc. referenced by a
function (e.g. if a function `foo` references a vtable that is defined in
the module, there is an edge in the ref graph from `foo` to every function
in the vtable).
The call graph is a subgraph of the ref graph (every call edge is also a ref
edge).
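
A small made-up C++ example of the distinction (source level rather than IR,
for brevity):

typedef void (*fp)();
void bar();
void baz1();
void baz2();
fp table[] = {&baz1, &baz2}; // plays the role of the vtable above

fp foo(unsigned i) {
  bar();            // direct call: edge foo -> bar in both the call graph
                    // and the ref graph
  return table[i];  // no call here at all, but foo references `table`,
                    // which references baz1 and baz2, so the ref graph
                    // gets edges foo -> baz1 and foo -> baz2
}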

Chandler described that he had a major breakthrough in that the CGSCC pass
manager only had to deal with 3 classes of modifications that can occur:
- a pass may e.g. propagate a load of a function pointer into an indirect
call, turning it into a direct call. This requires adding an edge in the
CG but not in the ref graph (see the sketch after this list).
- a pass may take a direct call and turn it into an indirect call. This
requires removing an edge from the CG, but not in the ref graph.
- a pass may delete a direct call. This removes an edge in the CG and also
in the ref graph.
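
For the first class, a made-up source-level sketch of the kind of situation
I mean:

void even(unsigned n);
void odd(unsigned n);
void (*const evenp)(unsigned) = &even;

void even(unsigned n) { if (n) odd(n - 1); }
// A function pass that folds the load of the constant `evenp` turns the
// indirect call below into the direct call even(n - 1). That adds the CG
// edge odd -> even (merging {even} and {odd} into one CG SCC), but the ref
// edge odd -> even was already there via the reference to `evenp`.
void odd(unsigned n) { if (n) evenp(n - 1); }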

From the perspective of the CGSCC pass manager, these operations can affect
the SCC structure. Adding an edge might merge SCC's and deleting an edge
might split SCC's. Chandler mentioned that apparently the issues of
splitting and merging SCC's within the current infrastructure are actually
quite challenging and lead to e.g. iterator invalidation issues, and that
is what he is working on.

(
The ref graph is important to guide the overall SCC visitation order
because it basically represents "the largest graph that the CG may turn
into due to our static analysis of this module". I.e. no transformation we
can statically make in the CGSCC passes can ever cause us to need to merge
SCC's in the ref graph.
)



I have a couple overall questions/concerns:


1. The ref graph can easily go quadratic. E.g.

typedef void (*fp)();
void foo1();
void foo2();
...
void fooN();

fp funcs[] = {
  &foo1,
  &foo2,
  ...
  &fooN
};

void foo1() { funcs[something](); }
void foo2() { funcs[something](); }
...
void fooN() { funcs[something](); }

Here each of the N functions references `funcs`, and `funcs` in turn
references all N functions, so the ref graph ends up with on the order of N^2
edges. One real-world case where this might come about is in the presence of
vtables.

The existing CGSCC pass manager does not have this issue AFAIK because it
does not consider the ref graph.

Does anybody have any info/experience about how densely connected the ref
graph can get in programs that might reasonably be fed to the compiler?
I just did a quick sanity check with LLD/ELF using
`--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
/ not run out of memory. Based on some rough calculations looking at the
profile, it seems like the entire run of the inliner in the old LTO pipeline
(which is about 5% of total LTO time on this particular example I looked
at) is only 2-3x as expensive as just
`--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction
is certainly not free.


2. What is the intended behavior of CGSCC passes when SCC's are split or
merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
some function passes nested inside the CGSCC pass manager (e.g. to simplify
things after inlining). Consider:

a) These function passes are e.g. now able to devirtualize a call, adding
an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass
(say, the inliner) on this larger SCC?
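
As a made-up illustration of (a), with devirtualization standing in for the
function passes adding a CG edge:

struct Iface {
  virtual void step(int) = 0;
  virtual ~Iface() = default;
};

void run(Iface *I, int n) { if (n) I->step(n - 1); } // indirect: no CG edge

struct Looper : Iface {
  void step(int n) override { run(this, n); } // direct call: CG edge to run
};

If a function pass later proves that `I` in `run` is always a `Looper` and
devirtualizes the call, the new CG edge run -> Looper::step merges the two
functions into one larger CG SCC; the question is whether the inliner is then
re-run on that merged SCC.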

b) These function passes are e.g. able to DCE a call, removing an edge from
the CG. This converts, say, a CG SCC which is a cycle graph (like
a->b->c->a) into a path graph (a->b->c, with no edge back to a). The
inliner had already visited a, b, and c as a single SCC. Now does it have
to re-visit c, then b, then a, as single-node SCC's?
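
And a correspondingly made-up illustration of (b):

void a(int n);
void b(int n);
void c(int n);

void a(int n) { b(n); }
void b(int n) { c(n); }
void c(int n) {
  if (n > 0)
    a(n - 1); // if a function pass can prove this call dead and delete it,
              // the back edge c -> a disappears and the single SCC {a, b, c}
              // decomposes into the path a -> b -> c
}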


btw:

One way that I have found it useful to think about this is in terms of the
visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in
https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
Inside the "strongconnect" routine when we have identified an SCC (the true
branch of `if (v.lowlink = v.index)` test ) we can visit
stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
things on the stack (the variable `S` in the pseudocode) and we may need to
fix it up (e.g. inlining deleted a function, so we can't have an entry on
the stack). Then, we can run function passes as we pop individual functions
off the stack, but it is easier to think about IMO than merging of SCC data
structures: if we add edges to the CG then we have to do more DFS on the
new edges and if we delete edges then the DFS order of the stack gives us
certain guarantees.
Personally I find this much easier to reason about than the description in
terms of splitting and merging SCC's in the CG and ref graph (which the
LazyCallGraph API makes one to think about since it hides the underlying
Tarjan's algorithm).
The LazyCallGraph API makes the current loop in
http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
very clean, but at least for my thinking about the problem, it seems like
the wrong abstraction (and most of the LazyCallGraph API appears to be
unused, so it may be overly heavyweight).
E.g. I think that maybe the easiest thing to do is to turn the current
approach inside out: instead of having the pass manager logic be the
"normal code" and forcing the Tarjan algorithm to become a state machine of
iterators, use an open-coded Tarjan algorithm with some callbacks and make
the pass management logic be the state machine.
This will also open the door to avoiding the potentially quadratic size of
the ref graph, since e.g. in the example I gave above, we can mark the
`funcs` array itself as already having been visited during the walk. In the
current LazyCallGraph, this would require adding some sort of notion of
hyperedge.
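
To make the "open-coded SCC visit" idea a bit more concrete, here is a very
rough sketch (all names are made up; this is not the LazyCallGraph API, and it
ignores recursion depth and invalidation issues) of Tarjan's algorithm with a
callback per SCC, so that the pass-management logic lives in the callbacks
rather than in an iterator-based state machine:

#include <algorithm>
#include <functional>
#include <unordered_map>
#include <vector>

struct Function; // stand-in for llvm::Function

class SCCVisit {
public:
  // Current direct-call successors of a function; CGSCC/function passes
  // could refine what this returns as the walk proceeds.
  std::function<std::vector<Function *>(Function *)> Successors;
  // Called once per SCC, in bottom-up (post-) order. This is where the
  // CGSCC passes and the nested function passes would run.
  std::function<void(const std::vector<Function *> &)> OnSCC;

  void visit(Function *Root) {
    if (!Info.count(Root))
      strongConnect(Root);
  }

private:
  struct NodeInfo { int Index = 0, LowLink = 0; bool OnStack = false; };
  std::unordered_map<Function *, NodeInfo> Info;
  std::vector<Function *> Stack;
  int NextIndex = 0;

  void strongConnect(Function *F) {
    NodeInfo &FI = Info[F]; // references into unordered_map stay valid
    FI.Index = FI.LowLink = NextIndex++;
    FI.OnStack = true;
    Stack.push_back(F);

    for (Function *Succ : Successors(F)) {
      auto It = Info.find(Succ);
      if (It == Info.end()) {
        strongConnect(Succ);
        FI.LowLink = std::min(FI.LowLink, Info[Succ].LowLink);
      } else if (It->second.OnStack) {
        FI.LowLink = std::min(FI.LowLink, It->second.Index);
      }
    }

    // F is the root of an SCC: pop everything down to and including F.
    if (FI.LowLink == FI.Index) {
      std::vector<Function *> SCC;
      Function *Member = nullptr;
      do {
        Member = Stack.back();
        Stack.pop_back();
        Info[Member].OnStack = false;
        SCC.push_back(Member);
      } while (Member != F);
      OnSCC(SCC);
    }
  }
};

With this shape, "do more DFS for a newly added edge" and "run function passes
as individual functions pop off the stack" become ordinary control flow inside
strongConnect, rather than state that has to be threaded through iterators.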

Since this is such a high priority (due to blocking PGO inlining), I will
probably try my hand at implementing the CGSCC pass manager sometime soon
unless somebody beats me to it. (I'll probably try the "open-coded SCC
visit" approach).

Another possibility is to implement the new CGSCC pass manager with the same
visitation semantics as the one in the old PM, and then refactor as needed.
In fact, that may be the best approach, so that porting to the new PM is as
close to NFC as possible and the functional changes (i.e., those needing
benchmarks, measurements, ...) are isolated in separate commits.


Sorry for the wall of text.


-- Sean Silva