[llvm-dev] Intended behavior of CGSCC pass manager.

Wed Jun 8 20:14:59 PDT 2016

On Wed, Jun 8, 2016 at 4:38 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Wed, Jun 8, 2016 at 12:31 PM, Xinliang David Li <davidxl at google.com>
> wrote:
>
>>
>>
>> On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>>>
>>> (this is partially a summary of some discussions that happened at the
>>> last LLVM bay area social, and partially a discussion about the direction
>>> of the CGSCC pass manager)
>>>
>>>
>>> A the last LLVM social we discussed the progress on the CGSCC pass
>>> manager. It seems like Chandler has a CGSCC pass manager working, but it is
>>> still unresolved exactly which semantics we want (more about this below)
>>> that are reasonably implementable.
>>>
>>> AFAICT, there has been no public discussion about what exact semantics
>>> we ultimately want to have. We should figure that out.
>>>
>>> The main difficulty which Chandler described is the apparently quite
>>> complex logic surrounding needing to run function passes nested within an
>>> SCC pass manager, while providing some guarantees about exactly what order
>>> the function passes are run. The existing CGSCC pass manager just punts on
>>> some of the problems that arise (look in CGPassManager::runOnModule,
>>> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
>>> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
>>> Chandler has been trying to solve.
>>>
>>> (
>>> Why is this "function passes inside CGSCC passes" stuff interesting?
>>> Because LLVM can do inlining on an SCC (often just a single function) and
>>> then run function passes to simplify the function(s) in the SCC before it
>>> tries to inline into a parent SCC. (the SCC visitation order is post-order)
>>> For example, we may inline a bunch of code, but after inlining we can
>>> tremendously simplify the function, and we want to do so before considering
>>> this function for inlining into its callers so that we get an accurate
>>> evaluation of the inline cost.
>>> Based on what Chandler said, it seems that LLVM is fairly unique in this
>>> regard and other compilers don't do this (which is why we can't just look
>>> at how other compilers solve this problem; they don't have this problem
>>> (maybe they should? or maybe we shouldn't?)). For example, he described
>>> that GCC uses different inlining "phases"; e.g. it does early inlining on
>>> the entire module, then does simplifications on the entire module, then
>>> does late inlining on the entire module; so it is not able to incrementally
>>> simplify as it inlines like LLVM does.
>>> )
>>>
>>> As background for what is below, the LazyCallGraph tracks two graphs:
>>> the "call graph" and the "ref graph".
>>> Conceptually, the call graph is the graph of direct calls, where
>>> indirect calls and calls to external functions do not appear (or are
>>> connected to dummy nodes). The ref graph is basically the graph of all
>>> functions transitively accessible based on the globals/constants/etc.
>>> referenced by a function (e.g. if a function `foo` references a vtable that
>>> is defined in the module, there is an edge in the ref graph from `foo` to
>>> every function in the vtable).
>>> The call graph is a strict subset of the ref graph.
>>>
>>> Chandler described that he had a major breakthrough in that the CGSCC
>>> pass manager only had to deal with 3 classes of modifications that can
>>> occur:
>>> - a pass may e.g. propagate a load of a function pointer into an
>>> indirect call, turning it into an direct call. This requires adding an edge
>>> in the CG but not in the ref graph.
>>> - a pass may take a direct call and turn it into an indirect call. This
>>> requires removing an edge from the CG, but not in the ref graph.
>>> - a pass may delete a direct call. This removes an edge in the CG and
>>> also in the ref graph.
>>>
>>> From the perspective of the CGSCC pass manager, these operations can
>>> affect the SCC structure. Adding an edge might merge SCC's and deleting an
>>> edge might split SCC's. Chandler mentioned that apparently the issues of
>>> splitting and merging SCC's within the current infrastructure are actually
>>> quite challenging and lead to e.g. iterator invalidation issues, and that
>>> is what he is working on.
>>>
>>> (
>>> The ref graph is important to guide the overall SCC visitation order
>>> because it basically represents "the largest graph that the CG may turn
>>> into due to our static analysis of this module". I.e. no transformation we
>>> can statically make in the CGSCC passes can ever cause us to need to merge
>>> SCC's in the ref graph.
>>> )
>>>
>>>
>>>
>>> I have a couple overall questions/concerns:
>>>
>>>
>>> 1. The ref graph can easily go quadratic. E.g.
>>>
>>> typedef void (*fp)();
>>> fp funcs[] = {
>>>   &foo1,
>>>   &foo2,
>>>   ...
>>>   &fooN
>>> }
>>> void foo1() { funcs[something](); }
>>> void foo2() { funcs[something](); }
>>> ...
>>> void fooN() { funcs[something](); }
>>>
>>> One real-world case where this might come about is in the presence of
>>> vtables.
>>>
>>> The existing CGSCC pass manager does not have this issue AFAIK because
>>> it does not consider the ref graph.
>>>
>>> Does anybody have any info/experience about how densely connected the
>>> ref graph can get in programs that might reasonably be fed to the compiler?
>>> I just did a quick sanity check with LLD/ELF using
>>> `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
>>> / not run out of memory. Based on some rough calculations looking at the
>>> profile, it seem like the entire run of the inliner in the old LTO pipeline
>>> (which is about 5% of total LTO time on this particular example I looked
>>> at) is only 2-3x as expensive as just
>>> `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction
>>> is certainly not free.
>>>
>>
>>
>> Conceptually, reference graph should also include  variable nodes. With
>> variable nodes introduced, the quadratic behavior mentioned can be avoided.
>>
>
> Yeah, this is what I was trying to get at with the statement "In the
> current LazyCallGraph, this would require adding some sort of notion of
> hyperedge."
> But you are right: from an implementation perspective of a call graph data
> structure that is trying to model the "ref graph" of LazyCallGraph, it is
> cleaner to have nodes that are not functions than to introduce a notion of
> hyperedge.
>
>
>
>>
>>
>>
>>
>>>
>>>
>>> 2. What is the intended behavior of CGSCC passes when SCC's are split or
>>> merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
>>> some function passes nested inside the CGSCC pass manager (e.g. to simplify
>>> things after inlining). Consider:
>>>
>>> a) These function passes are e.g. now able to devirtualize a call,
>>> adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC
>>> pass (say, the inliner) on this larger SCC?
>>>
>>> b) These function passes are e.g. able to DCE a call, removing an edge
>>> from the CG. This converts, say, a CG SCC which is a cycle graph (like
>>> a->b->c->a) into a path graph (a->b->c, with no edge back to a). The
>>> inliner had already visited a, b, and c as a single SCC. Now does it have
>>> to re-visit c, then b, then a, as single-node SCC's?
>>>
>>>
>>> btw:
>>>
>>> One way that I have found it useful to think about this is in terms of
>>> the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode
>>> in
>>> https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
>>> Inside the "strongconnect" routine when we have identified an SCC (the true
>>> branch of `if (v.lowlink = v.index)` test ) we can visit
>>> stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
>>> things on the stack (the variable `S` in the pseudocode) and we may need to
>>> fix it up (e.g. inlining deleted a function, so we can't have an entry on
>>> the stack). Then, we can run function passes as we pop individual functions
>>> off the stack, but it is easier to think about IMO than merging of SCC data
>>> structures: if we add edges to the CG then we have to do more DFS on the
>>> new edges and if we delete edges then the DFS order of the stack gives us
>>> certain guarantees.
>>> Personally I find this much easier to reason about than the description
>>> in terms of splitting and merging SCC's in the CG and ref graph (which the
>>> LazyCallGraph API makes one to think about since it hides the underlying
>>> Tarjan's algorithm).
>>> The LazyCallGraph API makes the current loop in
>>> http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
>>> very clean, but at least for my thinking about the problem, it seems like
>>> the wrong abstraction (and most of the LazyCallGraph API seems to be
>>> unused, so it seems like it may be overly heavyweight).
>>> E.g. I think that maybe the easiest thing to do is to turn the current
>>> approach inside out: instead of having the pass manager logic be the
>>> "normal code" and forcing the Tarjan algorithm to become a state machine of
>>> iterators, use an open-coded Tarjan algorithm with some callbacks and make
>>> the pass management logic be the state machine.
>>> This will also open the door to avoiding the potentially quadratic size
>>> of the ref graph, since e.g. in the example I gave above, we can mark the
>>> `funcs` array itself as already having been visited during the walk. In the
>>> current LazyCallGraph, this would require adding some sort of notion of
>>> hyperedge.
>>>
>>> Since this is such a high priority (due to blocking PGO inlining), I
>>> will probably try my hand at implementing the CGSCC pass manager sometime
>>> soon unless somebody beats me to it. (I'll probably try the "open-coded SCC
>>> visit" approach).
>>>
>>> Another possibility is implementing the new CGSCC pass manager that uses
>>> the same visitation semantics as the one in the old PM, and then we can
>>> refactor that as needed. In fact, that may be the best approach so that
>>> porting to the new PM is as NFC as possible and we can isolate the
>>> functional (i.e., need benchmarks, measurements ...) changes in separate
>>> commits.
>>>
>>>
>>
>>
>> A very high level comment: why do we need to update callgraph on the fly
>> ? Can we have a more general support of iterative SCC pass invocation?
>>
>> something like:
>>
>> 1) build the callgraph
>> 2) cache the post-order traversal order
>>
>> 3) if the order list is empty -- done
>> 4) traversal: invoke function passes for each function on the order (step
>> 2 or 5). The call graph gets updated on the fly (with new edges, or new
>> nodes for cloned functions)
>> 5) update the function traversal order from new nodes and new edges
>> created in 4)
>> 6) go to step 3).
>>
>
> (sorry for the delayed reply... this is a very poignant question / example)
>
> From the discussion with Chandler, I think he wants to provide more
> guarantees to function passes about the visitation order. He will need to
> explain his exact concerns. But IIRC the essence of one of the issues is
> captured in the example I gave in 2.b) in the OP:
>
> "These function passes are e.g. able to DCE a call, removing an edge from
> the CG. This converts, say, a CG SCC which is a cycle graph (like
> a->b->c->a) into a path graph (a->b->c, with no edge back to a)."
>
> The issue that I remember Chandler brought up is that before deleting an
> edge from an SCC, the nodes in the SCC are "unordered" w.r.t. each other;
> but after deleting an edge from the CG which splits the SCC, there is an
> order.
> In other words, that the cached post-order may no longer be a post-order
> traversal of the new graph after a function pass runs. For example, in the
> cached post-order traversal we may have entered the SCC `a->b->c->a` via an
> edge `x->b` and so our cached post-order traversal visits the functions in
> the order `a, then c, then b`. If a function pass visits `c` and deletes
> the call `c->a`, then "a, then c, then b" is not a valid post-order for
> visiting the functions.
>
> Looking at the explanation I gave, I think there isn't really a problem
> here. The cached post-order can only be invalidated for functions that have
> already been visited.
>

If we merge CGSCC based transformation passes with an IPA analysis pass
that requires strict/valid bottom up order, then updating the SCCs on the
fly might be needed (or not if the result will still be conservatively
correct) -- but  interleaving IPA analyses with IPA transformations like
this is not the right thing to do also for other reasons.

Assuming CGSCC pass just does a predefined set of transformations at
function level with some order, there is no issue of 'invalidation' -- the
cached order will always be valid. In 2.b),  why bother to revisit the
nodes?

> Speaking more broadly about the algorithm you just described, did you
> intend to omit an SCC visitation step?
>

It is independent of this decision.  However we can actually go back one
step and ask the question: is adding this layer really necessary? Why not
just traversing the cgraph nodes in reverse topo-order (after removing the
cycles)?

> The goal of the CGSCC pass manager is the ability to visit an SCC (e.g.
> inliner visits an SCC), then immediately run function passes to simplify
> the result of inlining.
>

This is how current implementation is. Is it a fundamental requirement?

> Only after the simplification has occurred do we visit the parent SCC. By
> running the simplifications in lock-step with the inliner SCC visitation we
> give parent SCC's a more accurate view of the real cost of inlining a
> function from a child SCC.
>

This works for all bottom up scheme regardless of a SCC layer is added or
not.

>
> Issues similar to 2.a) (i.e. adding edges to the CG) also affect the
> visitation order of SCC's (not just functions). For example, we visit an
> SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass
> on that SCC, which may add an edge (e.g. promote indirect call to direct
> call) that may enlarge the SCC. Do we continue running the remaining
> function passes? Do we re-visit this new enlarged SCC? (if so, when?) These
> are the types of questions that motivated this post (hence the name
> "Intended behavior of CGSCC pass manager.").
>
> yes, I understand the motivation -- that is way I propose the agorithm
that uses cached iteration order + worklist based iterative approach. In
your example, when a new edge is exposed via devirtualization, only its
caller/ancestor nodes need to be revisited in the next iteration.

David

> -- Sean Silva
>
>
>>
>> David
>>
>>
>>
>>>
>>> Sorry for the wall of text.
>>>
>>>
>>> -- Sean Silva
>>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/df7490cb/attachment.html>