[llvm-dev] Intended behavior of CGSCC pass manager.
Sean Silva via llvm-dev
llvm-dev at lists.llvm.org
Thu Jun 16 04:48:30 PDT 2016
On Tue, Jun 14, 2016 at 11:29 PM, Xinliang David Li <davidxl at google.com>
wrote:
>
>
> On Thu, Jun 9, 2016 at 3:26 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
>>
>>
>> On Wed, Jun 8, 2016 at 8:14 PM, Xinliang David Li <davidxl at google.com>
>> wrote:
>>
>>>
>>>
>>> On Wed, Jun 8, 2016 at 4:38 PM, Sean Silva <chisophugis at gmail.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Wed, Jun 8, 2016 at 12:31 PM, Xinliang David Li <davidxl at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 8, 2016 at 4:19 AM, Sean Silva <chisophugis at gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>>>>>>
>>>>>> (this is partially a summary of some discussions that happened at the
>>>>>> last LLVM bay area social, and partially a discussion about the direction
>>>>>> of the CGSCC pass manager)
>>>>>>
>>>>>>
>>>>>> A the last LLVM social we discussed the progress on the CGSCC pass
>>>>>> manager. It seems like Chandler has a CGSCC pass manager working, but it is
>>>>>> still unresolved exactly which semantics we want (more about this below)
>>>>>> that are reasonably implementable.
>>>>>>
>>>>>> AFAICT, there has been no public discussion about what exact
>>>>>> semantics we ultimately want to have. We should figure that out.
>>>>>>
>>>>>> The main difficulty which Chandler described is the apparently quite
>>>>>> complex logic surrounding needing to run function passes nested within an
>>>>>> SCC pass manager, while providing some guarantees about exactly what order
>>>>>> the function passes are run. The existing CGSCC pass manager just punts on
>>>>>> some of the problems that arise (look in CGPassManager::runOnModule,
>>>>>> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
>>>>>> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
>>>>>> Chandler has been trying to solve.
>>>>>>
>>>>>> (
>>>>>> Why is this "function passes inside CGSCC passes" stuff interesting?
>>>>>> Because LLVM can do inlining on an SCC (often just a single function) and
>>>>>> then run function passes to simplify the function(s) in the SCC before it
>>>>>> tries to inline into a parent SCC. (the SCC visitation order is post-order)
>>>>>> For example, we may inline a bunch of code, but after inlining we can
>>>>>> tremendously simplify the function, and we want to do so before considering
>>>>>> this function for inlining into its callers so that we get an accurate
>>>>>> evaluation of the inline cost.
>>>>>> Based on what Chandler said, it seems that LLVM is fairly unique in
>>>>>> this regard and other compilers don't do this (which is why we can't just
>>>>>> look at how other compilers solve this problem; they don't have this
>>>>>> problem (maybe they should? or maybe we shouldn't?)). For example, he
>>>>>> described that GCC uses different inlining "phases"; e.g. it does early
>>>>>> inlining on the entire module, then does simplifications on the entire
>>>>>> module, then does late inlining on the entire module; so it is not able to
>>>>>> incrementally simplify as it inlines like LLVM does.
>>>>>> )
>>>>>>
>>>>>> As background for what is below, the LazyCallGraph tracks two graphs:
>>>>>> the "call graph" and the "ref graph".
>>>>>> Conceptually, the call graph is the graph of direct calls, where
>>>>>> indirect calls and calls to external functions do not appear (or are
>>>>>> connected to dummy nodes). The ref graph is basically the graph of all
>>>>>> functions transitively accessible based on the globals/constants/etc.
>>>>>> referenced by a function (e.g. if a function `foo` references a vtable that
>>>>>> is defined in the module, there is an edge in the ref graph from `foo` to
>>>>>> every function in the vtable).
>>>>>> The call graph is a strict subset of the ref graph.
>>>>>>
>>>>>> Chandler described that he had a major breakthrough in that the CGSCC
>>>>>> pass manager only had to deal with 3 classes of modifications that can
>>>>>> occur:
>>>>>> - a pass may e.g. propagate a load of a function pointer into an
>>>>>> indirect call, turning it into an direct call. This requires adding an edge
>>>>>> in the CG but not in the ref graph.
>>>>>> - a pass may take a direct call and turn it into an indirect call.
>>>>>> This requires removing an edge from the CG, but not in the ref graph.
>>>>>> - a pass may delete a direct call. This removes an edge in the CG and
>>>>>> also in the ref graph.
>>>>>>
>>>>>> From the perspective of the CGSCC pass manager, these operations can
>>>>>> affect the SCC structure. Adding an edge might merge SCC's and deleting an
>>>>>> edge might split SCC's. Chandler mentioned that apparently the issues of
>>>>>> splitting and merging SCC's within the current infrastructure are actually
>>>>>> quite challenging and lead to e.g. iterator invalidation issues, and that
>>>>>> is what he is working on.
>>>>>>
>>>>>> (
>>>>>> The ref graph is important to guide the overall SCC visitation order
>>>>>> because it basically represents "the largest graph that the CG may turn
>>>>>> into due to our static analysis of this module". I.e. no transformation we
>>>>>> can statically make in the CGSCC passes can ever cause us to need to merge
>>>>>> SCC's in the ref graph.
>>>>>> )
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have a couple overall questions/concerns:
>>>>>>
>>>>>>
>>>>>> 1. The ref graph can easily go quadratic. E.g.
>>>>>>
>>>>>> typedef void (*fp)();
>>>>>> fp funcs[] = {
>>>>>> &foo1,
>>>>>> &foo2,
>>>>>> ...
>>>>>> &fooN
>>>>>> }
>>>>>> void foo1() { funcs[something](); }
>>>>>> void foo2() { funcs[something](); }
>>>>>> ...
>>>>>> void fooN() { funcs[something](); }
>>>>>>
>>>>>> One real-world case where this might come about is in the presence of
>>>>>> vtables.
>>>>>>
>>>>>> The existing CGSCC pass manager does not have this issue AFAIK
>>>>>> because it does not consider the ref graph.
>>>>>>
>>>>>> Does anybody have any info/experience about how densely connected the
>>>>>> ref graph can get in programs that might reasonably be fed to the compiler?
>>>>>> I just did a quick sanity check with LLD/ELF using
>>>>>> `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
>>>>>> / not run out of memory. Based on some rough calculations looking at the
>>>>>> profile, it seem like the entire run of the inliner in the old LTO pipeline
>>>>>> (which is about 5% of total LTO time on this particular example I looked
>>>>>> at) is only 2-3x as expensive as just
>>>>>> `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction
>>>>>> is certainly not free.
>>>>>>
>>>>>
>>>>>
>>>>> Conceptually, reference graph should also include variable nodes.
>>>>> With variable nodes introduced, the quadratic behavior mentioned can be
>>>>> avoided.
>>>>>
>>>>
>>>> Yeah, this is what I was trying to get at with the statement "In the
>>>> current LazyCallGraph, this would require adding some sort of notion of
>>>> hyperedge."
>>>> But you are right: from an implementation perspective of a call graph
>>>> data structure that is trying to model the "ref graph" of LazyCallGraph, it
>>>> is cleaner to have nodes that are not functions than to introduce a notion
>>>> of hyperedge.
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> 2. What is the intended behavior of CGSCC passes when SCC's are split
>>>>>> or merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
>>>>>> some function passes nested inside the CGSCC pass manager (e.g. to simplify
>>>>>> things after inlining). Consider:
>>>>>>
>>>>>> a) These function passes are e.g. now able to devirtualize a call,
>>>>>> adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC
>>>>>> pass (say, the inliner) on this larger SCC?
>>>>>>
>>>>>> b) These function passes are e.g. able to DCE a call, removing an
>>>>>> edge from the CG. This converts, say, a CG SCC which is a cycle graph (like
>>>>>> a->b->c->a) into a path graph (a->b->c, with no edge back to a). The
>>>>>> inliner had already visited a, b, and c as a single SCC. Now does it have
>>>>>> to re-visit c, then b, then a, as single-node SCC's?
>>>>>>
>>>>>>
>>>>>> btw:
>>>>>>
>>>>>> One way that I have found it useful to think about this is in terms
>>>>>> of the visitation during Tarjan's SCC algorithm. I'll reference the
>>>>>> pseudocode in
>>>>>> https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
>>>>>> Inside the "strongconnect" routine when we have identified an SCC (the true
>>>>>> branch of `if (v.lowlink = v.index)` test ) we can visit
>>>>>> stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
>>>>>> things on the stack (the variable `S` in the pseudocode) and we may need to
>>>>>> fix it up (e.g. inlining deleted a function, so we can't have an entry on
>>>>>> the stack). Then, we can run function passes as we pop individual functions
>>>>>> off the stack, but it is easier to think about IMO than merging of SCC data
>>>>>> structures: if we add edges to the CG then we have to do more DFS on the
>>>>>> new edges and if we delete edges then the DFS order of the stack gives us
>>>>>> certain guarantees.
>>>>>> Personally I find this much easier to reason about than the
>>>>>> description in terms of splitting and merging SCC's in the CG and ref graph
>>>>>> (which the LazyCallGraph API makes one to think about since it hides the
>>>>>> underlying Tarjan's algorithm).
>>>>>> The LazyCallGraph API makes the current loop in
>>>>>> http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
>>>>>> very clean, but at least for my thinking about the problem, it seems like
>>>>>> the wrong abstraction (and most of the LazyCallGraph API seems to be
>>>>>> unused, so it seems like it may be overly heavyweight).
>>>>>> E.g. I think that maybe the easiest thing to do is to turn the
>>>>>> current approach inside out: instead of having the pass manager logic be
>>>>>> the "normal code" and forcing the Tarjan algorithm to become a state
>>>>>> machine of iterators, use an open-coded Tarjan algorithm with some
>>>>>> callbacks and make the pass management logic be the state machine.
>>>>>> This will also open the door to avoiding the potentially quadratic
>>>>>> size of the ref graph, since e.g. in the example I gave above, we can mark
>>>>>> the `funcs` array itself as already having been visited during the walk. In
>>>>>> the current LazyCallGraph, this would require adding some sort of notion of
>>>>>> hyperedge.
>>>>>>
>>>>>> Since this is such a high priority (due to blocking PGO inlining), I
>>>>>> will probably try my hand at implementing the CGSCC pass manager sometime
>>>>>> soon unless somebody beats me to it. (I'll probably try the "open-coded SCC
>>>>>> visit" approach).
>>>>>>
>>>>>> Another possibility is implementing the new CGSCC pass manager that
>>>>>> uses the same visitation semantics as the one in the old PM, and then we
>>>>>> can refactor that as needed. In fact, that may be the best approach so that
>>>>>> porting to the new PM is as NFC as possible and we can isolate the
>>>>>> functional (i.e., need benchmarks, measurements ...) changes in separate
>>>>>> commits.
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> A very high level comment: why do we need to update callgraph on the
>>>>> fly ? Can we have a more general support of iterative SCC pass invocation?
>>>>>
>>>>> something like:
>>>>>
>>>>> 1) build the callgraph
>>>>> 2) cache the post-order traversal order
>>>>>
>>>>> 3) if the order list is empty -- done
>>>>> 4) traversal: invoke function passes for each function on the order
>>>>> (step 2 or 5). The call graph gets updated on the fly (with new edges, or
>>>>> new nodes for cloned functions)
>>>>> 5) update the function traversal order from new nodes and new edges
>>>>> created in 4)
>>>>> 6) go to step 3).
>>>>>
>>>>
>>>> (sorry for the delayed reply... this is a very poignant question /
>>>> example)
>>>>
>>>> From the discussion with Chandler, I think he wants to provide more
>>>> guarantees to function passes about the visitation order. He will need to
>>>> explain his exact concerns. But IIRC the essence of one of the issues is
>>>> captured in the example I gave in 2.b) in the OP:
>>>>
>>>> "These function passes are e.g. able to DCE a call, removing an edge
>>>> from the CG. This converts, say, a CG SCC which is a cycle graph (like
>>>> a->b->c->a) into a path graph (a->b->c, with no edge back to a)."
>>>>
>>>> The issue that I remember Chandler brought up is that before deleting
>>>> an edge from an SCC, the nodes in the SCC are "unordered" w.r.t. each
>>>> other; but after deleting an edge from the CG which splits the SCC, there
>>>> is an order.
>>>> In other words, that the cached post-order may no longer be a
>>>> post-order traversal of the new graph after a function pass runs. For
>>>> example, in the cached post-order traversal we may have entered the SCC
>>>> `a->b->c->a` via an edge `x->b` and so our cached post-order traversal
>>>> visits the functions in the order `a, then c, then b`. If a function pass
>>>> visits `c` and deletes the call `c->a`, then "a, then c, then b" is not a
>>>> valid post-order for visiting the functions.
>>>>
>>>> Looking at the explanation I gave, I think there isn't really a problem
>>>> here. The cached post-order can only be invalidated for functions that have
>>>> already been visited.
>>>>
>>>
>>> If we merge CGSCC based transformation passes with an IPA analysis pass
>>> that requires strict/valid bottom up order, then updating the SCCs on the
>>> fly might be needed (or not if the result will still be conservatively
>>> correct) -- but interleaving IPA analyses with IPA transformations like
>>> this is not the right thing to do also for other reasons.
>>>
>>> Assuming CGSCC pass just does a predefined set of transformations at
>>> function level with some order, there is no issue of 'invalidation' -- the
>>> cached order will always be valid. In 2.b), why bother to revisit the
>>> nodes?
>>>
>>
>> Indeed. That is the type of doubt for why I was asking this question in
>> the OP.
>>
>>>
>>>
>>>> Speaking more broadly about the algorithm you just described, did you
>>>> intend to omit an SCC visitation step?
>>>>
>>>
>>> It is independent of this decision. However we can actually go back one
>>> step and ask the question: is adding this layer really necessary? Why not
>>> just traversing the cgraph nodes in reverse topo-order (after removing the
>>> cycles)?
>>>
>>>
>>>> The goal of the CGSCC pass manager is the ability to visit an SCC (e.g.
>>>> inliner visits an SCC), then immediately run function passes to simplify
>>>> the result of inlining.
>>>>
>>>
>>> This is how current implementation is. Is it a fundamental requirement?
>>>
>>
>> I think this is a good question. Hopefully Chandler will be able to reply
>> to this thread. From what I can tell, the exact requirements have never
>> really been discussed.
>>
>>
>>>
>>>
>>>
>>>> Only after the simplification has occurred do we visit the parent SCC.
>>>> By running the simplifications in lock-step with the inliner SCC visitation
>>>> we give parent SCC's a more accurate view of the real cost of inlining a
>>>> function from a child SCC.
>>>>
>>>
>>> This works for all bottom up scheme regardless of a SCC layer is added
>>> or not.
>>>
>>>
>>>>
>>>> Issues similar to 2.a) (i.e. adding edges to the CG) also affect the
>>>> visitation order of SCC's (not just functions). For example, we visit an
>>>> SCC with a CGSCC pass (e.g. inliner). Then we run the first function pass
>>>> on that SCC, which may add an edge (e.g. promote indirect call to direct
>>>> call) that may enlarge the SCC. Do we continue running the remaining
>>>> function passes? Do we re-visit this new enlarged SCC? (if so, when?) These
>>>> are the types of questions that motivated this post (hence the name
>>>> "Intended behavior of CGSCC pass manager.").
>>>>
>>>> yes, I understand the motivation -- that is way I propose the agorithm
>>> that uses cached iteration order + worklist based iterative approach. In
>>> your example, when a new edge is exposed via devirtualization, only its
>>> caller/ancestor nodes need to be revisited in the next iteration.
>>>
>>
>> Your proposal certainly sounds simpler and easier to reason about. (e.g.
>> it naturally avoids any issues with deletion)
>>
>> Since there are only 3 non-inliner CGSCC passes (ArgPromotion,
>> PostOrderFunctionAttrsLegacyPass, PruneEH) it may be feasible to remove
>> the entire abstraction from LLVM and replace it with a cached post-order
>> function pass visitation like you suggest. The inliner doesn't seem to do
>> anything special with the knowledge that it is visiting an SCC (besides
>> moving call sites that call within the SCC to the end of its worklist) and
>> so this may be fine.
>>
>>
> A) Using cached callgraph walk (nodes or SCC) to avoid cgraph invalidation
> and B) whether SCC based walk has more advantages are two different things
> to answer.
>
> For the later, I now tend to believe that introducing additional SCC layer
> has more advantages than alternative that uses node based bottom-up walk.
> 1) Bottom-up based IPA algorithms can be implemented more cleanly by using
> SCC. For instance the pruneEH or any bottom-up attribute propagations.
> Without using SCC, we will need to defined a generic (using template)
> worklist based bottom-up propagation engine.
> 2) All SCC passes can be nicely grouped together -- for compile time
> reasons or to create synergies for better performance.
>
>
> For A), to repeat myself, SCC DAG caching to avoid mutation for one round
> of CG walk has many advantages. We should probably discuss more on
> pros/cons and various scenarios we want to handle ..
>
One question is what invariants we want to provide for the visitation.
For example, should a CGSCC pass be able to assume that all "child" SCC's
(SCC's it can reach via direct calls emanating from the SCC being visited)
have already been visited? Currently I don't think it can, and IIRC from
the discussion at the social this is one thing that Chandler is hoping to
fix. The "ref edge" notion in LazyCallGraph ensures that we cannot create a
call edge (devirtualizing a ref edge) that will point at an SCC that has
not yet been visited.
E.g. consider this graph:
digraph G {
A -> B; B -> A; // SCC {A,B}
S -> T; T -> S; // SCC {S,T}
X -> Y; Y -> X; // SCC {X,Y}
B -> X;
B -> S;
T -> Y [label="Ref edge that is devirtualized\nwhen visiting SCC
{S,T}",constraint=false,style=dashed]
}
(can visualize conveniently at http://www.webgraphviz.com/ or I have put an
image at http://reviews.llvm.org/F2073104)
If we do not consider the ref graph, then it is possible for SCC {S,T} to
be visited before SCC {X,Y}. So after devirtualizing the call T->Y we can
no longer assume that the SCC's are visited in post-order (or must somehow
try to ignore that the call edge T->Y exists).
A more complicated case is when SCC {S,T} and SCC {X,Y} both call into each
other via function pointers. So eventually after devirtualizing the calls
in both directions there will be a single SCC {S,T,X,Y}.
digraph G {
A -> B; B -> A; // SCC {A,B}
S -> T; T -> S; // SCC {S,T}
X -> Y; Y -> X; // SCC {X,Y}
B -> X;
B -> S;
T -> Y [label="Ref edge that is devirtualized\nwhen visiting SCC
{S,T}",constraint=false,style=dashed]
X -> S [label="Ref edge that is devirtualized\nwhen visiting SCC
{X,Y}",constraint=false,style=dashed]
}
(rendering at: http://reviews.llvm.org/F2073479)
Due to the cyclic dependence there is no SCC visitation order that can
directly provide the invariant above. Indeed, the problem of maintaining
the above invariant is ill-posed in this scenario.
Consider the pipeline `cgscc(function(...simplifications that can
devirtualize...),foo-cgscc-pass)`. A possible visitation is as follows:
1. Visit SCC {S,T} and run `function(...simplifications that can
devirtualize...)`. This reveals the call edge T->Y.
2. We continue visiting SCC {S,T} and run foo-cgscc-pass on SCC {S,T}.
3. Visit SCC {X,Y} and run `function(...simplifications that can
devirtualize...)`. This reveals the call edge X->S.
4. ??? what do we do now.
Alternative 4.a) Should we continue the visitation and call foo-cgscc-pass
on "SCC" {X,Y}?
Alternative 4.b) Should foo-cgscc-pass now run on SCC {S,T,X,Y}?
Alternative 4.c) Should we restart the entire outer `cgscc(...)` visitation
on SCC {S,T,X,Y}?
(Without a cap both 4.b and 4.c could become quadratic on a graph like
http://reviews.llvm.org/F2073607)
-- Sean Silva
>
> thanks,
>
> David
>
>
>
>
>> Sean:~/pg/llvm % git grep 'public CallGraphSCCPass'
>> include/llvm/Transforms/IPO/InlinerPass.h:struct Inliner : public
>> CallGraphSCCPass {
>> lib/Transforms/IPO/ArgumentPromotion.cpp: struct ArgPromotion : public
>> CallGraphSCCPass {
>> lib/Transforms/IPO/FunctionAttrs.cpp:struct
>> PostOrderFunctionAttrsLegacyPass : public CallGraphSCCPass {
>> lib/Transforms/IPO/PruneEH.cpp: struct PruneEH : public CallGraphSCCPass
>> {
>> lib/Analysis/CallGraphSCCPass.cpp: class PrintCallGraphPass : public
>> CallGraphSCCPass {
>> tools/opt/PassPrinters.cpp:struct CallGraphSCCPassPrinter : public
>> CallGraphSCCPass {
>>
>> CGSCC passes seem to have been added in what is now SVN r8247 (~Aug 2003)
>> http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20030825/006619.html
>> (LLVM appears to have been in CVS at the time).
>>
>> Chris, do you remember the motivation for doing the CGSCC visitation
>> instead of a pure post-order function visitation like David is mentioning?
>> (or was it just an oversight / hindsight-20-20 thing?) Do you think it
>> would make sense to replace CGSCC visitation with post-order function
>> visitation in the current LLVM?
>>
>> -- Sean Silva
>>
>>
>>>
>>> David
>>>
>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Sorry for the wall of text.
>>>>>>
>>>>>>
>>>>>> -- Sean Silva
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160616/d64591d5/attachment-0001.html>
More information about the llvm-dev
mailing list