[llvm-dev] Intended behavior of CGSCC pass manager.

Wed Jun 8 16:36:48 PDT 2016

On Wed, Jun 8, 2016 at 9:32 AM, Hal Finkel <hfinkel at anl.gov> wrote:

>
> ------------------------------
>
> *From: *"Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org>
> *To: *"llvm-dev" <llvm-dev at lists.llvm.org>
> *Sent: *Wednesday, June 8, 2016 6:19:03 AM
> *Subject: *[llvm-dev] Intended behavior of CGSCC pass manager.
>
> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>
> (this is partially a summary of some discussions that happened at the last
> LLVM bay area social, and partially a discussion about the direction of the
> CGSCC pass manager)
>
>
> A the last LLVM social we discussed the progress on the CGSCC pass
> manager. It seems like Chandler has a CGSCC pass manager working, but it is
> still unresolved exactly which semantics we want (more about this below)
> that are reasonably implementable.
>
> AFAICT, there has been no public discussion about what exact semantics we
> ultimately want to have. We should figure that out.
>
> The main difficulty which Chandler described is the apparently quite
> complex logic surrounding needing to run function passes nested within an
> SCC pass manager, while providing some guarantees about exactly what order
> the function passes are run. The existing CGSCC pass manager just punts on
> some of the problems that arise (look in CGPassManager::runOnModule,
> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
> Chandler has been trying to solve.
>
> (
> Why is this "function passes inside CGSCC passes" stuff interesting?
> Because LLVM can do inlining on an SCC (often just a single function) and
> then run function passes to simplify the function(s) in the SCC before it
> tries to inline into a parent SCC. (the SCC visitation order is post-order)
> For example, we may inline a bunch of code, but after inlining we can
> tremendously simplify the function, and we want to do so before considering
> this function for inlining into its callers so that we get an accurate
> evaluation of the inline cost.
> Based on what Chandler said, it seems that LLVM is fairly unique in this
> regard and other compilers don't do this (which is why we can't just look
> at how other compilers solve this problem; they don't have this problem
> (maybe they should? or maybe we shouldn't?)). For example, he described
> that GCC uses different inlining "phases"; e.g. it does early inlining on
> the entire module, then does simplifications on the entire module, then
> does late inlining on the entire module; so it is not able to incrementally
> simplify as it inlines like LLVM does.
>
> This incremental simplification is an important feature of our inliner,
> and one we should endeavor to keep. We might also want different phases at
> some point (e.g. a top-down and a bottom-up phase), but that's another
> story.
>
> )
>
> As background for what is below, the LazyCallGraph tracks two graphs: the
> "call graph" and the "ref graph".
> Conceptually, the call graph is the graph of direct calls, where indirect
> calls and calls to external functions do not appear (or are connected to
> dummy nodes). The ref graph is basically the graph of all functions
> transitively accessible based on the globals/constants/etc. referenced by a
> function (e.g. if a function `foo` references a vtable that is defined in
> the module, there is an edge in the ref graph from `foo` to every function
> in the vtable).
> The call graph is a strict subset of the ref graph.
>
> Chandler described that he had a major breakthrough in that the CGSCC pass
> manager only had to deal with 3 classes of modifications that can occur:
> - a pass may e.g. propagate a load of a function pointer into an indirect
> call, turning it into an direct call. This requires adding an edge in the
> CG but not in the ref graph.
> - a pass may take a direct call and turn it into an indirect call. This
> requires removing an edge from the CG, but not in the ref graph.
> - a pass may delete a direct call. This removes an edge in the CG and also
> in the ref graph.
>
> From the perspective of the CGSCC pass manager, these operations can
> affect the SCC structure. Adding an edge might merge SCC's and deleting an
> edge might split SCC's. Chandler mentioned that apparently the issues of
> splitting and merging SCC's within the current infrastructure are actually
> quite challenging and lead to e.g. iterator invalidation issues, and that
> is what he is working on.
>
> (
> The ref graph is important to guide the overall SCC visitation order
> because it basically represents "the largest graph that the CG may turn
> into due to our static analysis of this module". I.e. no transformation we
> can statically make in the CGSCC passes can ever cause us to need to merge
> SCC's in the ref graph.
> )
>
>
>
> I have a couple overall questions/concerns:
>
>
> 1. The ref graph can easily go quadratic. E.g.
>
> typedef void (*fp)();
> fp funcs[] = {
>   &foo1,
>   &foo2,
>   ...
>   &fooN
> }
> void foo1() { funcs[something](); }
> void foo2() { funcs[something](); }
> ...
> void fooN() { funcs[something](); }
>
> One real-world case where this might come about is in the presence of
> vtables.
>
> The existing CGSCC pass manager does not have this issue AFAIK because it
> does not consider the ref graph.
>
> Does anybody have any info/experience about how densely connected the ref
> graph can get in programs that might reasonably be fed to the compiler?
> I just did a quick sanity check with LLD/ELF using
> `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
> / not run out of memory. Based on some rough calculations looking at the
> profile, it seem like the entire run of the inliner in the old LTO pipeline
> (which is about 5% of total LTO time on this particular example I looked
> at) is only 2-3x as expensive as just
> `--lto-newpm-passes=cgscc(no-op-cgscc)`, so the LazyCallGraph construction
> is certainly not free.
>
>
> 2. What is the intended behavior of CGSCC passes when SCC's are split or
> merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run
> some function passes nested inside the CGSCC pass manager (e.g. to simplify
> things after inlining). Consider:
>
> This is not how I thought the current scheme worked ;) -- I was under the
> impression that we had a call graph with conservatively-connected dummy
> nodes for external/indirect functions. As a result, there is no
> semantics-preserving optimization that will merge SCCs, only split them. In
> that case, I'd expect that once an SCC is split, we re-run the CGSCC passes
> over the newly-separated SCCs. But this corresponds to running over the
> "ref graph", as you describe it. I don't understand why we want the
> non-conservative graph.
>
>
> a) These function passes are e.g. now able to devirtualize a call, adding
> an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC pass
> (say, the inliner) on this larger SCC?
>
> b) These function passes are e.g. able to DCE a call, removing an edge
> from the CG. This converts, say, a CG SCC which is a cycle graph (like
> a->b->c->a) into a path graph (a->b->c, with no edge back to a). The
> inliner had already visited a, b, and c as a single SCC. Now does it have
> to re-visit c, then b, then a, as single-node SCC's?
>
>
> btw:
>
> One way that I have found it useful to think about this is in terms of the
> visitation during Tarjan's SCC algorithm. I'll reference the pseudocode in
> https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
> Inside the "strongconnect" routine when we have identified an SCC (the true
> branch of `if (v.lowlink = v.index)` test ) we can visit
> stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
> things on the stack (the variable `S` in the pseudocode) and we may need to
> fix it up (e.g. inlining deleted a function, so we can't have an entry on
> the stack). Then, we can run function passes as we pop individual functions
> off the stack, but it is easier to think about IMO than merging of SCC data
> structures: if we add edges to the CG then we have to do more DFS on the
> new edges and if we delete edges then the DFS order of the stack gives us
> certain guarantees.
> Personally I find this much easier to reason about than the description in
> terms of splitting and merging SCC's in the CG and ref graph (which the
> LazyCallGraph API makes one to think about since it hides the underlying
> Tarjan's algorithm).
> The LazyCallGraph API makes the current loop in
> http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
> very clean, but at least for my thinking about the problem, it seems like
> the wrong abstraction (and most of the LazyCallGraph API seems to be
> unused, so it seems like it may be overly heavyweight).
> E.g. I think that maybe the easiest thing to do is to turn the current
> approach inside out: instead of having the pass manager logic be the
> "normal code" and forcing the Tarjan algorithm to become a state machine of
> iterators, use an open-coded Tarjan algorithm with some callbacks and make
> the pass management logic be the state machine.
> This will also open the door to avoiding the potentially quadratic size of
> the ref graph, since e.g. in the example I gave above, we can mark the
> `funcs` array itself as already having been visited during the walk. In the
> current LazyCallGraph, this would require adding some sort of notion of
> hyperedge.
>
> FWIW, I see no purpose in abstracting one algorithm, especially if that
> makes things algorithmically harder. Also, the LazyCallGraph abstraction
> and the iterator abstraction seem like separate issues. Iterator
> abstractions are often useful because you can use them in generic
> algorithms, etc.
>
>
> Since this is such a high priority (due to blocking PGO inlining), I will
> probably try my hand at implementing the CGSCC pass manager sometime soon
> unless somebody beats me to it. (I'll probably try the "open-coded SCC
> visit" approach).
>
> Another possibility is implementing the new CGSCC pass manager that uses
> the same visitation semantics as the one in the old PM, and then we can
> refactor that as needed. In fact, that may be the best approach so that
> porting to the new PM is as NFC as possible and we can isolate the
> functional (i.e., need benchmarks, measurements ...) changes in separate
> commits.
>
> I'm in favor of this approach for exactly the reason you mention. Being
> able to bisect regressions to the algorithmic change, separate from the
> infrastructure change, will likely make things easier in the long run (and
> will avoid the problem, to the extent possible, of performance regressions
> blocking the pass-manager work).
>

Yeah. Last night after I went home I was thinking to myself that this
really seems like the obvious path forward. I will start implementing this.

-- Sean Silva

>
>
> Sorry for the wall of text.
>
> No problem; much appreciated.
>
>  -Hal
>
>
> -- Sean Silva
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160608/7b29ab64/attachment-0001.html>