[llvm-dev] Intended behavior of CGSCC pass manager.

Wed Jun 8 12:36:18 PDT 2016

Hi,

Does it make sense to change RefSCCs to hold a list of
RefSCC-DAG-Roots that were split out of it because of edge deletion?
Then one way to phrase the inliner/function pass iteration would be
(assuming I understand the issues):

  Stack.push(RefSCC_Leaves);
  while (!Stack.empty()) {
    RefSCC = Stack.pop();
    InlineCallSites(RefSCC);
    if (!RefSCC.splitOutSCCs.empty())
      goto repush;
    for each func in RefSCC:
      FPM.run(func);
      if (!RefSCC.splitOutSCCs.empty())
        goto repush;
    continue;
repush:
     for (refscc_dag_root in RefSCCs.splitOutSCCs)
       // here we don't want to push every leaf, but leafs that
       // have functions that haven't had the FPM run on (maybe we can
do this by maintaining a set?)
       // if we don't push a leaf, we push its parent (which we want
to push even if we've run FPM on it
       // since we'd like to re-run the inliner on it).
       refscc_dag_root.push_leaves_to(Stack)
   }

(I know this isn't ideal, since now RefSCC is no longer "just a data
structure", but is actually has incidental information.)

On Wed, Jun 8, 2016 at 12:07 PM, Finkel, Hal J. via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> Sent from my Verizon Wireless 4G LTE DROID
>
> On Jun 8, 2016 1:58 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>
>>
>>> On Jun 8, 2016, at 9:32 AM, Hal Finkel via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>>
>>> ________________________________
>>>>
>>>> From: "Sean Silva via llvm-dev" <llvm-dev at lists.llvm.org>
>>>> To: "llvm-dev" <llvm-dev at lists.llvm.org>
>>>> Sent: Wednesday, June 8, 2016 6:19:03 AM
>>>> Subject: [llvm-dev] Intended behavior of CGSCC pass manager.
>>>>
>>>> Hi Chandler, Philip, Mehdi, (and llvm-dev,)
>>>>
>>>> (this is partially a summary of some discussions that happened at the
>>>> last LLVM bay area social, and partially a discussion about the direction of
>>>> the CGSCC pass manager)
>>>>
>>>>
>>>> A the last LLVM social we discussed the progress on the CGSCC pass
>>>> manager. It seems like Chandler has a CGSCC pass manager working, but it is
>>>> still unresolved exactly which semantics we want (more about this below)
>>>> that are reasonably implementable.
>>>>
>>>> AFAICT, there has been no public discussion about what exact semantics
>>>> we ultimately want to have. We should figure that out.
>>>>
>>>> The main difficulty which Chandler described is the apparently quite
>>>> complex logic surrounding needing to run function passes nested within an
>>>> SCC pass manager, while providing some guarantees about exactly what order
>>>> the function passes are run. The existing CGSCC pass manager just punts on
>>>> some of the problems that arise (look in CGPassManager::runOnModule,
>>>> CGPassManager::RunAllPassesOnSCC, and CGPassManager::RunPassOnSCC in
>>>> llvm/lib/Analysis/CallGraphSCCPass.cpp), and these are the problems that
>>>> Chandler has been trying to solve.
>>>>
>>>> (
>>>> Why is this "function passes inside CGSCC passes" stuff interesting?
>>>> Because LLVM can do inlining on an SCC (often just a single function) and
>>>> then run function passes to simplify the function(s) in the SCC before it
>>>> tries to inline into a parent SCC. (the SCC visitation order is post-order)
>>>> For example, we may inline a bunch of code, but after inlining we can
>>>> tremendously simplify the function, and we want to do so before considering
>>>> this function for inlining into its callers so that we get an accurate
>>>> evaluation of the inline cost.
>>>> Based on what Chandler said, it seems that LLVM is fairly unique in this
>>>> regard and other compilers don't do this (which is why we can't just look at
>>>> how other compilers solve this problem; they don't have this problem (maybe
>>>> they should? or maybe we shouldn't?)). For example, he described that GCC
>>>> uses different inlining "phases"; e.g. it does early inlining on the entire
>>>> module, then does simplifications on the entire module, then does late
>>>> inlining on the entire module; so it is not able to incrementally simplify
>>>> as it inlines like LLVM does.
>>>
>>> This incremental simplification is an important feature of our inliner,
>>> and one we should endeavor to keep. We might also want different phases at
>>> some point (e.g. a top-down and a bottom-up phase), but that's another
>>> story.
>>>>
>>>> )
>>>>
>>>> As background for what is below, the LazyCallGraph tracks two graphs:
>>>> the "call graph" and the "ref graph".
>>>> Conceptually, the call graph is the graph of direct calls, where
>>>> indirect calls and calls to external functions do not appear (or are
>>>> connected to dummy nodes). The ref graph is basically the graph of all
>>>> functions transitively accessible based on the globals/constants/etc.
>>>> referenced by a function (e.g. if a function `foo` references a vtable that
>>>> is defined in the module, there is an edge in the ref graph from `foo` to
>>>> every function in the vtable).
>>>> The call graph is a strict subset of the ref graph.
>>>>
>>>> Chandler described that he had a major breakthrough in that the CGSCC
>>>> pass manager only had to deal with 3 classes of modifications that can
>>>> occur:
>>>> - a pass may e.g. propagate a load of a function pointer into an
>>>> indirect call, turning it into an direct call. This requires adding an edge
>>>> in the CG but not in the ref graph.
>>>> - a pass may take a direct call and turn it into an indirect call. This
>>>> requires removing an edge from the CG, but not in the ref graph.
>>>> - a pass may delete a direct call. This removes an edge in the CG and
>>>> also in the ref graph.
>>>>
>>>> From the perspective of the CGSCC pass manager, these operations can
>>>> affect the SCC structure. Adding an edge might merge SCC's and deleting an
>>>> edge might split SCC's. Chandler mentioned that apparently the issues of
>>>> splitting and merging SCC's within the current infrastructure are actually
>>>> quite challenging and lead to e.g. iterator invalidation issues, and that is
>>>> what he is working on.
>>>>
>>>> (
>>>> The ref graph is important to guide the overall SCC visitation order
>>>> because it basically represents "the largest graph that the CG may turn into
>>>> due to our static analysis of this module". I.e. no transformation we can
>>>> statically make in the CGSCC passes can ever cause us to need to merge SCC's
>>>> in the ref graph.
>>>> )
>>>>
>>>>
>>>>
>>>> I have a couple overall questions/concerns:
>>>>
>>>>
>>>> 1. The ref graph can easily go quadratic. E.g.
>>>>
>>>> typedef void (*fp)();
>>>> fp funcs[] = {
>>>>   &foo1,
>>>>   &foo2,
>>>>   ...
>>>>   &fooN
>>>> }
>>>> void foo1() { funcs[something](); }
>>>> void foo2() { funcs[something](); }
>>>> ...
>>>> void fooN() { funcs[something](); }
>>>>
>>>> One real-world case where this might come about is in the presence of
>>>> vtables.
>>>>
>>>> The existing CGSCC pass manager does not have this issue AFAIK because
>>>> it does not consider the ref graph.
>>>>
>>>> Does anybody have any info/experience about how densely connected the
>>>> ref graph can get in programs that might reasonably be fed to the compiler?
>>>> I just did a quick sanity check with LLD/ELF using
>>>> `--lto-newpm-passes=cgscc(no-op-cgscc)` and it at least seemed to terminate
>>>> / not run out of memory. Based on some rough calculations looking at the
>>>> profile, it seem like the entire run of the inliner in the old LTO pipeline
>>>> (which is about 5% of total LTO time on this particular example I looked at)
>>>> is only 2-3x as expensive as just `--lto-newpm-passes=cgscc(no-op-cgscc)`,
>>>> so the LazyCallGraph construction is certainly not free.
>>>>
>>>>
>>>> 2. What is the intended behavior of CGSCC passes when SCC's are split or
>>>> merged? E.g. a CGSCC pass runs on an SCC (e.g. the inliner). Now we run some
>>>> function passes nested inside the CGSCC pass manager (e.g. to simplify
>>>> things after inlining). Consider:
>>>
>>> This is not how I thought the current scheme worked ;) -- I was under the
>>> impression that we had a call graph with conservatively-connected dummy
>>> nodes for external/indirect functions.
>>
>>
>> The fact that we have a separate nodes for calling into an external
>> function and "being called" from an external function, these don't form SCC.
>> So it means we can end up merging SCC IIUC.
>
> Yes, although I thought there was only one dummy node for those things.
>
> -Hal
>
>>
>> --
>> Mehdi
>>
>>
>>
>>> As a result, there is no semantics-preserving optimization that will
>>> merge SCCs, only split them. In that case, I'd expect that once an SCC is
>>> split, we re-run the CGSCC passes over the newly-separated SCCs. But this
>>> corresponds to running over the "ref graph", as you describe it. I don't
>>> understand why we want the non-conservative graph.
>>>
>>>>
>>>> a) These function passes are e.g. now able to devirtualize a call,
>>>> adding an edge to the CG, forming a larger CG SCC. Do you re-run the CGSCC
>>>> pass (say, the inliner) on this larger SCC?
>>>>
>>>> b) These function passes are e.g. able to DCE a call, removing an edge
>>>> from the CG. This converts, say, a CG SCC which is a cycle graph (like
>>>> a->b->c->a) into a path graph (a->b->c, with no edge back to a). The inliner
>>>> had already visited a, b, and c as a single SCC. Now does it have to
>>>> re-visit c, then b, then a, as single-node SCC's?
>>>>
>>>>
>>>> btw:
>>>>
>>>> One way that I have found it useful to think about this is in terms of
>>>> the visitation during Tarjan's SCC algorithm. I'll reference the pseudocode
>>>> in
>>>> https://en.wikipedia.org/wiki/Tarjan%27s_strongly_connected_components_algorithm.
>>>> Inside the "strongconnect" routine when we have identified an SCC (the true
>>>> branch of `if (v.lowlink = v.index)` test ) we can visit
>>>> stack[v.index:stack.size()] as an SCC. This may or may not invalidate some
>>>> things on the stack (the variable `S` in the pseudocode) and we may need to
>>>> fix it up (e.g. inlining deleted a function, so we can't have an entry on
>>>> the stack). Then, we can run function passes as we pop individual functions
>>>> off the stack, but it is easier to think about IMO than merging of SCC data
>>>> structures: if we add edges to the CG then we have to do more DFS on the new
>>>> edges and if we delete edges then the DFS order of the stack gives us
>>>> certain guarantees.
>>>> Personally I find this much easier to reason about than the description
>>>> in terms of splitting and merging SCC's in the CG and ref graph (which the
>>>> LazyCallGraph API makes one to think about since it hides the underlying
>>>> Tarjan's algorithm).
>>>> The LazyCallGraph API makes the current loop in
>>>> http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/include/llvm/Analysis/CGSCCPassManager.h;272124$100
>>>> very clean, but at least for my thinking about the problem, it seems like
>>>> the wrong abstraction (and most of the LazyCallGraph API seems to be unused,
>>>> so it seems like it may be overly heavyweight).
>>>> E.g. I think that maybe the easiest thing to do is to turn the current
>>>> approach inside out: instead of having the pass manager logic be the "normal
>>>> code" and forcing the Tarjan algorithm to become a state machine of
>>>> iterators, use an open-coded Tarjan algorithm with some callbacks and make
>>>> the pass management logic be the state machine.
>>>> This will also open the door to avoiding the potentially quadratic size
>>>> of the ref graph, since e.g. in the example I gave above, we can mark the
>>>> `funcs` array itself as already having been visited during the walk. In the
>>>> current LazyCallGraph, this would require adding some sort of notion of
>>>> hyperedge.
>>>
>>> FWIW, I see no purpose in abstracting one algorithm, especially if that
>>> makes things algorithmically harder. Also, the LazyCallGraph abstraction and
>>> the iterator abstraction seem like separate issues. Iterator abstractions
>>> are often useful because you can use them in generic algorithms, etc.
>>>>
>>>>
>>>> Since this is such a high priority (due to blocking PGO inlining), I
>>>> will probably try my hand at implementing the CGSCC pass manager sometime
>>>> soon unless somebody beats me to it. (I'll probably try the "open-coded SCC
>>>> visit" approach).
>>>>
>>>> Another possibility is implementing the new CGSCC pass manager that uses
>>>> the same visitation semantics as the one in the old PM, and then we can
>>>> refactor that as needed. In fact, that may be the best approach so that
>>>> porting to the new PM is as NFC as possible and we can isolate the
>>>> functional (i.e., need benchmarks, measurements ...) changes in separate
>>>> commits.
>>>
>>> I'm in favor of this approach for exactly the reason you mention. Being
>>> able to bisect regressions to the algorithmic change, separate from the
>>> infrastructure change, will likely make things easier in the long run (and
>>> will avoid the problem, to the extent possible, of performance regressions
>>> blocking the pass-manager work).
>>>>
>>>>
>>>> Sorry for the wall of text.
>>>
>>> No problem; much appreciated.
>>>
>>>  -Hal
>>>
>>>>
>>>> -- Sean Silva
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>
>>>
>>>
>>>
>>> --
>>> Hal Finkel
>>> Assistant Computational Scientist
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>

-- 
Sanjoy Das
http://playingwithpointers.com