[PATCH] D36352: [LCG] Switch one of the update methods for the LazyCallGraph to support limited batch updates.

Sean Silva via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Sat Aug 5 21:01:21 PDT 2017


silvas added a comment.

In https://reviews.llvm.org/D36352#833140, @chandlerc wrote:

> In https://reviews.llvm.org/D36352#833118, @silvas wrote:
>
> > > When compiling the IR for this file with 'opt' and 'https://reviews.llvm.org/owners/package/3/', this patch reduces the total time by 8-9%.
> >
> > Just as a point of reference, do you have handy how much of the total compile time are we spending in the PM's CGSCC / LCG  stuff?
>
>
> Yep, that's how I got here. Before all of my changes to LCG, we spent just under 40% of the total optimizer time (and nearly that much of the total compile time) in this one method.
>
> Removing the parent sets made this method obnoxiously faster (over 2x) and shaved just over 20% off the total optimizer time.
>
> This patch makes this method another 2x faster, and shaves 8% off the total optimizer time.
>
> After both of these, this method remains the only thing hot really, and it is now between 6-8% depending on the noise in the profile. I'm still looking for more things to make faster here. I think i have another 2-4 things that will make it faster.
>
> > If we can speed up the overall time by 8-9% by improving it then it suggests that the total time is much larger, which I find somewhat surprising (presumably, we should be spending the vast majority of our time inside of the optimizations themselves).
>
> Me too. Turns out this method is really, really hot.
>
> > Is this module you're working on just a really pathological case?
>
> It is entirely possible this module is pathological -- I've not seen this when profiling other compiles.
>
> Unfortunately, it is a pathological case that is *generated code* that for me, happens to represent like 10% of all the files we compile and many of the slowest files we compile. So even if this module is really, really weird I still need it to compile very fast. =]
>
> > Or is this maybe a situation where the denormalized ref edge representation is causing a bunch of extra work?
>
> I'm sure this is part of the problem. But fixing this is (much) harder than making this routine fast. And it doesn't *appear* to be the primary problem. I instrumented this routine expecting to see removal of 100,000+ edges all the time and think "oh, so its denormalized...". That wasn't what happened. In this (potentially pathological) case, there are only like 7k ref edges ever removed in the entire opt run! And most (some 3.5k) are removed due to a *single* ref edge becoming dead after the inliner runs. The fact that the optimizer is deleting one edge at a time is pretty strong evidence that the denormalization isn't as prominent in this problem as I would have expected. I'm pretty sure based on these numbers that the real reason this ends up slow in this module is the size of RefSCC that we're removing edges from.


In case you haven't dug in yet, I just dug in a bit in Mathematica out of curiosity, and it seems that the large RefSCCs come from the lazy descriptor initialization.

This is the only RefSCC in the .pb.cc file I tried that wasn't just a single node. The .proto this was generated from contains a message `Foo` containing a message `Bar` containing a message `Message3` (.proto: https://reviews.llvm.org/F4745358, .pb.cc: https://reviews.llvm.org/F4745329, .pb.h: https://reviews.llvm.org/F4745572).

F4740125: Screenshot from 2017-08-05 20-27-01.png <https://reviews.llvm.org/F4740125>
(sorry, Mathematica doesn't have an easy way to jitter the nodes with this layout so that the labels don't overlap; bummer)

`_Z31protobuf_AssignDesc_foo_2eprotov` contains a call edge to `_Z28protobuf_AddDesc_foo_2eprotov`, which contains ref edges to `_ZN3FooC2Ev` and the corresponding constructors for Baz and Message3. In turn, these constructors reference the vtable pointer, which leads to `MergeFrom` (the virtual one taking a `::google::protobuf::Message`) and the virtual `MergePartialFromCodedStream`, which are big methods that recursively call into similar code of any contained messages.

In other words, it seems that there will always be a RefSCC whose size is approximately O(number of messages defined in a given .proto file), which explains why you're seeing such massive RefSCCs.

(the dot file this came from is https://reviews.llvm.org/F4735714 which was generated with `opt -passes=print-lcg-dot`)

> Still, I suspect *this* patch to be mostly making the denormalized representation scale better (by batching their removal). But it seems much easier to do this than to switch representations, and this patch actually makes the code simpler, so it seemed like a fine intermediate step.

Makes sense. Just looking at the graph above, it makes intuitive sense: since vtables don't contain pointers to other vtables, the big fanout doesn't end up being multiplied by the transitive closure of base classes or anything like that. I.e., the ref fanout is limited by the size of a single vtable, rather than potentially the total size of all vtables in the module. So that's not terribly problematic.


https://reviews.llvm.org/D36352