[LLVMdev] Postponing more passes in LTO

Sean Silva chisophugis at gmail.com
Thu Dec 18 02:45:56 PST 2014


On Wed, Sep 17, 2014 at 6:46 AM, Daniel Stewart <stewartd at codeaurora.org>
wrote:
>
> Looking at the existing flow of passes for LTO, it appears that almost all
> passes are run on a per-file basis, before the call to the gold linker. I'm
> looking to get people's feedback on whether there would be an advantage to
> waiting to run a number of these passes until the linking stage. For
> example, I believe I saw a post a little while back about postponing
> vectorization until the linking stage. It seems to me that there could be
> an advantage to postponing (some) passes until the linking stage, where the
> entire call graph is available. In general, what do people think about the
> idea of a different LTO flow where more passes are postponed until the
> linking stage?
>

AFAIK, we still mostly obey the per-TU optimization flags. E.g. if you pass
-O3 for each TU, we run the -O3 pipeline on each TU without really
accounting for the fact that we are doing LTO (or if we do, the adjustment
is fairly minimal). The per-TU optimization flags can have an enormous
impact on the final binary size. Here are some data points I recently
collected on one of our first-party games:

noLTO, -O3 per TU:  71.1 MiB  (control)
LTO,   -O3 per TU:  71.8 MiB  (~1% larger)
LTO,   -O0 per TU:  67.4 MiB  (~5% smaller)
LTO,   -O1 per TU:  68.5 MiB  (~4% smaller)
LTO,   -Os per TU:  65.3 MiB  (~8% smaller)

This is with a 3.4-based compiler, by the way, but it is in keeping with
what I observed last summer, so I assume the significant effect on binary
size is still present today. FYI, these ELF sizes are also without debug
info.
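
For completeness, the percentages above are just relative deltas against the
noLTO -O3 control; a trivial Python sketch of that arithmetic, with the
sizes hardcoded from the measurements above:

  sizes_mib = {
      "noLTO -O3 per TU": 71.1,  # control
      "LTO -O3 per TU": 71.8,
      "LTO -O0 per TU": 67.4,
      "LTO -O1 per TU": 68.5,
      "LTO -Os per TU": 65.3,
  }
  control = sizes_mib["noLTO -O3 per TU"]
  for config, size in sizes_mib.items():
      # Positive delta means larger than the control, negative means smaller.
      print(f"{config}: {size:.1f} MiB ({(size / control - 1) * 100:+.1f}% vs control)")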

Here is a visualization of those same binaries, broken down into text and
data sections (as reported by llvm-size; bss was not significantly
affected, so it is omitted):


http://i.imgur.com/Ie5Plgx.png

As you can see (and would expect), LTO does a good job of reducing the data
size, since it can use whole-program analysis to eliminate unused data.
This benefit does not depend on the per-TU optimization level, also as you
would expect.
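
In case anyone wants to reproduce this kind of breakdown, the per-section
numbers come straight from llvm-size; a rough Python sketch of how to
collect them, with hypothetical binary names and assuming the default
Berkeley-style "text data bss dec hex filename" output:

  import subprocess

  # Hypothetical binary paths, one per build configuration.
  binaries = ["game-noLTO-O3", "game-LTO-O3", "game-LTO-Os"]

  for path in binaries:
      out = subprocess.run(["llvm-size", path], check=True,
                           capture_output=True, text=True).stdout.splitlines()
      # Skip the header row; the first data row has text/data/bss in the
      # first three columns.
      text, data, bss = (int(field) for field in out[1].split()[:3])
      print(f"{path}: text={text} data={data} bss={bss}")
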
The text section, however, behaves differently. I'm still investigating,
but I suspect the size regression is largely due to excessive inlining (as
I think most people would expect). It is interesting to note that between
the -Os LTO case and the -O3 LTO case there is a text size difference of
(20.7/14.3 - 1) ~ 45%. Also, looking at this again, I don't understand why
I didn't collect -O2 data (I'll eventually need to re-gather these datasets
with a ToT compiler anyway, and I will be sure to grab -O2 data then); my
experience is that Clang's -O3 is sufficiently similar to -O2 that I'm
fairly confident the missing data will not significantly alter the findings
of my preliminary analysis in the coming days.

For starters, here is a plot showing how much of the total text size is
attributable to functions of each size, comparing -O3 noLTO with -O3 LTO:


http://i.imgur.com/pfIo0sy.png [*]
To understand this plot, imagine taking all the functions in the binary and
grouping them into a small number of buckets of similarly-sized functions.
Each bar represents one bucket, and its height represents the total size of
all the functions in that bucket. The width and position of the bar
indicate the range of function sizes the bucket covers.
Although the general behavior is a shift of the distribution to the right
(functions become larger with LTO), there is also an increase in the total
area under the bars. This is perhaps best visualized by the same plot with
each bar indicating the cumulative total (imagine calling std::partial_sum
on the list of bar heights from the previous plot):


http://i.imgur.com/q7Iq7AH.png
The overall text size regression adds up to nearly 25%.
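
For the curious, there is nothing fancy behind these plots; roughly the
following kind of Python sketch, assuming you already have the per-function
sizes in bytes (e.g. scraped from llvm-nm --print-size output; the values
below are placeholders):

  import numpy as np
  import matplotlib.pyplot as plt

  # Per-function sizes in bytes (placeholder values).
  func_sizes = np.array([48, 96, 200, 512, 2048, 8192, 100000])

  # Logarithmically spaced size buckets; weight each function by its own
  # size so that bar height = total bytes contributed by that bucket.
  bins = np.logspace(np.log10(func_sizes.min()),
                     np.log10(func_sizes.max()), num=30)
  bytes_per_bucket, edges = np.histogram(func_sizes, bins=bins,
                                         weights=func_sizes)
  # Running sum of the bucket totals gives the second (cumulative) plot.
  cumulative_bytes = np.cumsum(bytes_per_bucket)

  fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
  ax1.bar(edges[:-1], bytes_per_bucket, width=np.diff(edges), align="edge")
  ax2.bar(edges[:-1], cumulative_bytes, width=np.diff(edges), align="edge")
  for ax in (ax1, ax2):
      ax.set_xscale("log")
      ax.set_xlabel("function size (bytes)")
  ax1.set_ylabel("total bytes in bucket")
  ax2.set_ylabel("cumulative total bytes")
  plt.show()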


[*] The two outliers in the non-LTO case are:
- the global initializer function (_GLOBAL__I_a), whose size LTO reduces
from about 400k to about 100k (this single function accounts for the entire
right-most bar). Note: the right-most bar of the LTO dataset (functions
>100k) consists of this function (slimmed down to about 100k) plus one
other function that was subjected to an unusually large amount of inlining
and grew from 2k to about 125k.
- an unusually large dead function that LTO was able to remove but that was
not being removed before (this single function accounts for the entire
second-to-right-most bar).

-- Sean Silva


>
> Daniel Stewart
>
> --
>
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
> by The Linux Foundation