<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Dec 17, 2016 at 1:35 PM, Davide Italiano via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">First of all, sorry for the long mail.<br>
Inspired by the excellent analysis Rui did for lld, I decided to do<br>
the same for llvm.<br>
I'm personally very interested in build-time for LTO configuration,<br>
with particular attention to the time spent in the optimizer.<br>
Rafael did something similar back in March, so this can be considered<br>
as an update. This tries to include a more accurate high-level<br>
analysis of where llvm is spending CPU cycles.<br>
Here I present 2 cases: clang building itself with `-flto` (Full), and<br>
clang building an internal codebase which I'm going to refer as<br>
`game7`.<br>
It's a mid-sized program (it's actually a game), more or less of the<br>
size of clang, which we use internally as benchmark to track<br>
compile-time/runtime improvements/regression.<br>
I picked two random revisions of llvm: trunk (December 16th 2016) and<br>
trunk (June 2nd 2016), so, roughly, 6 months period.<br>
My setup is a Mac Pro running Linux (NixOS).<br>
These are the numbers I collected (including the output of -mllvm -time-passes).<br>
For clang:<br>
<br>
June 2nd:<br>
real  22m9.278s<br>
user  21m30.410s<br>
sys   0m38.834s<br>
 Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)<br>
 289.8102 ( 23.5%) 18.8891 ( 53.7%) 308.6993 ( 24.3%) 308.6906 (<br>
24.3%)Â X86 DAG->DAG Instruction Selection<br>
 97.2730 ( 7.9%)  0.7656 ( 2.2%) 98.0386 ( 7.7%) 98.0010 (<br>
7.7%)Â Global Value Numbering<br>
 62.4091 ( 5.1%)  0.4779 ( 1.4%) 62.8870 ( 4.9%) 62.8665 (<br>
5.0%)Â Function Integration/Inlining<br>
 58.6923 ( 4.8%)  0.4767 ( 1.4%) 59.1690 ( 4.7%) 59.1323 (<br>
4.7%)Â Combine redundant instructions<br>
 53.9602 ( 4.4%)  0.6163 ( 1.8%) 54.5765 ( 4.3%) 54.5409 (<br>
4.3%)Â Combine redundant instructions<br>
 51.0470 ( 4.1%)  0.5703 ( 1.6%) 51.6173 ( 4.1%) 51.5425 (<br>
4.1%)Â Loop Strength Reduction<br>
 47.4067 ( 3.8%)  1.3040 ( 3.7%) 48.7106 ( 3.8%) 48.7034 (<br>
3.8%)Â Greedy Register Allocator<br>
 36.7463 ( 3.0%)  0.8133 ( 2.3%) 37.5597 ( 3.0%) 37.4612 (<br>
3.0%)Â Induction Variable Simplification<br>
 37.0125 ( 3.0%)  0.2699 ( 0.8%) 37.2824 ( 2.9%) 37.2478 (<br>
2.9%)Â Combine redundant instructions<br>
 34.2071 ( 2.8%)  0.2737 ( 0.8%) 34.4808 ( 2.7%) 34.4487 (<br>
2.7%)Â Combine redundant instructions<br>
 25.6627 ( 2.1%)  0.3215 ( 0.9%) 25.9842 ( 2.0%) 25.9509 (<br>
2.0%)Â Combine redundant instructions<br>
<br>
Dec 16th:<br>
real  27m34.922s<br>
user  26m53.489s<br>
sys   0m41.533s<br>
<br>
 287.5683 ( 18.5%) 19.7048 ( 52.3%) 307.2731 ( 19.3%) 307.2648 (<br>
19.3%)Â X86 DAG->DAG Instruction Selection<br>
 197.9211 ( 12.7%)  0.5104 ( 1.4%) 198.4314 ( 12.5%) 198.4091 (<br>
12.5%)Â Function Integration/Inlining<br>
 106.9669 ( 6.9%)  0.8316 ( 2.2%) 107.7984 ( 6.8%) 107.7633 (<br>
6.8%)Â Global Value Numbering<br>
 89.7571 ( 5.8%)  0.4840 ( 1.3%) 90.2411 ( 5.7%) 90.2067 (<br>
5.7%)Â Combine redundant instructions<br>
 79.0456 ( 5.1%)  0.7534 ( 2.0%) 79.7990 ( 5.0%) 79.7630 (<br>
5.0%)Â Combine redundant instructions<br>
 55.6393 ( 3.6%)  0.3116 ( 0.8%) 55.9509 ( 3.5%) 55.9187 (<br>
3.5%)Â Combine redundant instructions<br>
 51.8663 ( 3.3%)  1.4090 ( 3.7%) 53.2754 ( 3.3%) 53.2684 (<br>
3.3%)Â Greedy Register Allocator<br>
 52.5721 ( 3.4%)  0.3021 ( 0.8%) 52.8743 ( 3.3%) 52.8416 (<br>
3.3%)Â Combine redundant instructions<br>
 49.0593 ( 3.2%)  0.6101 ( 1.6%) 49.6694 ( 3.1%) 49.5904 (<br>
3.1%)Â Loop Strength Reduction<br>
 41.2602 ( 2.7%)  0.9608 ( 2.5%) 42.2209 ( 2.7%) 42.1122 (<br>
2.6%)Â Induction Variable Simplification<br>
 38.1438 ( 2.5%)  0.3486 ( 0.9%) 38.4923 ( 2.4%) 38.4603 (<br>
2.4%)Â Combine redundant instructions<br>
<br>
so, llvm is around 20% slower than it used to be.<br>
<br>
For our internal codebase the situation seems slightly worse:<br>
<br>
`game7`<br>
<br>
June 2nd:<br>
<br>
Total Execution Time: 464.3920 seconds (463.8986 wall clock)<br>
<br>
 88.0204 ( 20.3%)  6.0310 ( 20.0%) 94.0514 ( 20.3%) 94.0473 (<br>
20.3%)Â X86 DAG->DAG Instruction Selection<br>
 27.4382 ( 6.3%) 16.2437 ( 53.9%) 43.6819 ( 9.4%) 43.6823 (<br>
9.4%)Â X86 Assembly / Object Emitter<br>
 34.9581 ( 8.1%)  0.5274 ( 1.8%) 35.4855 ( 7.6%) 35.4679 (<br>
7.6%)Â Function Integration/Inlining<br>
 27.8556 ( 6.4%)  0.3419 ( 1.1%) 28.1975 ( 6.1%) 28.1824 (<br>
6.1%)Â Global Value Numbering<br>
 22.1479 ( 5.1%)  0.2258 ( 0.7%) 22.3737 ( 4.8%) 22.3593 (<br>
4.8%)Â Combine redundant instructions<br>
 19.2346 ( 4.4%)  0.3639 ( 1.2%) 19.5985 ( 4.2%) 19.5870 (<br>
4.2%)Â Post RA top-down list latency scheduler<br>
 15.8085 ( 3.6%)  0.2675 ( 0.9%) 16.0760 ( 3.5%) 16.0614 (<br>
3.5%)Â Combine redundant instructions<br>
<br>
Dec 16th:<br>
<br>
Total Execution Time: 861.0898 seconds (860.5808 wall clock)<br>
<br>
 135.7207 ( 15.7%)  0.2484 ( 0.8%) 135.9692 ( 15.2%) 135.9531 (<br>
15.2%)Â Combine redundant instructions<br>
 103.6609 ( 12.0%)  0.4566 ( 1.4%) 104.1175 ( 11.7%) 104.1014 (<br>
11.7%)Â Combine redundant instructions<br>
 97.1083 ( 11.3%)  6.9183 ( 21.8%) 104.0266 ( 11.6%) 104.0181 (<br>
11.6%)Â X86 DAG->DAG Instruction Selection<br>
 72.6125 ( 8.4%)  0.1701 ( 0.5%) 72.7826 ( 8.1%) 72.7678 (<br>
8.1%)Â Combine redundant instructions<br>
 69.2144 ( 8.0%)  0.6060 ( 1.9%) 69.8204 ( 7.8%) 69.8007 (<br>
7.8%)Â Function Integration/Inlining<br>
 60.7837 ( 7.1%)  0.3783 ( 1.2%) 61.1620 ( 6.8%) 61.1455 (<br>
6.8%)Â Global Value Numbering<br>
 56.5650 ( 6.6%)  0.1980 ( 0.6%) 56.7630 ( 6.4%) 56.7476 (<br>
6.4%)Â Combine redundant instructions<br>
<br>
so, using LTO, lld takes 2x to build what it used to take (and all the<br>
extra time seems spent in the optimizer).<br>
<br>
As an (extra) experiment, I decided to take the unoptimized output of<br>
game7 (via lld -save-temps) and pass to -opt -O2. That shows another<br>
significant regression (with different characteristics).<br>
<br>
June 2nd:<br>
time opt -O2<br>
real  6m23.016s<br>
user  6m20.900s<br>
sys   0m2.113s<br>
<br>
35.9071 ( 10.0%)Â Â 0.7996 ( 10.9%)Â 36.7066 ( 10.0%)Â 36.6900 ( 10.1%)<br>
 Function Integration/Inlining<br>
33.4045 (Â 9.3%)Â Â 0.4053 (Â 5.5%)Â 33.8098 (Â 9.3%)Â 33.7919 (Â 9.3%)<br>
 Global Value Numbering<br>
27.1053 (Â 7.6%)Â Â 0.5940 (Â 8.1%)Â 27.6993 (Â 7.6%)Â 27.6995 (Â 7.6%)<br>
 Bitcode Writer<br>
25.6492 (Â 7.2%)Â Â 0.2491 (Â 3.4%)Â 25.8984 (Â 7.1%)Â 25.8805 (Â 7.1%)<br>
 Combine redundant instructions<br>
19.2686 (Â 5.4%)Â Â 0.2956 (Â 4.0%)Â 19.5642 (Â 5.4%)Â 19.5471 (Â 5.4%)<br>
 Combine redundant instructions<br>
18.6697 (Â 5.2%)Â Â 0.2625 (Â 3.6%)Â 18.9323 (Â 5.2%)Â 18.9148 (Â 5.2%)<br>
 Combine redundant instructions<br>
16.1294 (Â 4.5%)Â Â 0.2320 (Â 3.2%)Â 16.3614 (Â 4.5%)Â 16.3434 (Â 4.5%)<br>
 Combine redundant instructions<br>
13.5476 (Â 3.8%)Â Â 0.3945 (Â 5.4%)Â 13.9421 (Â 3.8%)Â 13.9295 (Â 3.8%)<br>
 Combine redundant instructions<br>
13.1746 (Â 3.7%)Â Â 0.1767 (Â 2.4%)Â 13.3512 (Â 3.7%)Â 13.3405 (Â 3.7%)<br>
 Combine redundant instructions<br>
<br>
Dec 16th:<br>
<br>
real  20m10.734s<br>
user  20m8.523s<br>
sys   0m2.197s<br>
<br>
 208.8113 ( 17.6%)  0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 (<br>
17.5%)Â Value Propagation<br>
 179.6863 ( 15.2%)  0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 (<br>
15.1%)Â Value Propagation<br>
 92.0158 ( 7.8%)  0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 (<br>
7.7%)Â Combine redundant instructions<br>
 72.3330 ( 6.1%)  0.6026 ( 6.7%) 72.9356 ( 6.1%) 72.9210 (<br>
6.1%)Â Combine redundant instructions<br>
 72.2505 ( 6.1%)  0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 (<br>
6.1%)Â Combine redundant instructions<br>
 66.6765 ( 5.6%)  0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 (<br>
5.6%)Â Combine redundant instructions<br>
 65.5029 ( 5.5%)  0.4092 ( 4.5%) 65.9121 ( 5.5%) 65.8913 (<br>
5.5%)Â Combine redundant instructions<br>
 61.8355 ( 5.2%)  0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 (<br>
5.2%)Â Function Integration/Inlining<br>
 54.9184 ( 4.6%)  0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 (<br>
4.6%)Â Combine redundant instructions<br>
 50.2597 ( 4.2%)  0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 (<br>
4.2%)Â Combine redundant instructions<br>
 47.2597 ( 4.0%)  0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 (<br>
4.0%)Â Global Value Numbering<br>
<br>
I don't have an infrastructure to measure the runtime performance<br>
benefits/regression of clang, but I have for `game7`.<br>
I wasn't able to notice any fundamental speedup (at least, not<br>
something that justifies a 2x build-time).<br>
<br>
tl;dr:<br>
There are quite a few things to notice:<br>
1) GVN used to be the top pass in the middle-end, in some cases, and<br>
pretty much always in the top-3. This is not the case anymore, but<br>
it's still a pass where we spend a lot of time. This is being worked<br>
on by Daniel Berlin and me) <a href="https://reviews.llvm.org/D26224" rel="noreferrer" target="_blank">https://reviews.llvm.org/<wbr>D26224</a> so there's<br>
some hope that will be sorted out (or at least there's a plan for it).<br>
2) For clang, we spend 35% more time inside instcombine, and for game7<br>
instcombine seems to largely dominate the amount of time we spend<br>
optimizing IR. I tried to bisect (which is not easy considering the<br>
test takes a long time to run), but I wasn't able to identify a single<br>
point in time responsible for the regression. It seems to be an<br>
additive effect. My wild (or not so wild) guess is that every day<br>
somebody adds a matcher of two because that improves their testcase,<br>
and at some point all this things add up. I'll try to do some<br>
additional profiling but I guess large part of our time is spent<br>
solving bitwise-domain dataflow problems (ComputeKnownBits et<br>
similia). Once GVN will be in a more stable state, I plan to<br>
experiment with caching results.<br>
3) Something peculiar is that we spend 2x time in the inliner. I'm not<br>
familiar with the inliner, IIRC there were some changes to threshold<br>
recently, so any help here will be appreciated (both in reproducing<br>
the results and with analysis).<br>
4) For the last testcase (opt -O2 on large unoptimized chunk of<br>
bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I<br>
think it's not as lazy as it claims to be (or at least, the way we use<br>
it). This doesn't show up in a full LTO run because we don't run CVP<br>
as part of the default LTO pipeline, but the case shows how LVI can be<br>
a bottleneck for large TUs (or at least how -O2 compile time degraded<br>
on large TUs). I haven't thought about the problem very carefully, but<br>
there seems to be some progress on this front (<br>
<a href="https://llvm.org/bugs/show_bug.cgi?id=10584" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_<wbr>bug.cgi?id=10584</a>). I can't share the<br>
original bitcode file but I can probably do some profiling on it as<br>
well.<br></blockquote><div><br></div><div>LVI is one of those analyses with quadratic runtime, but has a cutoff to its search depth so that it is technically not quadratic. So increased inlining could easily exacerbate it more than non-"quadratic" passes. (increased inlining would also cause a general slowdown too).</div><div><br></div><div>-- Sean Silva</div><div>Â </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
As next steps I'll try to get a more detailed analysis of the<br>
problems. In particular, try to do what Rui did for lld but with more<br>
coarse granularity (every week) to have a chart of the compile time<br>
trend for these cases over the last 6 months, and post here.<br>
<br>
I think (I know) some people are aware of the problems I outline in<br>
this e-mail. But apparently not everybody. We're in a situation where<br>
compile time is increasing without real control. I'm happy that Apple<br>
is doing a serious effort to track build-time, so hopefully things<br>
will improve. There are, although, some cases (like adding matchers in<br>
instcombine or knobs) where the compile time regression is hard to<br>
track until it's too late. LLVM as a project tries not to stop people<br>
trying to get things done and that's great, but from time to time it's<br>
good to take a step back and re-evaluate approaches.<br>
The purpose of this e-mail was to outline where we regressed, for<br>
those interested.<br>
<br>
Thanks for your time, and of course, feedback welcome!<br>
<span class="HOEnZb"><font color="#888888"><br>
--<br>
Davide<br>
<br>
"There are no solved problems; there are only problems that are more<br>
or less solved" -- Henri Poincare<br>
______________________________<wbr>_________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>
</font></span></blockquote></div><br></div></div>