<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Dec 17, 2016 at 5:19 PM, Michael Gottesman via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5"><br>

> On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

><br>

> First of all, sorry for the long mail.<br>

> Inspired by the excellent analysis Rui did for lld, I decided to do<br>

> the same for llvm.<br>

> I'm personally very interested in build-time for LTO configuration,<br>

> with particular attention to the time spent in the optimizer.<br>

> Rafael did something similar back in March, so this can be considered<br>

> as an update. This tries to include a more accurate high-level<br>

> analysis of where llvm is spending CPU cycles.<br>

> Here I present 2 cases: clang building itself with `-flto` (Full), and<br>

> clang building an internal codebase which I'm going to refer as<br>

> `game7`.<br>

> It's a mid-sized program (it's actually a game), more or less of the<br>

> size of clang, which we use internally as benchmark to track<br>

> compile-time/runtime improvements/regression.<br>

> I picked two random revisions of llvm: trunk (December 16th 2016) and<br>

> trunk (June 2nd 2016), so, roughly, 6 months period.<br>

> My setup is a Mac Pro running Linux (NixOS).<br>

> These are the numbers I collected (including the output of -mllvm -time-passes).<br>

> For clang:<br>

><br>

> June 2nd:<br>

> real    22m9.278s<br>

> user    21m30.410s<br>

> sys     0m38.834s<br>

>  Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)<br>

>  289.8102 ( 23.5%)  18.8891 ( 53.7%)  308.6993 ( 24.3%)  308.6906 (<br>

> 24.3%)  X86 DAG->DAG Instruction Selection<br>

>  97.2730 (  7.9%)   0.7656 (  2.2%)  98.0386 (  7.7%)  98.0010 (<br>

> 7.7%)  Global Value Numbering<br>

>  62.4091 (  5.1%)   0.4779 (  1.4%)  62.8870 (  4.9%)  62.8665 (<br>

> 5.0%)  Function Integration/Inlining<br>

>  58.6923 (  4.8%)   0.4767 (  1.4%)  59.1690 (  4.7%)  59.1323 (<br>

> 4.7%)  Combine redundant instructions<br>

>  53.9602 (  4.4%)   0.6163 (  1.8%)  54.5765 (  4.3%)  54.5409 (<br>

> 4.3%)  Combine redundant instructions<br>

>  51.0470 (  4.1%)   0.5703 (  1.6%)  51.6173 (  4.1%)  51.5425 (<br>

> 4.1%)  Loop Strength Reduction<br>

>  47.4067 (  3.8%)   1.3040 (  3.7%)  48.7106 (  3.8%)  48.7034 (<br>

> 3.8%)  Greedy Register Allocator<br>

>  36.7463 (  3.0%)   0.8133 (  2.3%)  37.5597 (  3.0%)  37.4612 (<br>

> 3.0%)  Induction Variable Simplification<br>

>  37.0125 (  3.0%)   0.2699 (  0.8%)  37.2824 (  2.9%)  37.2478 (<br>

> 2.9%)  Combine redundant instructions<br>

>  34.2071 (  2.8%)   0.2737 (  0.8%)  34.4808 (  2.7%)  34.4487 (<br>

> 2.7%)  Combine redundant instructions<br>

>  25.6627 (  2.1%)   0.3215 (  0.9%)  25.9842 (  2.0%)  25.9509 (<br>

> 2.0%)  Combine redundant instructions<br>

><br>

> Dec 16th:<br>

> real    27m34.922s<br>

> user    26m53.489s<br>

> sys     0m41.533s<br>

><br>

>  287.5683 ( 18.5%)  19.7048 ( 52.3%)  307.2731 ( 19.3%)  307.2648 (<br>

> 19.3%)  X86 DAG->DAG Instruction Selection<br>

>  197.9211 ( 12.7%)   0.5104 (  1.4%)  198.4314 ( 12.5%)  198.4091 (<br>

> 12.5%)  Function Integration/Inlining<br>

>  106.9669 (  6.9%)   0.8316 (  2.2%)  107.7984 (  6.8%)  107.7633 (<br>

> 6.8%)  Global Value Numbering<br>

>  89.7571 (  5.8%)   0.4840 (  1.3%)  90.2411 (  5.7%)  90.2067 (<br>

> 5.7%)  Combine redundant instructions<br>

>  79.0456 (  5.1%)   0.7534 (  2.0%)  79.7990 (  5.0%)  79.7630 (<br>

> 5.0%)  Combine redundant instructions<br>

>  55.6393 (  3.6%)   0.3116 (  0.8%)  55.9509 (  3.5%)  55.9187 (<br>

> 3.5%)  Combine redundant instructions<br>

>  51.8663 (  3.3%)   1.4090 (  3.7%)  53.2754 (  3.3%)  53.2684 (<br>

> 3.3%)  Greedy Register Allocator<br>

>  52.5721 (  3.4%)   0.3021 (  0.8%)  52.8743 (  3.3%)  52.8416 (<br>

> 3.3%)  Combine redundant instructions<br>

>  49.0593 (  3.2%)   0.6101 (  1.6%)  49.6694 (  3.1%)  49.5904 (<br>

> 3.1%)  Loop Strength Reduction<br>

>  41.2602 (  2.7%)   0.9608 (  2.5%)  42.2209 (  2.7%)  42.1122 (<br>

> 2.6%)  Induction Variable Simplification<br>

>  38.1438 (  2.5%)   0.3486 (  0.9%)  38.4923 (  2.4%)  38.4603 (<br>

> 2.4%)  Combine redundant instructions<br>

><br>

> so, llvm is around 20% slower than it used to be.<br>

><br>

> For our internal codebase the situation seems slightly worse:<br>

><br>

> `game7`<br>

><br>

> June 2nd:<br>

><br>

> Total Execution Time: 464.3920 seconds (463.8986 wall clock)<br>

><br>

>  88.0204 ( 20.3%)   6.0310 ( 20.0%)  94.0514 ( 20.3%)  94.0473 (<br>

> 20.3%)  X86 DAG->DAG Instruction Selection<br>

>  27.4382 (  6.3%)  16.2437 ( 53.9%)  43.6819 (  9.4%)  43.6823 (<br>

> 9.4%)  X86 Assembly / Object Emitter<br>

>  34.9581 (  8.1%)   0.5274 (  1.8%)  35.4855 (  7.6%)  35.4679 (<br>

> 7.6%)  Function Integration/Inlining<br>

>  27.8556 (  6.4%)   0.3419 (  1.1%)  28.1975 (  6.1%)  28.1824 (<br>

> 6.1%)  Global Value Numbering<br>

>  22.1479 (  5.1%)   0.2258 (  0.7%)  22.3737 (  4.8%)  22.3593 (<br>

> 4.8%)  Combine redundant instructions<br>

>  19.2346 (  4.4%)   0.3639 (  1.2%)  19.5985 (  4.2%)  19.5870 (<br>

> 4.2%)  Post RA top-down list latency scheduler<br>

>  15.8085 (  3.6%)   0.2675 (  0.9%)  16.0760 (  3.5%)  16.0614 (<br>

> 3.5%)  Combine redundant instructions<br>

><br>

> Dec 16th:<br>

><br>

> Total Execution Time: 861.0898 seconds (860.5808 wall clock)<br>

><br>

>  135.7207 ( 15.7%)   0.2484 (  0.8%)  135.9692 ( 15.2%)  135.9531 (<br>

> 15.2%)  Combine redundant instructions<br>

>  103.6609 ( 12.0%)   0.4566 (  1.4%)  104.1175 ( 11.7%)  104.1014 (<br>

> 11.7%)  Combine redundant instructions<br>

>  97.1083 ( 11.3%)   6.9183 ( 21.8%)  104.0266 ( 11.6%)  104.0181 (<br>

> 11.6%)  X86 DAG->DAG Instruction Selection<br>

>  72.6125 (  8.4%)   0.1701 (  0.5%)  72.7826 (  8.1%)  72.7678 (<br>

> 8.1%)  Combine redundant instructions<br>

>  69.2144 (  8.0%)   0.6060 (  1.9%)  69.8204 (  7.8%)  69.8007 (<br>

> 7.8%)  Function Integration/Inlining<br>

>  60.7837 (  7.1%)   0.3783 (  1.2%)  61.1620 (  6.8%)  61.1455 (<br>

> 6.8%)  Global Value Numbering<br>

>  56.5650 (  6.6%)   0.1980 (  0.6%)  56.7630 (  6.4%)  56.7476 (<br>

> 6.4%)  Combine redundant instructions<br>

><br>

> so, using LTO, lld takes 2x to build what it used to take (and all the<br>

> extra time seems spent in the optimizer).<br>

><br>

> As an (extra) experiment, I decided to take the unoptimized output of<br>

> game7 (via lld -save-temps) and pass to -opt -O2. That shows another<br>

> significant regression (with different characteristics).<br>

><br>

> June 2nd:<br>

> time opt -O2<br>

> real    6m23.016s<br>

> user   6m20.900s<br>

> sys     0m2.113s<br>

><br>

> 35.9071 ( 10.0%)   0.7996 ( 10.9%)  36.7066 ( 10.0%)  36.6900 ( 10.1%)<br>

> Function Integration/Inlining<br>

> 33.4045 (  9.3%)   0.4053 (  5.5%)  33.8098 (  9.3%)  33.7919 (  9.3%)<br>

> Global Value Numbering<br>

> 27.1053 (  7.6%)   0.5940 (  8.1%)  27.6993 (  7.6%)  27.6995 (  7.6%)<br>

> Bitcode Writer<br>

> 25.6492 (  7.2%)   0.2491 (  3.4%)  25.8984 (  7.1%)  25.8805 (  7.1%)<br>

> Combine redundant instructions<br>

> 19.2686 (  5.4%)   0.2956 (  4.0%)  19.5642 (  5.4%)  19.5471 (  5.4%)<br>

> Combine redundant instructions<br>

> 18.6697 (  5.2%)   0.2625 (  3.6%)  18.9323 (  5.2%)  18.9148 (  5.2%)<br>

> Combine redundant instructions<br>

> 16.1294 (  4.5%)   0.2320 (  3.2%)  16.3614 (  4.5%)  16.3434 (  4.5%)<br>

> Combine redundant instructions<br>

> 13.5476 (  3.8%)   0.3945 (  5.4%)  13.9421 (  3.8%)  13.9295 (  3.8%)<br>

> Combine redundant instructions<br>

> 13.1746 (  3.7%)   0.1767 (  2.4%)  13.3512 (  3.7%)  13.3405 (  3.7%)<br>

> Combine redundant instructions<br>

><br>

> Dec 16th:<br>

><br>

> real    20m10.734s<br>

> user    20m8.523s<br>

> sys     0m2.197s<br>

><br>

>  208.8113 ( 17.6%)   0.1703 (  1.9%)  208.9815 ( 17.5%)  208.9698 (<br>

> 17.5%)  Value Propagation<br>

>  179.6863 ( 15.2%)   0.1215 (  1.3%)  179.8077 ( 15.1%)  179.7974 (<br>

> 15.1%)  Value Propagation<br>

>  92.0158 (  7.8%)   0.2674 (  3.0%)  92.2832 (  7.7%)  92.2613 (<br>

> 7.7%)  Combine redundant instructions<br>

>  72.3330 (  6.1%)   0.6026 (  6.7%)  72.9356 (  6.1%)  72.9210 (<br>

> 6.1%)  Combine redundant instructions<br>

>  72.2505 (  6.1%)   0.2167 (  2.4%)  72.4672 (  6.1%)  72.4539 (<br>

> 6.1%)  Combine redundant instructions<br>

>  66.6765 (  5.6%)   0.3482 (  3.9%)  67.0247 (  5.6%)  67.0040 (<br>

> 5.6%)  Combine redundant instructions<br>

>  65.5029 (  5.5%)   0.4092 (  4.5%)  65.9121 (  5.5%)  65.8913 (<br>

> 5.5%)  Combine redundant instructions<br>

>  61.8355 (  5.2%)   0.8150 (  9.0%)  62.6505 (  5.2%)  62.6315 (<br>

> 5.2%)  Function Integration/Inlining<br>

>  54.9184 (  4.6%)   0.3359 (  3.7%)  55.2543 (  4.6%)  55.2332 (<br>

> 4.6%)  Combine redundant instructions<br>

>  50.2597 (  4.2%)   0.2187 (  2.4%)  50.4784 (  4.2%)  50.4654 (<br>

> 4.2%)  Combine redundant instructions<br>

>  47.2597 (  4.0%)   0.3719 (  4.1%)  47.6316 (  4.0%)  47.6105 (<br>

> 4.0%)  Global Value Numbering<br>

><br>

> I don't have an infrastructure to measure the runtime performance<br>

> benefits/regression of clang, but I have for `game7`.<br>

> I wasn't able to notice any fundamental speedup (at least, not<br>

> something that justifies a 2x build-time).<br>

><br>

> tl;dr:<br>

> There are quite a few things to notice:<br>

> 1) GVN used to be the top pass in the middle-end, in some cases, and<br>

> pretty much always in the top-3. This is not the case anymore, but<br>

> it's still a pass where we spend a lot of time. This is being worked<br>

> on by Daniel Berlin and me) <a href="https://reviews.llvm.org/D26224" rel="noreferrer" target="_blank">https://reviews.llvm.org/<wbr>D26224</a> so there's<br>

> some hope that will be sorted out (or at least there's a plan for it).<br>

> 2) For clang, we spend 35% more time inside instcombine, and for game7<br>

> instcombine seems to largely dominate the amount of time we spend<br>

> optimizing IR. I tried to bisect (which is not easy considering the<br>

> test takes a long time to run), but I wasn't able to identify a single<br>

> point in time responsible for the regression. It seems to be an<br>

> additive effect. My wild (or not so wild) guess is that every day<br>

> somebody adds a matcher of two because that improves their testcase,<br>

> and at some point all this things add up. I'll try to do some<br>

> additional profiling but I guess large part of our time is spent<br>

> solving bitwise-domain dataflow problems (ComputeKnownBits et<br>

> similia). Once GVN will be in a more stable state, I plan to<br>

> experiment with caching results.<br>

<br>

</div></div>We have seen a similar thing when compiling the swift standard library. We have talked about making a small simple instcombine pass that doesn't iterate to a fixed point (IIRC). IIRC Andy/Mark looked at this (so my memory might be wrong).<br></blockquote><div><br></div><div>I would expect that if the slowdown was an increase in time spent iterating to a fixed point, we would also be matching more patterns and simplifying the code more, but the small performance delta on game7 doesn't seem to corroborate that.</div><div><br></div><div>One issue with instcombine is that its matching time (even if it doesn't find anything) is still O(# of patterns tested). That is very consistent with the slowdown model where as new checks are added, it gradually gets slower. Approaches like Nuno's Alive avoids this overhead by building a matching automaton. From the above, it sounds like we're spending 10's of percents of time in instcombine, so it may be worth it.</div><div><br></div><div>Without having thought about it too closely, my gut reaction is that I don't think it's a good idea to make an "instcombine V2"; incrementally improving the one we already have (e.g. moving a large portion of its matches into an automaton-based matcher; or conditionalizing (deleting?) checks that aren't worth it) seems like a better long-term approach.</div><div><br></div><div>-- Sean Silva</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="HOEnZb"><div class="h5"><br>

> 3) Something peculiar is that we spend 2x time in the inliner. I'm not<br>

> familiar with the inliner, IIRC there were some changes to threshold<br>

> recently, so any help here will be appreciated (both in reproducing<br>

> the results and with analysis).<br>

> 4) For the last testcase (opt -O2 on large unoptimized chunk of<br>

> bitcode) llvm spends 33% of its time in CVP, and very likely in LVI. I<br>

> think it's not as lazy as it claims to be (or at least, the way we use<br>

> it). This doesn't show up in a full LTO run because we don't run CVP<br>

> as part of the default LTO pipeline, but the case shows how LVI can be<br>

> a bottleneck for large TUs (or at least how -O2 compile time degraded<br>

> on large TUs). I haven't thought about the problem very carefully, but<br>

> there seems to be some progress on this front (<br>

> <a href="https://llvm.org/bugs/show_bug.cgi?id=10584" rel="noreferrer" target="_blank">https://llvm.org/bugs/show_<wbr>bug.cgi?id=10584</a>). I can't share the<br>

> original bitcode file but I can probably do some profiling on it as<br>

> well.<br>

><br>

> As next steps I'll try to get a more detailed analysis of the<br>

> problems. In particular, try to do what Rui did for lld but with more<br>

> coarse granularity (every week) to have a chart of the compile time<br>

> trend for these cases over the last 6 months, and post here.<br>

><br>

> I think (I know) some people are aware of the problems I outline in<br>

> this e-mail. But apparently not everybody. We're in a situation where<br>

> compile time is increasing without real control. I'm happy that Apple<br>

> is doing a serious effort to track build-time, so hopefully things<br>

> will improve. There are, although, some cases (like adding matchers in<br>

> instcombine or knobs) where the compile time regression is hard to<br>

> track until it's too late. LLVM as a project tries not to stop people<br>

> trying to get things done and that's great, but from time to time it's<br>

> good to take a step back and re-evaluate approaches.<br>

> The purpose of this e-mail was to outline where we regressed, for<br>

> those interested.<br>

><br>

> Thanks for your time, and of course, feedback welcome!<br>

><br>

> --<br>

> Davide<br>

><br>

> "There are no solved problems; there are only problems that are more<br>

> or less solved" -- Henri Poincare<br>

> ______________________________<wbr>_________________<br>

> LLVM Developers mailing list<br>

> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

<br>

______________________________<wbr>_________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/<wbr>mailman/listinfo/llvm-dev</a><br>

</div></div></blockquote></div><br></div></div>