<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class=""><br class=""><div><blockquote type="cite" class=""><div class="">On Dec 17, 2016, at 6:53 PM, Davide Italiano <<a href="mailto:davide@freebsd.org" class="">davide@freebsd.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">On Sat, Dec 17, 2016 at 6:39 PM, Mehdi Amini <</span><a href="mailto:mehdi.amini@apple.com" style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class="">mehdi.amini@apple.com</a><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">> wrote:</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><blockquote type="cite" style="font-family: Helvetica; 
font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px;" class=""><br class=""><blockquote type="cite" class="">On Dec 17, 2016, at 1:35 PM, Davide Italiano via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" class="">llvm-dev@lists.llvm.org</a>> wrote:<br class=""><br class="">First of all, sorry for the long mail.<br class="">Inspired by the excellent analysis Rui did for lld, I decided to do<br class="">the same for llvm.<br class="">I'm personally very interested in build time for LTO configurations,<br class="">with particular attention to the time spent in the optimizer.<br class="">Rafael did something similar back in March, so this can be considered<br class="">an update. This tries to include a more accurate high-level<br class="">analysis of where llvm is spending CPU cycles.<br class="">Here I present 2 cases: clang building itself with `-flto` (Full), and<br class="">clang building an internal codebase which I'm going to refer to as<br class="">`game7`.<br class="">It's a mid-sized program (it's actually a game), more or less the<br class="">size of clang, which we use internally as a benchmark to track<br class="">compile-time/runtime improvements/regressions.<br class="">I picked two random revisions of llvm: trunk (December 16th 2016) and<br class="">trunk (June 2nd 2016), so, roughly, a 6-month period.<br class="">My setup is a Mac Pro running Linux (NixOS).<br class="">These are the numbers I collected (including the output of -mllvm -time-passes).<br class="">For clang:<br class=""><br class="">June 2nd:<br class="">real 22m9.278s<br class="">user 21m30.410s<br class="">sys 0m38.834s<br class="">Total Execution Time: 1270.4795 seconds (1269.1330 wall clock)<br class="">289.8102 ( 23.5%) 18.8891 ( 53.7%) 
308.6993 ( 24.3%) 308.6906 (<br class="">24.3%) X86 DAG->DAG Instruction Selection<br class="">97.2730 ( 7.9%) 0.7656 ( 2.2%) 98.0386 ( 7.7%) 98.0010 (<br class="">7.7%) Global Value Numbering<br class="">62.4091 ( 5.1%) 0.4779 ( 1.4%) 62.8870 ( 4.9%) 62.8665 (<br class="">5.0%) Function Integration/Inlining<br class="">58.6923 ( 4.8%) 0.4767 ( 1.4%) 59.1690 ( 4.7%) 59.1323 (<br class="">4.7%) Combine redundant instructions<br class="">53.9602 ( 4.4%) 0.6163 ( 1.8%) 54.5765 ( 4.3%) 54.5409 (<br class="">4.3%) Combine redundant instructions<br class="">51.0470 ( 4.1%) 0.5703 ( 1.6%) 51.6173 ( 4.1%) 51.5425 (<br class="">4.1%) Loop Strength Reduction<br class="">47.4067 ( 3.8%) 1.3040 ( 3.7%) 48.7106 ( 3.8%) 48.7034 (<br class="">3.8%) Greedy Register Allocator<br class="">36.7463 ( 3.0%) 0.8133 ( 2.3%) 37.5597 ( 3.0%) 37.4612 (<br class="">3.0%) Induction Variable Simplification<br class="">37.0125 ( 3.0%) 0.2699 ( 0.8%) 37.2824 ( 2.9%) 37.2478 (<br class="">2.9%) Combine redundant instructions<br class="">34.2071 ( 2.8%) 0.2737 ( 0.8%) 34.4808 ( 2.7%) 34.4487 (<br class="">2.7%) Combine redundant instructions<br class="">25.6627 ( 2.1%) 0.3215 ( 0.9%) 25.9842 ( 2.0%) 25.9509 (<br class="">2.0%) Combine redundant instructions<br class=""><br class="">Dec 16th:<br class="">real 27m34.922s<br class="">user 26m53.489s<br class="">sys 0m41.533s<br class=""><br class="">287.5683 ( 18.5%) 19.7048 ( 52.3%) 307.2731 ( 19.3%) 307.2648 (<br class="">19.3%) X86 DAG->DAG Instruction Selection<br class="">197.9211 ( 12.7%) 0.5104 ( 1.4%) 198.4314 ( 12.5%) 198.4091 (<br class="">12.5%) Function Integration/Inlining<br class="">106.9669 ( 6.9%) 0.8316 ( 2.2%) 107.7984 ( 6.8%) 107.7633 (<br class="">6.8%) Global Value Numbering<br class="">89.7571 ( 5.8%) 0.4840 ( 1.3%) 90.2411 ( 5.7%) 90.2067 (<br class="">5.7%) Combine redundant instructions<br class="">79.0456 ( 5.1%) 0.7534 ( 2.0%) 79.7990 ( 5.0%) 79.7630 (<br class="">5.0%) Combine redundant instructions<br class="">55.6393 ( 
3.6%) 0.3116 ( 0.8%) 55.9509 ( 3.5%) 55.9187 (<br class="">3.5%) Combine redundant instructions<br class="">51.8663 ( 3.3%) 1.4090 ( 3.7%) 53.2754 ( 3.3%) 53.2684 (<br class="">3.3%) Greedy Register Allocator<br class="">52.5721 ( 3.4%) 0.3021 ( 0.8%) 52.8743 ( 3.3%) 52.8416 (<br class="">3.3%) Combine redundant instructions<br class="">49.0593 ( 3.2%) 0.6101 ( 1.6%) 49.6694 ( 3.1%) 49.5904 (<br class="">3.1%) Loop Strength Reduction<br class="">41.2602 ( 2.7%) 0.9608 ( 2.5%) 42.2209 ( 2.7%) 42.1122 (<br class="">2.6%) Induction Variable Simplification<br class="">38.1438 ( 2.5%) 0.3486 ( 0.9%) 38.4923 ( 2.4%) 38.4603 (<br class="">2.4%) Combine redundant instructions<br class=""><br class="">so, llvm is around 20% slower than it used to be.<br class=""><br class="">For our internal codebase the situation seems slightly worse:<br class=""><br class="">`game7`<br class=""><br class="">June 2nd:<br class=""><br class="">Total Execution Time: 464.3920 seconds (463.8986 wall clock)<br class=""><br class="">88.0204 ( 20.3%) 6.0310 ( 20.0%) 94.0514 ( 20.3%) 94.0473 (<br class="">20.3%) X86 DAG->DAG Instruction Selection<br class="">27.4382 ( 6.3%) 16.2437 ( 53.9%) 43.6819 ( 9.4%) 43.6823 (<br class="">9.4%) X86 Assembly / Object Emitter<br class="">34.9581 ( 8.1%) 0.5274 ( 1.8%) 35.4855 ( 7.6%) 35.4679 (<br class="">7.6%) Function Integration/Inlining<br class="">27.8556 ( 6.4%) 0.3419 ( 1.1%) 28.1975 ( 6.1%) 28.1824 (<br class="">6.1%) Global Value Numbering<br class="">22.1479 ( 5.1%) 0.2258 ( 0.7%) 22.3737 ( 4.8%) 22.3593 (<br class="">4.8%) Combine redundant instructions<br class="">19.2346 ( 4.4%) 0.3639 ( 1.2%) 19.5985 ( 4.2%) 19.5870 (<br class="">4.2%) Post RA top-down list latency scheduler<br class="">15.8085 ( 3.6%) 0.2675 ( 0.9%) 16.0760 ( 3.5%) 16.0614 (<br class="">3.5%) Combine redundant instructions<br class=""><br class="">Dec 16th:<br class=""><br class="">Total Execution Time: 861.0898 seconds (860.5808 wall clock)<br class=""><br class="">135.7207 ( 
15.7%) 0.2484 ( 0.8%) 135.9692 ( 15.2%) 135.9531 (<br class="">15.2%) Combine redundant instructions<br class="">103.6609 ( 12.0%) 0.4566 ( 1.4%) 104.1175 ( 11.7%) 104.1014 (<br class="">11.7%) Combine redundant instructions<br class="">97.1083 ( 11.3%) 6.9183 ( 21.8%) 104.0266 ( 11.6%) 104.0181 (<br class="">11.6%) X86 DAG->DAG Instruction Selection<br class="">72.6125 ( 8.4%) 0.1701 ( 0.5%) 72.7826 ( 8.1%) 72.7678 (<br class="">8.1%) Combine redundant instructions<br class="">69.2144 ( 8.0%) 0.6060 ( 1.9%) 69.8204 ( 7.8%) 69.8007 (<br class="">7.8%) Function Integration/Inlining<br class="">60.7837 ( 7.1%) 0.3783 ( 1.2%) 61.1620 ( 6.8%) 61.1455 (<br class="">6.8%) Global Value Numbering<br class="">56.5650 ( 6.6%) 0.1980 ( 0.6%) 56.7630 ( 6.4%) 56.7476 (<br class="">6.4%) Combine redundant instructions<br class=""><br class="">so, using LTO, lld takes roughly 2x as long to build as it used to (and all the<br class="">extra time seems to be spent in the optimizer).<br class=""><br class="">As an (extra) experiment, I decided to take the unoptimized output of<br class="">game7 (via lld -save-temps) and pass it to opt -O2. 
That shows another<br class="">significant regression (with different characteristics).<br class=""><br class="">June 2nd:<br class="">time opt -O2<br class="">real 6m23.016s<br class="">user 6m20.900s<br class="">sys 0m2.113s<br class=""><br class="">35.9071 ( 10.0%) 0.7996 ( 10.9%) 36.7066 ( 10.0%) 36.6900 ( 10.1%)<br class="">Function Integration/Inlining<br class="">33.4045 ( 9.3%) 0.4053 ( 5.5%) 33.8098 ( 9.3%) 33.7919 ( 9.3%)<br class="">Global Value Numbering<br class="">27.1053 ( 7.6%) 0.5940 ( 8.1%) 27.6993 ( 7.6%) 27.6995 ( 7.6%)<br class="">Bitcode Writer<br class="">25.6492 ( 7.2%) 0.2491 ( 3.4%) 25.8984 ( 7.1%) 25.8805 ( 7.1%)<br class="">Combine redundant instructions<br class="">19.2686 ( 5.4%) 0.2956 ( 4.0%) 19.5642 ( 5.4%) 19.5471 ( 5.4%)<br class="">Combine redundant instructions<br class="">18.6697 ( 5.2%) 0.2625 ( 3.6%) 18.9323 ( 5.2%) 18.9148 ( 5.2%)<br class="">Combine redundant instructions<br class="">16.1294 ( 4.5%) 0.2320 ( 3.2%) 16.3614 ( 4.5%) 16.3434 ( 4.5%)<br class="">Combine redundant instructions<br class="">13.5476 ( 3.8%) 0.3945 ( 5.4%) 13.9421 ( 3.8%) 13.9295 ( 3.8%)<br class="">Combine redundant instructions<br class="">13.1746 ( 3.7%) 0.1767 ( 2.4%) 13.3512 ( 3.7%) 13.3405 ( 3.7%)<br class="">Combine redundant instructions<br class=""><br class="">Dec 16th:<br class=""><br class="">real 20m10.734s<br class="">user 20m8.523s<br class="">sys 0m2.197s<br class=""><br class="">208.8113 ( 17.6%) 0.1703 ( 1.9%) 208.9815 ( 17.5%) 208.9698 (<br class="">17.5%) Value Propagation<br class="">179.6863 ( 15.2%) 0.1215 ( 1.3%) 179.8077 ( 15.1%) 179.7974 (<br class="">15.1%) Value Propagation<br class="">92.0158 ( 7.8%) 0.2674 ( 3.0%) 92.2832 ( 7.7%) 92.2613 (<br class="">7.7%) Combine redundant instructions<br class="">72.3330 ( 6.1%) 0.6026 ( 6.7%) 72.9356 ( 6.1%) 72.9210 (<br class="">6.1%) Combine redundant instructions<br class="">72.2505 ( 6.1%) 0.2167 ( 2.4%) 72.4672 ( 6.1%) 72.4539 (<br class="">6.1%) Combine redundant 
instructions<br class="">66.6765 ( 5.6%) 0.3482 ( 3.9%) 67.0247 ( 5.6%) 67.0040 (<br class="">5.6%) Combine redundant instructions<br class="">65.5029 ( 5.5%) 0.4092 ( 4.5%) 65.9121 ( 5.5%) 65.8913 (<br class="">5.5%) Combine redundant instructions<br class="">61.8355 ( 5.2%) 0.8150 ( 9.0%) 62.6505 ( 5.2%) 62.6315 (<br class="">5.2%) Function Integration/Inlining<br class="">54.9184 ( 4.6%) 0.3359 ( 3.7%) 55.2543 ( 4.6%) 55.2332 (<br class="">4.6%) Combine redundant instructions<br class="">50.2597 ( 4.2%) 0.2187 ( 2.4%) 50.4784 ( 4.2%) 50.4654 (<br class="">4.2%) Combine redundant instructions<br class="">47.2597 ( 4.0%) 0.3719 ( 4.1%) 47.6316 ( 4.0%) 47.6105 (<br class="">4.0%) Global Value Numbering<br class=""><br class="">I don't have infrastructure to measure the runtime performance<br class="">benefits/regressions of clang, but I do for `game7`.<br class="">I wasn't able to notice any fundamental speedup (at least, not<br class="">something that justifies a 2x increase in build time).<br class=""><br class="">tl;dr:<br class="">There are quite a few things to notice:<br class="">1) GVN used to be the top pass in the middle-end, in some cases, and<br class="">pretty much always in the top-3. This is not the case anymore, but<br class="">it's still a pass where we spend a lot of time. This is being worked<br class="">on by Daniel Berlin and me (<a href="https://reviews.llvm.org/D26224" class="">https://reviews.llvm.org/D26224</a>), so there's<br class="">some hope that will be sorted out (or at least there's a plan for it).<br class="">2) For clang, we spend 35% more time inside instcombine, and for game7<br class="">instcombine seems to largely dominate the amount of time we spend<br class="">optimizing IR. 
I tried to bisect (which is not easy considering the<br class="">test takes a long time to run), but I wasn't able to identify a single<br class="">point in time responsible for the regression.<br class=""></blockquote><br class="">An efficient way to bisect this is to:<br class=""><br class="">1) Dump the IR right before instcombine, then run only opt -instcombine and confirm the regression shows up.<br class="">2) Then reduce the input: you should ultimately be able to isolate a single function (maybe with bugpoint or with -opt-bisect-limit).<br class="">3) With a single function that shows the regression, it should be fairly easy to plot the time to run opt -instcombine for almost every revision between June and now.<br class=""><br class=""></blockquote><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">I tried 1) and I'm able to reproduce the increase in compile time. 
2)</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">is on my todolist. I plan to use (and I can see how you can use)</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">bugpoint or delta (with `ulimit`), but it's not entirely clear to me</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; 
text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">how to reduce using -opt-bisect-limit. As far as I know that just runs</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">passes up to a given point of the pipeline, while here the regression</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">shows up also with a single pass, i.e. opt -instcombine. 
Can you</span><br style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;" class=""><span style="font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px; float: none; display: inline !important;" class="">please elaborate?</span></div></blockquote><br class=""></div><div>Unless I’m mistaken, -opt-bisect-limit operates per execution of a pass. So even if you schedule a single pass, it should still be useful: it bisects over the executions of that pass on each function, which may help find the function in the module that causes the largest increase.</div><div><br class=""></div><div>— </div><div>Mehdi</div><div><br class=""></div></body></html>