<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Mar 9, 2016 at 12:38 PM, Xinliang David Li <span dir="ltr"><<a href="mailto:xinliangli@gmail.com" target="_blank">xinliangli@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr">The LTO time could be explained by a second-order effect: increased dcache/dtlb pressure from the larger memory footprint and poor locality.</div></blockquote><div><br></div><div>Actually thinking more about this, I was totally wrong. Mehdi said that we LTO ~56 binaries. If we naively assume that each binary is like clang and links in "everything" and that the LTO process takes CPU time equivalent to "-O3 for every TU", then we would expect that *for each binary* we would see +33% (total increase >1800% vs Release). Clearly that is not happening since the actual overhead is only 50%-100%, so we need a more refined explanation.</div><div><br></div><div>There are a few factors that I can think of:</div><div>a) there are 56 binaries being LTO'd (this will tend to increase our estimate)</div><div>b) not all 56 binaries are the size of clang (this will tend to decrease our estimate)</div><div><div>c) per-TU processing does only mid-level optimizations and no codegen (this will tend to decrease our estimate)</div></div><div>d) IR seen during LTO has already been "cleaned up", so it is smaller overall and fewer optimizations will apply during the LTO process (this will tend to decrease our estimate)<br></div><div>e) comdat folding in the linker means that we only codegen each comdat function once (this will tend to decrease our estimate)<br></div><div><br></div><div>Starting from a (normalized) release build with</div><div>releaseBackend = .33</div><div>releaseFrontend = .67</div><div>release = releaseBackend + releaseFrontend  = 1</div><div><br></div><div>Let us try to 
obtain</div><div>LTO = (some expression involving releaseFrontend and releaseBackend) = 1.5-2</div><div><br></div><div>For starters, let us apply a), with a naive assumption that for each of the numBinaries = 52 binaries we add the cost of releaseBackend (I just checked and 52 is the exact number for LLVM+Clang+LLD+clang-tools-extra, ignoring symlinks). This gives</div><div>LTO = release + 52 * releaseBackend = 18.16, which is way too high.</div><div><br></div><div>Let us apply b). A quick check gives 371,515,392 total bytes of text in a release build across all 52 binaries (Mac, x86_64). Clang is 45,182,976 bytes of text. So using final text size in Release as an indicator of the total code seen by the LTO process, we can use a coefficient of 1/8, i.e. the average binary links in about avgTextFraction = 1/8 of "everything".</div><div>LTO = release + 52 * (.125 * releaseBackend) = 3.14<br></div><div><br></div><div>We are still high. For c), let us assume that half of releaseBackend is spent after mid-level optimizations. So let codegenFraction = .5 be the fraction of releaseBackend that is spent after mid-level optimizations. We can discount this time from the LTO build since it does not do that work per-TU.</div><div>LTO = release + 52 * (.125 * releaseBackend) - (codegenFraction * releaseBackend) = 2.98<br></div><div>So this is not a significant reduction.</div><div><br></div><div>I don't have a reasonable estimate a priori for d) or e), but altogether they reduce to a constant factor otherSavingsFraction that multiplies the second term</div><div>LTO = release + 52 * (.125 * otherSavingsFraction * releaseBackend) - (codegenFraction * releaseBackend) =? 
1.5-2x<br></div><div><br></div><div>Given the empirical data, this suggests that otherSavingsFraction must have a value around 1/2, which seems reasonable.</div><div><br></div><div>For a moment I was rather surprised that we could have 52 binaries and it would be only 2x, but this closer examination shows that between avgTextFraction = .125 and releaseBackend = .33 the "52" is brought under control.</div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div dir="ltr"><span class=""><font color="#888888"><div><br></div><div>David</div></font></span></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Tue, Mar 8, 2016 at 5:47 PM, Sean Silva via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><div class="h5"><div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote"><div><div>On Tue, Mar 8, 2016 at 2:25 PM, Mehdi Amini <span dir="ltr"><<a href="mailto:mehdi.amini@apple.com" target="_blank">mehdi.amini@apple.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><div><div><br><div><blockquote type="cite"><div>On Mar 8, 2016, at 1:09 PM, Sean Silva via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:</div><br><div><div dir="ltr" 
style="font-family:Helvetica;font-size:12px;font-style:normal;font-variant:normal;font-weight:normal;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px"><div class="gmail_extra"><br><br><div class="gmail_quote">On Tue, Mar 8, 2016 at 10:42 AM, Richard Smith via llvm-dev<span> </span><span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span><span> </span>wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><span>On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola<br><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br>> I have just benchmarked building trunk llvm and clang in Debug,<br>> Release and LTO modes (see the attached script for the cmake lines).<br>><br>> The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all<br>> cases I used the system libgcc and libstdc++.<br>><br>> For release builds there is a monotonic increase in each version. From<br>> 163 minutes with 3.5 to 212 minutes with trunk. 
For comparison, gcc<br>> 5.3.2 takes 205 minutes.<br>><br>> Debug and LTO show an improvement in 3.7, but have regressed again in 3.8.<br><br></span>I'm curious how these times divide across Clang and various parts of<br>LLVM; rerunning with -ftime-report and summing the numbers across all<br>compiles could be interesting.<br></blockquote><div><br></div><div>Based on the results I posted upthread about the relative time spent in the backend for debug vs release, we can estimate this.</div><div>To summarize:</div><div>10% of time spent in LLVM for Debug</div><div>33% of time spent in LLVM for Release</div><div>(I'll abbreviate "in LLVM" as just "backend"; this is "backend" from clang's perspective)</div><div><br></div><div>Let's look at the difference between 3.5 and trunk.</div><div><br></div><div>For debug, the user time jumps from 174m50.251s to 197m9.932s.</div><div>That's {10490.3, 11829.9} seconds, respectively.</div><div>For release, the corresponding numbers are:</div><div>{9826.71, 12714.3} seconds.<br></div><div><br></div><div>debug35 = 10490.251<br></div><div>debugTrunk = 11829.932<br></div><div><br></div><div>debugTrunk/debug35 == 1.12771<br></div><div>debugRatio = 1.12771</div><div><br></div><div>release35 = 9826.705<br></div><div>releaseTrunk = 12714.288<br></div><div><br></div><div>releaseTrunk/release35 == 1.29385<br></div><div>releaseRatio = 1.29385</div><div><br></div><div>For simplicity, let's use a linear model for the distribution of slowdown between the frontend and backend: a constant factor slowdown for the backend, and an independent constant factor slowdown for the frontend. 
This gives the following linear system:</div><div>debugRatio = .1 * backendRatio + (1 - .1) * frontendRatio</div><div>releaseRatio = .33 * backendRatio + (1 - .33) * frontendRatio</div><div><br></div><div>Solving this linear system we find that under this simple model, the expected slowdown factors are:</div><div>backendRatio = 1.77783<br></div><div>frontendRatio = 1.05547</div><div><br></div><div>Intuitively, backendRatio comes out larger in this comparison because we see the biggest slowdown during release (1.29 vs 1.13), and during release we are spending a larger fraction of time in the backend (33% vs 10%).</div><div><br></div><div>Applying this same model across Rafael's data, we find the following (numbers have been rounded for clarity):</div><div><br></div><div><div><font face="monospace, monospace">transition       backendRatio   frontendRatio</font></div><div><font face="monospace, monospace">3.5->3.6         1.08           1.03</font></div><div><font face="monospace, monospace">3.6->3.7         1.30           0.95</font></div><div><font face="monospace, monospace">3.7->3.8         1.34           1.07</font></div><div><font face="monospace, monospace">3.8->trunk       0.98           1.02</font><span style="font-family:monospace,monospace">                </span></div></div><div><div><br></div></div><div><div>Note that in Rafael's measurements LTO is pretty similar to Release from a CPU time (user time) standpoint. While the final LTO link takes a large amount of real time, it is single threaded. Based on the real time numbers the LTO link was only spending about 20 minutes single-threaded (i.e. about 20 minutes CPU time), which is pretty small compared to the 300-400 minutes of total CPU time. 
It would be interesting to see the numbers for -O0 or -O1 per-TU together with LTO.</div></div></div></div></div></div></blockquote><div><br></div><div><br></div></div></div></div>Just a note about LTO being sequential: Rafael mentioned he was "building trunk llvm and clang". By default I believe it is ~56 link targets that can be run in parallel (provided you have enough RAM to avoid swapping).</div></blockquote><div><br></div></div></div><div>D'oh! I was looking at the data wrong since I broke my Fundamental Rule of Looking At Data, namely: don't look at raw numbers in a table since you are likely to look at things wrong or form biases based on the order in which you look at the data points; *always* visualize. There is a significant difference between release and LTO. About 2x consistently.</div><div><br></div><div><img src="cid:ii_15358fe40a6fd5cd" alt="Inline image 3" style="margin-right: 25px;"><br></div><div><br></div><div>This is actually curious because during the release build, we were spending 33% of CPU time in the backend (as clang sees it; i.e. mid-level optimizer and codegen). This data is inconsistent with LTO simply being another run through the backend (which would be just +33% CPU time at worst). 
There seems to be something nonlinear happening.</div><div>To make matters worse, the LTO build has approximately a full Release optimization running per-TU, so the actual LTO step should be seeing inlined/"cleaned up" IR which should be much smaller than what the per-TU optimizer is seeing, so naively it should take *even less* than another "33% CPU time" chunk.</div><div>Yet we see a 1.5x-2x difference:</div><div><br></div><div><img src="cid:ii_153590d95038676b" alt="Inline image 4" style="margin-right: 25px;"><br></div><div><br></div><div>-- Sean Silva</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div style="word-wrap:break-word"><span><font color="#888888"><div><br></div><span><font color="#888888"><div>-- </div><div>Mehdi</div><div><br></div></font></span></font></span></div></blockquote></div><br></div></div>

<br></div></div><span class="">_______________________________________________<br>

LLVM Developers mailing list<br>

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>

<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

<br></span></blockquote></div><br></div>

</blockquote></div><br></div></div>