[llvm-dev] llvm and clang are getting slower

Xinliang David Li via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 9 14:31:47 PST 2016


On Wed, Mar 9, 2016 at 1:55 PM, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Wed, Mar 9, 2016 at 12:38 PM, Xinliang David Li <xinliangli at gmail.com>
> wrote:
>
>> The LTO time could be explained by a second-order effect: increased
>> dcache/dtlb pressure caused by the larger memory footprint and poor locality.
>>
>
> Actually thinking more about this, I was totally wrong. Mehdi said that we
> LTO ~56 binaries. If we naively assume that each binary is like clang and
> links in "everything" and that the LTO process takes CPU time equivalent to
> "-O3 for every TU", then we would expect that *for each binary* we would
> see +33% (total increase >1800% vs Release). Clearly that is not happening
> since the actual overhead is only 50%-100%, so we need a more refined
> explanation.
>
> There are a couple factors that I can think of.
> a) there are 56 binaries being LTO'd (this will tend to increase our
> estimate)
> b) not all 56 binaries are the size of clang (this will tend to decrease
> our estimate)
> c) per-TU processing only is doing mid-level optimizations and no codegen
> (this will tend to decrease our estimate)
> d) IR seen during LTO has already been "cleaned up" and has less overall
> size/amount of optimizations that will apply during the LTO process (this
> will tend to decrease our estimate)
> e) comdat folding in the linker means that we only codegen (this will tend
> to decrease our estimate)
>
> Starting from a (normalized) release build with
> releaseBackend = .33
> releaseFrontend = .67
> release = releaseBackend + releaseFrontend  = 1
>
> Let us try to obtain
> LTO = (some expression involving releaseFrontend and releaseBackend) =
> 1.5-2
>
> For starters, let us apply a), with a naive assumption that for each of
> the numBinaries = 52 binaries we add the cost of releaseBackend (I just
> checked and 52 is the exact number for LLVM+Clang+LLD+clang-tools-extra,
> ignoring symlinks). This gives
> LTO = release + 52 * releaseBackend = 18.16, which is way high.
>

Some bitcode .o files (such as those in the support libs) are linked into
more than one target, but not all .o files are. Suppose the average
duplication factor is DupFactor; then the LTO time can be approximated by

LTO = releaseFrontend + DupFactor * releaseBackend

Now consider comdat elimination, and let DedupFactor be the ratio of the
total number of unique functions to the total number of functions produced
by the FE. The LTO time is then approximated by:

LTO = releaseFrontend + DupFactor * DedupFactor * releaseBackend
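
As a rough illustration of the formula above, here is a minimal Python sketch. The function names and sample inputs are assumptions for illustration only. In particular, DupFactor is approximated as 52 * .125 = 6.5 (the binary count and average text fraction quoted below), which is a stand-in rather than a measured duplication factor, and no measured DedupFactor is available, so the sketch solves for the value that would reproduce the observed 1.5x-2x.

```python
def lto_estimate(release_frontend, release_backend, dup_factor, dedup_factor=1.0):
    """LTO = releaseFrontend + DupFactor * DedupFactor * releaseBackend."""
    return release_frontend + dup_factor * dedup_factor * release_backend


def required_dedup_factor(target_lto, release_frontend, release_backend, dup_factor):
    """Invert the formula: the DedupFactor that would yield target_lto."""
    return (target_lto - release_frontend) / (dup_factor * release_backend)


# Normalized release build split, taken from the thread.
release_frontend = 0.67
release_backend = 0.33

# ASSUMPTION: stand-in for DupFactor, not a measured value.
dup_factor = 52 * 0.125

for target in (1.5, 2.0):
    dedup = required_dedup_factor(target, release_frontend, release_backend, dup_factor)
    print(f"LTO = {target}: DedupFactor would need to be about {dedup:.2f}")
```

With DupFactor = 1 and DedupFactor = 1 the estimate collapses to the plain release build (normalized to 1), which is a quick sanity check on the model.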

David



>
> Let us apply b). A quick check gives 371,515,392 total bytes of text in a
> release build across all 52 binaries (Mac, x86_64). Clang is 45,182,976
> bytes of text. So using final text size in Release as an indicator of the
> total code seen by the LTO process, we can use a coefficient of 1/8, i.e.
> the average binary links in about avgTextFraction = 1/8 of "everything".
> LTO = release + 52 * (.125 * releaseBackend) = 3.14
>
> We are still high. For c), let us assume that half of releaseBackend is
> spent after mid-level optimizations. So let codegenFraction = .5 be the
> fraction of releaseBackend that is spent after mid-level optimizations. We
> can discount this time from the LTO build since it does not do that work
> per-TU.
> LTO = release + 52 * (.125 * releaseBackend) - (codegenFraction *
> releaseBackend) = 2.98
> So this is not a significant reduction.
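
The successive estimates for a), b) and c) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic quoted above; the variable names are mine, and the inputs (52 binaries, avgTextFraction = 1/8, codegenFraction = .5) are the ones given in the thread. Note that with these inputs step a) evaluates to 1 + 52 * .33 = 18.16.

```python
# Normalized release build, as defined in the thread.
release_backend = 0.33
release_frontend = 0.67
release = release_backend + release_frontend

num_binaries = 52          # LLVM+Clang+LLD+clang-tools-extra, per the thread
avg_text_fraction = 0.125  # average binary links in ~1/8 of "everything"
codegen_fraction = 0.5     # assumed share of releaseBackend spent in codegen

# a) every binary pays the full backend cost again
step_a = release + num_binaries * release_backend

# b) scale the per-binary cost by the average text fraction
step_b = release + num_binaries * (avg_text_fraction * release_backend)

# c) discount the codegen work that is not done per-TU in an LTO build
step_c = step_b - codegen_fraction * release_backend

for label, value in (("a", step_a), ("b", step_b), ("c", step_c)):
    print(f"after {label}): LTO ~ {value:.3f}")
```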
>
> I don't have a reasonable estimate a priori for d) or e), but altogether
> they reduce to a constant factor otherSavingsFraction that multiplies the
> second term
> LTO = release + 52 * (.125 * otherSavingsFraction * releaseBackend) -
> (codegenFraction * releaseBackend) =? 1.5-2x
>
> Given the empirical data, this suggests that otherSavingsFraction must
> have a value around 1/2, which seems reasonable.
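
To see which values of otherSavingsFraction are compatible with the observed 1.5x-2x, the last formula can be inverted. The Python sketch below does only that; the function name is hypothetical and the inputs are the ones quoted above. With these inputs the result comes out at roughly 0.31 for LTO = 1.5 and roughly 0.54 for LTO = 2, so "around 1/2" corresponds to the upper end of the observed range.

```python
def other_savings_fraction(target_lto, release=1.0, release_backend=0.33,
                           num_binaries=52, avg_text_fraction=0.125,
                           codegen_fraction=0.5):
    """Solve
        LTO = release
              + num_binaries * (avg_text_fraction * s * release_backend)
              - codegen_fraction * release_backend
    for s (otherSavingsFraction)."""
    numerator = target_lto - release + codegen_fraction * release_backend
    denominator = num_binaries * avg_text_fraction * release_backend
    return numerator / denominator


low = other_savings_fraction(1.5)
high = other_savings_fraction(2.0)
print(f"otherSavingsFraction between {low:.2f} and {high:.2f}")
```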
>
> For a moment I was rather surprised that we could have 52 binaries and it
> would be only 2x, but this closer examination shows that between
> avgTextFraction = .125 and releaseBackend = .33 the "52" is brought under
> control.
>
> -- Sean Silva
>
>
>>
>> David
>>
>> On Tue, Mar 8, 2016 at 5:47 PM, Sean Silva via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>
>>>
>>> On Tue, Mar 8, 2016 at 2:25 PM, Mehdi Amini <mehdi.amini at apple.com>
>>> wrote:
>>>
>>>>
>>>> On Mar 8, 2016, at 1:09 PM, Sean Silva via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>
>>>>
>>>> On Tue, Mar 8, 2016 at 10:42 AM, Richard Smith via llvm-dev <
>>>> llvm-dev at lists.llvm.org> wrote:
>>>>
>>>>> On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>> > I have just benchmarked building trunk llvm and clang in Debug,
>>>>> > Release and LTO modes (see the attached script for the cmake lines).
>>>>> >
>>>>> > The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
>>>>> > cases I used the system libgcc and libstdc++.
>>>>> >
>>>>> > For release builds there is a monotonic increase in each version, from
>>>>> > 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
>>>>> > 5.3.2 takes 205 minutes.
>>>>> >
>>>>> > Debug and LTO show an improvement in 3.7, but have regressed again
>>>>> > in 3.8.
>>>>>
>>>>> I'm curious how these times divide across Clang and various parts of
>>>>> LLVM; rerunning with -ftime-report and summing the numbers across all
>>>>> compiles could be interesting.
>>>>>
>>>>
>>>> Based on the results I posted upthread about the relative time spent in
>>>> the backend for debug vs release, we can estimate this.
>>>> To summarize:
>>>> 10% of time spent in LLVM for Debug
>>>> 33% of time spent in LLVM for Release
>>>> (I'll abbreviate "in LLVM" as just "backend"; this is "backend" from
>>>> clang's perspective)
>>>>
>>>> Let's look at the difference between 3.5 and trunk.
>>>>
>>>> For debug, the user time jumps from 174m50.251s to 197m9.932s.
>>>> That's {10490.3, 11829.9} seconds, respectively.
>>>> For release, the corresponding numbers are:
>>>> {9826.71, 12714.3} seconds.
>>>>
>>>> debug35 = 10490.251
>>>> debugTrunk = 11829.932
>>>>
>>>> debugTrunk/debug35 == 1.12771
>>>> debugRatio = 1.12771
>>>>
>>>> release35 = 9826.705
>>>> releaseTrunk = 12714.288
>>>>
>>>> releaseTrunk/release35 == 1.29385
>>>> releaseRatio = 1.29385
>>>>
>>>> For simplicity, let's use a simple linear model for the distribution of
>>>> slowdown between the frontend and backend: a constant factor slowdown for
>>>> the backend, and an independent constant factor slowdown for the frontend.
>>>> This gives the following linear system:
>>>> debugRatio = .1 * backendRatio + (1 - .1) * frontendRatio
>>>> releaseRatio = .33 * backendRatio + (1 - .33) * frontendRatio
>>>>
>>>> Solving this linear system we find that under this simple model, the
>>>> expected slowdown factors are:
>>>> backendRatio = 1.77783
>>>> frontendRatio = 1.05547
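
For anyone who wants to rerun the model on other data points, the 2x2 system above can be solved directly with Cramer's rule. The Python sketch below is one way to do it; the function name and parameter names are hypothetical, and the 10% and 33% backend shares are the estimates quoted in the thread.

```python
def solve_slowdown(debug_ratio, release_ratio,
                   debug_backend_share=0.10, release_backend_share=0.33):
    """Solve
        debug_ratio   = d * backend + (1 - d) * frontend
        release_ratio = r * backend + (1 - r) * frontend
    for (backend, frontend) using Cramer's rule."""
    a11, a12 = debug_backend_share, 1.0 - debug_backend_share
    a21, a22 = release_backend_share, 1.0 - release_backend_share
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("backend shares must differ between the two builds")
    backend = (debug_ratio * a22 - a12 * release_ratio) / det
    frontend = (a11 * release_ratio - a21 * debug_ratio) / det
    return backend, frontend


# User times in seconds for clang 3.5 vs trunk, from the thread.
debug_ratio = 11829.932 / 10490.251
release_ratio = 12714.288 / 9826.705

backend_ratio, frontend_ratio = solve_slowdown(debug_ratio, release_ratio)
print(f"backendRatio  = {backend_ratio:.5f}")
print(f"frontendRatio = {frontend_ratio:.5f}")
```

The model is only as good as the two backend shares: the closer they are to each other, the smaller the determinant and the more sensitive the result becomes to measurement noise.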
>>>>
>>>> Intuitively, backendRatio comes out larger in this comparison because
>>>> we see the biggest slowdown during release (1.29 vs 1.12), and during
>>>> release we are spending a larger fraction of time in the backend (33% vs
>>>> 10%).
>>>>
>>>> Applying this same model across Rafael's data, we find the following
>>>> (numbers have been rounded for clarity):
>>>>
>>>> transition       backendRatio   frontendRatio
>>>> 3.5->3.6         1.08           1.03
>>>> 3.6->3.7         1.30           0.95
>>>> 3.7->3.8         1.34           1.07
>>>> 3.8->trunk       0.98           1.02
>>>>
>>>> Note that in Rafael's measurements LTO is pretty similar to Release
>>>> from a CPU time (user time) standpoint. While the final LTO link takes a
>>>> large amount of real time, it is single threaded. Based on the real time
>>>> numbers the LTO link was only spending about 20 minutes single-threaded
>>>> (i.e. about 20 minutes CPU time), which is pretty small compared to the
>>>> 300-400 minutes of total CPU time. It would be interesting to see the
>>>> numbers for -O0 or -O1 per-TU together with LTO.
>>>>
>>>>
>>>>
>>>> Just a note about LTO being sequential: Rafael mentioned he was
>>>> "building trunk llvm and clang". By default I believe it is ~56 link
>>>> targets that can be run in parallel (provided you have enough RAM to avoid
>>>> swapping).
>>>>
>>>
>>> D'oh! I was looking at the data wrong since I broke my Fundamental Rule
>>> of Looking At Data, namely: don't look at raw numbers in a table since you
>>> are likely to look at things wrong or form biases based on the order in
>>> which you look at the data points; *always* visualize. There is a
>>> significant difference between release and LTO. About 2x consistently.
>>>
>>> [image: Inline image 3]
>>>
>>> This is actually curious because during the release build, we were
>>> spending 33% of CPU time in the backend (as clang sees it; i.e. mid-level
>>> optimizer and codegen). This data is inconsistent with LTO simply being
>>> another run through the backend (which would be just +33% CPU time at
>>> worst). There seems to be something nonlinear happening.
>>> To make it worse, the LTO build has approximately a full Release
>>> optimization running per-TU, so the actual LTO step should be seeing
>>> inlined/"cleaned up" IR which should be much smaller than what the per-TU
>>> optimizer is seeing, so naively it should take *even less* than "another
>>> 33% CPU time" chunk.
>>> Yet we see 1.5x-2x difference:
>>>
>>> [image: Inline image 4]
>>>
>>> -- Sean Silva
>>>
>>>
>>>>
>>>> --
>>>> Mehdi
>>>>
>>>>
>>>
>>>
>>>
>>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2016-03-08 at 5.29.21 PM.png
Type: image/png
Size: 36008 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160309/1292e710/attachment-0002.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screen Shot 2016-03-08 at 5.45.54 PM.png
Type: image/png
Size: 39766 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20160309/1292e710/attachment-0003.png>

