[llvm-dev] llvm and clang are getting slower

Sean Silva via llvm-dev llvm-dev at lists.llvm.org
Wed Mar 9 13:55:06 PST 2016


On Wed, Mar 9, 2016 at 12:38 PM, Xinliang David Li <xinliangli at gmail.com>
wrote:

> The LTO time could be explained by a second-order effect: increased
> dcache/dtlb pressure from the larger memory footprint and poor locality.
>

Actually thinking more about this, I was totally wrong. Mehdi said that we
LTO ~56 binaries. If we naively assume that each binary is like clang and
links in "everything", and that the LTO process takes CPU time equivalent to
"-O3 for every TU", then we would expect to see +33% *for each binary*, i.e.
a total increase of more than 1800% vs Release (56 * 33% ~= 1850%). Clearly
that is not happening since the actual overhead is only 50%-100%, so we need
a more refined explanation.

There are a couple of factors that I can think of:
a) there are 56 binaries being LTO'd (this will tend to increase our
estimate)
b) not all 56 binaries are the size of clang (this will tend to decrease
our estimate)
c) per-TU processing is only doing mid-level optimizations and no codegen
(this will tend to decrease our estimate)
d) IR seen during LTO has already been "cleaned up" and has less overall
size/amount of optimizations that will apply during the LTO process (this
will tend to decrease our estimate)
e) comdat folding in the linker means that we only codegen each comdat
function once, rather than once per TU that defines it (this will tend to
decrease our estimate)

Starting from a (normalized) release build with
releaseBackend = .33
releaseFrontend = .67
release = releaseBackend + releaseFrontend  = 1

Let us try to obtain
LTO = (some expression involving releaseFrontend and releaseBackend) = 1.5-2

For starters, let us apply a), with a naive assumption that for each of the
numBinaries = 52 binaries we add the cost of releaseBackend (I just checked
and 52 is the exact number for LLVM+Clang+LLD+clang-tools-extra, ignoring
symlinks). This gives
LTO = release + 52 * releaseBackend = 1 + 52 * .33 = 18.16, which is way
high.

Let us apply b). A quick check gives 371,515,392 total bytes of text in a
release build across all 52 binaries (Mac, x86_64). Clang is 45,182,976
bytes of text. So using final text size in Release as an indicator of the
total code seen by the LTO process, we can use a coefficient of 1/8, i.e.
the average binary links in about avgTextFraction = 1/8 of "everything".
LTO = release + 52 * (.125 * releaseBackend) = 3.14

We are still high. For c), let us assume that half of releaseBackend is
spent after mid-level optimizations. So let codegenFraction = .5 be the
fraction of releaseBackend that is spent after mid-level optimizations. We
can discount this time from the LTO build since it does not do that work
per-TU.
LTO = release + 52 * (.125 * releaseBackend) - (codegenFraction *
releaseBackend) = 2.98
So this is not a significant reduction.
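
For concreteness, here is a quick Python sketch of steps a) through c)
(the variable names are just the estimates defined above, not measured
values):

    releaseBackend  = 0.33
    release         = 1.0    # releaseFrontend + releaseBackend, normalized
    numBinaries     = 52
    avgTextFraction = 0.125  # ~45MB of clang text out of ~371MB total
    codegenFraction = 0.5    # guess: half of releaseBackend is post-mid-level

    lto_a = release + numBinaries * releaseBackend                    # a): ~18.2
    lto_b = release + numBinaries * avgTextFraction * releaseBackend  # b): ~3.14
    lto_c = lto_b - codegenFraction * releaseBackend                  # c): ~2.98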

I don't have a reasonable estimate a priori for d) or e), but altogether
they reduce to a constant factor otherSavingsFraction that multiplies the
second term
LTO = release + 52 * (.125 * otherSavingsFraction * releaseBackend) -
(codegenFraction * releaseBackend) =? 1.5-2x

Given the empirical data, this suggests that otherSavingsFraction must have
a value somewhere between roughly 1/3 (for a 1.5x total) and 1/2 (for a 2x
total), which seems reasonable.
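
Continuing the sketch above, solving the last equation for
otherSavingsFraction at both ends of the observed range:

    for target in (1.5, 2.0):
        f = (target - release + codegenFraction * releaseBackend) \
            / (numBinaries * avgTextFraction * releaseBackend)
        print(target, round(f, 2))   # 1.5 -> 0.31, 2.0 -> 0.54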

For a moment I was rather surprised that we could have 52 binaries and it
would be only 2x, but this closer examination shows that between
avgTextFraction = .125 and releaseBackend = .33 the "52" is brought under
control.

-- Sean Silva


>
> David
>
> On Tue, Mar 8, 2016 at 5:47 PM, Sean Silva via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>>
>> On Tue, Mar 8, 2016 at 2:25 PM, Mehdi Amini <mehdi.amini at apple.com>
>> wrote:
>>
>>>
>>> On Mar 8, 2016, at 1:09 PM, Sean Silva via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>
>>>
>>> On Tue, Mar 8, 2016 at 10:42 AM, Richard Smith via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> On Tue, Mar 8, 2016 at 8:13 AM, Rafael Espíndola
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>> > I have just benchmarked building trunk llvm and clang in Debug,
>>>> > Release and LTO modes (see the attached script for the cmake lines).
>>>> >
>>>> > The compilers used were clang 3.5, 3.6, 3.7, 3.8 and trunk. In all
>>>> > cases I used the system libgcc and libstdc++.
>>>> >
>>>> > For release builds there is a monotonic increase in each version. From
>>>> > 163 minutes with 3.5 to 212 minutes with trunk. For comparison, gcc
>>>> > 5.3.2 takes 205 minutes.
>>>> >
>>>> > Debug and LTO show an improvement in 3.7, but have regressed again in
>>>> 3.8.
>>>>
>>>> I'm curious how these times divide across Clang and various parts of
>>>> LLVM; rerunning with -ftime-report and summing the numbers across all
>>>> compiles could be interesting.
>>>>
>>>
>>> Based on the results I posted upthread about the relative time spent in
>>> the backend for debug vs release, we can estimate this.
>>> To summarize:
>>> 10% of time spent in LLVM for Debug
>>> 33% of time spent in LLVM for Release
>>> (I'll abbreviate "in LLVM" as just "backend"; this is "backend" from
>>> clang's perspective)
>>>
>>> Let's look at the difference between 3.5 and trunk.
>>>
>>> For debug, the user time jumps from 174m50.251s to 197m9.932s.
>>> That's {10490.3, 11829.9} seconds, respectively.
>>> For release, the corresponding numbers are:
>>> {9826.71, 12714.3} seconds.
>>>
>>> debug35 = 10490.251
>>> debugTrunk = 11829.932
>>>
>>> debugTrunk/debug35 == 1.12771
>>> debugRatio = 1.12771
>>>
>>> release35 = 9826.705
>>> releaseTrunk = 12714.288
>>>
>>> releaseTrunk/release35 == 1.29385
>>> releaseRatio = 1.29385
>>>
>>> For simplicity, let's use a simple linear model for the distribution of
>>> slowdown between the frontend and backend: a constant factor slowdown for
>>> the backend, and an independent constant factor slowdown for the frontend.
>>> This gives the following linear system:
>>> debugRatio = .1 * backendRatio + (1 - .1) * frontendRatio
>>> releaseRatio = .33 * backendRatio + (1 - .33) * frontendRatio
>>>
>>> Solving this linear system we find that under this simple model, the
>>> expected slowdown factors are:
>>> backendRatio = 1.77783
>>> frontendRatio = 1.05547
>>>
>>> Intuitively, backendRatio comes out larger in this comparison because we
>>> see the biggest slowdown during release (1.29 vs 1.12), and during release
>>> we are spending a larger fraction of time in the backend (33% vs 10%).
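>>>
>>> As a quick sanity check, the 2x2 solve can be reproduced in a few lines
>>> of Python (just a sketch, reusing the .10/.33 backend fractions and the
>>> 3.5-vs-trunk ratios from above):
>>>
>>>   debugRatio, releaseRatio = 1.12771, 1.29385
>>>   # debugRatio   = .10 * backendRatio + .90 * frontendRatio
>>>   # releaseRatio = .33 * backendRatio + .67 * frontendRatio
>>>   backendRatio  = (releaseRatio * .90 - debugRatio * .67) / (.33 * .90 - .10 * .67)
>>>   frontendRatio = (debugRatio - .10 * backendRatio) / .90
>>>   # -> backendRatio ~= 1.778, frontendRatio ~= 1.055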
>>>
>>> Applying this same model across Rafael's data, we find the following
>>> (numbers have been rounded for clarity):
>>>
>>> transition       backendRatio   frontendRatio
>>> 3.5->3.6         1.08           1.03
>>> 3.6->3.7         1.30           0.95
>>> 3.7->3.8         1.34           1.07
>>> 3.8->trunk       0.98           1.02
>>>
>>> Note that in Rafael's measurements LTO is pretty similar to Release from
>>> a CPU time (user time) standpoint. While the final LTO link takes a large
>>> amount of real time, it is single threaded. Based on the real time numbers
>>> the LTO link was only spending about 20 minutes single-threaded (i.e. about
>>> 20 minutes CPU time), which is pretty small compared to the 300-400 minutes
>>> of total CPU time. It would be interesting to see the numbers for -O0 or
>>> -O1 per-TU together with LTO.
>>>
>>>
>>>
>>> Just a note about LTO being sequential: Rafael mentioned he was
>>> "building trunk llvm and clang". By default I believe it is ~56 link
>>> targets that can be run in parallel (provided you have enough RAM to avoid
>>> swapping).
>>>
>>
>> D'oh! I was looking at the data wrong because I broke my Fundamental Rule
>> of Looking At Data, namely: don't read raw numbers off a table, since you
>> are likely to misread things or form biases based on the order in which
>> you look at the data points; *always* visualize. There is a significant
>> difference between release and LTO: about 2x, consistently.
>>
>> [image: Inline image 3]
>>
>> This is actually curious because during the release build, we were
>> spending 33% of CPU time in the backend (as clang sees it; i.e. mid-level
>> optimizer and codegen). This data is inconsistent with LTO simply being
>> another run through the backend (which would be just +33% CPU time at
>> worst). There seems to be something nonlinear happening.
>> To make it worse, the LTO build runs approximately a full Release
>> optimization pipeline per-TU, so the actual LTO step should be seeing
>> inlined/"cleaned up" IR that is much smaller than what the per-TU
>> optimizer sees; naively, it should therefore take *even less* than
>> another "+33% CPU time" chunk.
>> Yet we see a 1.5x-2x difference:
>>
>> [image: Inline image 4]
>>
>> -- Sean Silva
>>
>>
>>>
>>> --
>>> Mehdi
>>>
>>>
>>
>>
>
[Attachments: the two charts referenced inline above --
Screen Shot 2016-03-08 at 5.29.21 PM.png
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160309/f27e6d77/attachment-0002.png>
Screen Shot 2016-03-08 at 5.45.54 PM.png
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20160309/f27e6d77/attachment-0003.png>]

