[LLVMdev] Postponing more passes in LTO

Thu Jan 8 00:55:21 PST 2015

On Thu, Dec 18, 2014 at 2:16 PM, Sean Silva <chisophugis at gmail.com> wrote:

> Okay, so I've cast doubt on the use of standard deviation (and actually an
> argument could be made against the mean too for similar reasons). On a more
> constructive note, what *is* a good small set of parameters to summarize
> measurements of program run time? One interesting model that was brought to
> my attention by Nick Lewycky is based on the following observation: most
> "perturbations" of the run time of a fixed program binary on a fixed chip
> and fixed OS are due to externalities that contend for (or otherwise
> deprive the program of) execution resources and therefore almost always
> make it *slower*.
>
> The model is then that there is a "true" execution time for the program,
> which is a completely deterministic function of the program, chip, and OS;
> intuitively, this is time occurs when the program is e.g. able to execute
> by itself, on a single execution unit of the CPU, with no interrupts
> knocking it off CPU, and no other threads using any of the execution
> resources of this CPU. On top of this, we assume that there is a set of
> perturbations that contend for execution resources and slow the program
> down.
>
> For this model, it makes sense to approximate the "true" value by the
> maximum of the recorded values. If we want to use two numbers to summarize
> the data, it probably makes sense to look at a way of characterizing the
> extent of the variation from this "true" value. In the absence of proper
> characterization of the perturbations (such as the distribution of when
> they occur and how much effect they have when they occur), one simple
> solution is the minimum. Max and min might be *too* sensitive to the
> perturbations (especially disk cache of the program itself on the first
> run), so a pair of percentiles might be a bit more reliable, such as 10th
> percentile and 90th percentile.
>

I used to subscribe to this thinking. I no longer do, and I think it has
serious problems. Let me try to explain with a hopefully entertaining
analogy. =D

Back in physics class we would analyze the behavior of basic principles of
physics as they would act upon a brave cadre of scientifically savvy
penguins sliding down slopes under the effect of gravity. Naturally, these
slopes were very slick ice, and the physics problems would be solved
"neglecting friction"[1]. That made the analysis much more tractable and
predictable. It also makes it really inaccurate in the real world.

Now, I'm not actually deriding this approach to physics. The results from
neglecting friction are *very* informative as to the basic principles at
work. In a very similar way, I don't think the model of a "true" execution
time is very informative, but much more about the principles at work than
about the actual performance of the software.

The reality I have observed these days, especially in larger software
systems, is that there is inherently no deterministic and representative
execution to rely on for finding "true" execution time. And too many of the
random fluctuations are actually by-products of the program itself. Some
examples:

- If the program runs uninterrupted on a single core, it may cause that
core to be throttled due to the thermals generated. This can have the
effect of lowering the overall performance more than what you would see if
the program ran *just* under that level without repeated drops in frequency.

- Different optimizations can change the thermal profile by using different
execution units. This is especially true for vector units.

- If you fully leverage the AVX unit of some intel chips, my understanding
is that regardless of the specific thermals, those units run at a lower
clock!

- Naturally, things like ASLR or other sources of correct but
non-deterministic layout of data in memory will cause fluctuations in the
rate at which cache misses occur. This is probably the most common cause of
non-external fluctuations in performance. Different generated code,
register spill frequencies, etc., can dramatically change these cache
collision rates.

Now, with all of these factors, we have to assume that in real world
scenarios, the programs will not always run at the maximal speed. Instead,
there is, much as you describe in your email, some distribution of
performance that will be observed in practice. The interesting question is
-- can generating different code from the compiler shift the distribution
in a meaningful way that is not correlated with the fastest measurement (or
the N fastest measurement). And I think the above examples are all places
where this can happen. =/

As a consequence, I am increasingly in favor of using very simple but
aggregated statistics for measuring performance and acknowledging the lack
of precision when it is present. We can still very often measure precisely
those localized improvements provided by a specific change even when very
small. We just won't see a reflection from them in many cases when
measuring large systems that have a high noise floor. Hopefully, we can at
least observe good trends in the large system to confirm we are making
progress, and at times we can make large improvements that actually rise
above the noise floor.

Anyways, my somewhat rambling 2 cents on this. Sorry for the not entirely
structured email. =]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150108/7513e845/attachment.html>