[LLVMdev] Postponing more passes in LTO

Nick Lewycky nicholas at mxc.ca
Wed Jan 7 22:58:48 PST 2015


Sean Silva wrote:
> Also, a couple notes about the use of standard deviation (and general
> stuff about benchmarking):
>
> 1. It is *vital* to keep in mind the distinction between a "sample
> statistic" and a "population statistic". A sample statistic is a
> statistic derived from the dataset of measured values, while a
> population statistic is a mathematical function of the underlying
> probability distribution of the phenomenon (some abstract "population"
> of possibilities) being sampled.
>
> 2. As you take more measurements, the standard deviation sample
> statistic converges to the standard deviation population statistic
> regardless of the shape of the underlying population distribution
> (provided that distribution's variance is finite).
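>
> As a minimal, purely illustrative sketch of this convergence (Python
> with NumPy; the exponential distribution is an arbitrary stand-in
> whose population standard deviation is 1):
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)  # fixed seed for reproducibility
>     for n in (10, 100, 10_000, 1_000_000):
>         sample = rng.exponential(scale=1.0, size=n)
>         # The sample statistic (ddof=1) approaches the population
>         # value of 1.0 as n grows.
>         print(n, np.std(sample, ddof=1))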
>
> 3. Nearly *every* probability distribution has a defined standard
> deviation (it is just a mathematical function of the distribution;
> heavy-tailed cases like the Cauchy distribution are rare exceptions).
> For a Gaussian
> distribution, the standard deviation (in conjunction with the mean)
> completely describes the probability distribution. This is emphatically
> *not* the case for all probability distributions [1]. In order to derive
> insight from a standard deviation sample statistic, you *must* a priori
> have a good reason to believe that the underlying probability
> distribution's standard deviation provides useful information about the
> actual distribution.
>
> 4. Any probability distribution can be made to have any standard
> deviation without changing its distribution shape (technically, at most
> a linear transformation of the coordinate is needed). To pick a random
> example, for a highly bimodal distribution the standard deviation
> doesn't really give you any useful information about the overall shape
> (and it doesn't correspond to the spacing between the modes
> either).
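>
> A minimal sketch of points 3 and 4 (Python with NumPy; the two
> distributions are arbitrary illustrations): both samples below have
> mean 0 and standard deviation 1, yet completely different shapes that
> only something like a histogram would reveal:
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)
>     gaussian = rng.normal(loc=0.0, scale=1.0, size=100_000)
>     # Equal-weight modes at -1 and +1: mean 0, standard deviation 1.
>     bimodal = rng.choice([-1.0, 1.0], size=100_000)
>     for name, data in (("gaussian", gaussian), ("bimodal", bimodal)):
>         print(name, data.mean(), data.std())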
>
> 5. The standard deviation is nothing more than a root-mean-square
> average (of the deviations from the mean). Basically, it is a very
> popular statistic because there is an *extremely powerful* way to
> model phenomena in which the root-mean-square average is the actual
> meaningful quantity. The theory I'm talking about is used extensively in almost
> every science and engineering discipline *outside of* software
> engineering (and the purely digital part of hardware) [2]. Since this
> modeling technique is rarely directly relevant in software, there is
> little reason to assume that the standard deviation is a good statistic
> (there may be a good reason to, but I haven't seen one).
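>
> Concretely (Python with NumPy; the timings are made up), the standard
> deviation is literally the root-mean-square of the deviations from
> the mean:
>
>     import numpy as np
>
>     x = np.array([3.1, 2.9, 3.4, 3.0, 5.2])  # hypothetical timings
>     rms = np.sqrt(np.mean((x - x.mean()) ** 2))
>     print(rms, np.std(x))  # identical (np.std defaults to ddof=0)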
>
> Okay, so I've cast doubt on the use of standard deviation (and actually
> an argument could be made against the mean too for similar reasons). On
> a more constructive note, what *is* a good small set of parameters to
> summarize measurements of program run time? One interesting model that
> was brought to my attention by Nick Lewycky is based on the following
> observation: most "perturbations" of the run time of a fixed program
> binary on a fixed chip and fixed OS are due to externalities that
> contend for (or otherwise deprive the program of) execution resources
> and therefore almost always make it *slower*.
>
> The model is then that there is a "true" execution time for the program,
> which is a completely deterministic function of the program, chip, and
> OS; intuitively, this is the time observed when the program is e.g. able to
> execute by itself, on a single execution unit of the CPU, with no
> interrupts knocking it off CPU, and no other threads using any of the
> execution resources of this CPU. On top of this, we assume that there is
> a set of perturbations that contend for execution resources and slow the
> program down.
>
> For this model, it makes sense to approximate the "true" value by the
> minimum of the recorded values (the perturbations only ever add time).
> If we want to use two numbers to summarize the data, it probably makes
> sense to look at a way of characterizing the extent of the variation
> from this "true" value. In the absence of proper characterization of
> the perturbations (such as the distribution of when they occur and how
> much effect they have when they occur), one simple solution is the
> maximum. Max and min might be *too* sensitive to the perturbations
> (especially the disk cache of the program itself on the first run), so
> a pair of percentiles might be a bit more reliable, such as the 10th
> and 90th percentiles.
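>
> A minimal sketch of summarizing one benchmark's runs under this model
> (Python with NumPy; the timings array is hypothetical):
>
>     import numpy as np
>
>     timings = np.array([10.4, 10.1, 10.2, 14.8, 10.1, 10.3, 11.9, 10.2])
>     true_estimate = timings.min()  # the least-perturbed observed run
>     p10, p90 = np.percentile(timings, [10, 90])  # robust spread summary
>     print(true_estimate, p10, p90)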
>
> Next time I'm measuring program run time, I'm going to try out this
> model. It should be possible to look at a higher-degree-of-freedom
> representation of the dataset, like a histogram of the distribution, in
> order to evaluate how well this two-parameter characterization of the
> data works compared with the typical mean and stddev. Also, ultimately
> we only need as much precision here as is necessary for what we are
> doing with the summarized data, which typically is to alert us of a
> significant change. On a sufficiently quiet machine, the variance of the
> measurements might be significantly smaller than the thresholds that we
> consider "significant" so that the mean, median, max, min, 10%-ile etc.
> are all so close that it does not matter which we use, and we can
> summarize the data with just a single number, which can be any one of
> them (and just trigger an error if total variation exceeds a set amount).
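>
> Continuing the sketch above (the 5% threshold is hypothetical):
>
>     threshold = 0.05  # total variation we are willing to call "quiet"
>     spread = (timings.max() - timings.min()) / timings.min()
>     if spread > threshold:
>         print("machine too noisy; fall back to the full model")
>     else:
>         # Quiet enough: any single statistic summarizes the data.
>         print("summary:", timings.min())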
>
> However, the noisier the machine (and hence our data), the more
> important it is to properly model and analyze things to avoid coming to
> a faulty conclusion (note: reliable conclusions can be made from data
> with large amounts of noise as long as there is a good model which
> allows isolating the data we are interested in from the noise [3]).
>
> Nick, I forget from when we talked, but did you guys ever settle on a
> model like this for your internal benchmarking?

We have not. Our current system (very quiesced machines) works well 
enough for our current purposes.

Nick

> [1] If you are familiar with Fourier theory, using only two statistics
> to describe a probability distribution is sort of like using only two
> Fourier components to describe a function. In order to conclude
> *anything* you must a priori know that the other components are not
> important!
>
> [2] The modeling technique I'm talking about is decomposition of
> square-integrable function spaces into an orthonormal basis (this is by
> no means the most general way of describing the idea). This is a
> far-reaching concept and pops up under many names. Things like
> "frequency domain", "Fourier transform", "Laplace transform", and
> "spectrum" are among the most elementary. More advanced ones are
> "wavelets", "Hilbert space", "eigenfunctions of a linear operator". A
> smattering of use cases: determining the shape of the electron orbitals
> of hydrogen, designing the fuel control system of a jet engine to make
> sure the right amount of fuel is being provided at any given moment,
> designing the circuits sitting at the end of a communication line (or a
> cell phone antenna) that have to detect incoming signals with the least
> chance of error, analyzing the structure of molecules based on X-ray
> diffraction, determining the chemical composition of the atmosphere of a
> planet orbiting another star.
>
> [3] For example if you know that there is Gaussian noise (even a very
> large amount) on top of an underlying value, then the underlying value
> is just the mean of the population (which will be a Gaussian
> distribution) and can be reliably determined from the mean sample statistic.
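>
> A minimal sketch of [3] (Python with NumPy; the numbers are made up):
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)
>     underlying = 42.0
>     # Gaussian noise with stddev 100 -- far larger than the value itself.
>     noisy = underlying + rng.normal(scale=100.0, size=1_000_000)
>     print(noisy.mean())  # ~42: the sample mean recovers the value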
>
>
> On Thu, Dec 18, 2014 at 4:10 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
>     In the future could you please do some sort of visualization of your
>     data, or at least provide the raw data in a machine-readable format
>     so that others can do so?
>
>     It is incredibly easy to come to incorrect conclusions when looking
>     at lists of numbers because at any given moment you have a very
>     localized view of the dataset and are prone to locally pattern-match
>     and form a selection bias that corrupts your ability to make a
>     proper decision in the context of the entire dataset. Even if you go
>     on to look at the rest of the data, this selection bias limits your
>     ability to come to a correct "global" conclusion.
>
>     Appropriate, reliable summary statistics can also help, but they are
>     not a panacea. In using, say, 2 summary statistics (e.g. mean and standard
>     deviation), one is discarding a large number of degrees of freedom
>     from the dataset. This is fine if you have good reason to believe
>     that these 2 degrees of freedom adequately explain the underlying
>     dataset (e.g. there is a sound theoretical description of the
>     phenomenon being measured that suggests it should follow a Gaussian
>     distribution; hence mean and stddev completely characterize the
>     distribution). However, in the world of programs and compiler
>     optimizations, there is very rarely a good reason to believe that
>     any particular dataset (e.g. benchmark results for SPEC for a
>     particular optimization) is explained by a handful of common summary
>     statistics, and so looking only at summary statistics can often
>     conceal important insights into the data (or even be actively
>     misleading). This is especially true when looking across different
>     programs (I die a little inside every time I see someone cite a SPEC
>     geomean).
>
>     In compilers we are usually more interested in actually discovering
>     *which parameters* are responsible for variation, rather than
>     increasing confidence in the values of an a priori set of known
>     parameters. E.g. if you are measuring the time it takes a laser beam
>     to bounce off the moon and come back to you (in order to measure the
>     distance of the moon) you have an a priori known set of parameters
>     that well-characterize the data you obtain, based on your knowledge
>     of the measurement apparatus, atmospheric dispersion, the fact that
>     you know the moon is moving in an orbit, etc. You can perform
>     repeated measurements with the apparatus to narrow in on the values
>     of the parameters. In compilers, we rarely have such a simple and
>     small set of parameters that are known to adequately characterize
>     the data we are trying to understand; when investigating an
>     optimization's results, we are almost always investigating a
>     situation that would resemble (in the moon-bounce analogy) an
>     unknown variation that turns out to be due to whether the lab
>     assistant is leaning against the apparatus or not. You're not going
>     to find out that the lab assistant's pose is at fault by looking at
>     your "repeatedly do them to increase confidence in the values"
>     measurements (e.g. the actual moon-bounce measurements; or looking
>     at the average time for a particular SPEC benchmark to run); you
>     find it by getting up and going to the room with the apparatus and
>     investigating all manner of things until you narrow in on the lab
>     assistant's pose (usually this takes the form of having to dig
>     around in assembly, extract kernels, time particular sub-things,
>     profile things, look at how the code changes throughout the
>     optimization pipeline, instrument things, etc.; there are tons of
>     places for the root cause to hide).
>
>     If you have to remember something from that last paragraph, remember
>     that not everything boils down to clicking "run" and getting a timing
>     for SPEC. Often you need to take some time to narrow in on the damn lab
>     assistant. Sometimes just the timing of a particular benchmark leads
>     to a "lab assistant" situation (although hopefully this doesn't
>     happen too often; it does happen though: e.g. I have been in a
>     situation where a benchmark surprisingly goes 50% faster on about
>     1 in 10 runs). When working across different programs, you are almost
>     always in a "lab assistant" situation.
>
>     -- Sean Silva
>
>     On Mon, Dec 15, 2014 at 11:27 AM, Daniel Stewart
>     <stewartd at codeaurora.org> wrote:
>
>         I have done some preliminary investigation into postponing some
>         of the passes to see what the resulting performance impact would
>         be. This is a fairly crude attempt at moving passes around to
>         see if there is any potential benefit. I have attached the patch
>         I used to do the tests, in case anyone is interested.
>
>         Briefly, the patch allows two different flows, with either a
>         flag of -lto-new or -lto-new2. In the first case, the
>         vectorization passes are postponed from the end of the
>         populateModulePassManager() function to midway through
>         addLTOOptimizationPasses(). In the second case, essentially the
>         entire populateModulePassManager() function is deferred until
>         link time.
>
>         I ran spec2000/2006 on an ARM platform (Nexus 4), comparing 4
>         configurations (O3, O3 LTO, O3 LTO new, O3 LTO new 2). I have
>         attached a PDF presenting the results from the test. The first 4
>         columns have the spec result (ratio) for the 4 different
>         configurations. The second set of columns shows each result
>         divided by the maximum result across the 4 configurations. I
>         used this last set to see how well or poorly a particular
>         configuration did in comparison to the others.
>
>         In general, there appears to be some benefit to be gained in a
>         couple of the benchmarks (spec2000/art, spec2006/milc) by
>         postponing vectorization.
>
>         I just wanted to present this information to the community to
>         see if there is interest in pursuing the idea of postponing passes.
>
>         Daniel
>
>         From: llvmdev-bounces at cs.uiuc.edu
>         [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Daniel Stewart
>         Sent: Wednesday, September 17, 2014 9:46 AM
>         To: llvmdev at cs.uiuc.edu
>         Subject: [LLVMdev] Postponing more passes in LTO
>
>         Looking at the existing flow of passes for LTO, it appears that
>         almost all passes are run on a per-file basis, before the call to
>         the gold linker. I’m looking to get people’s feedback on whether
>         there would be an advantage to waiting to run a number of these
>         passes until the linking stage. For example, I believe I saw a
>         post a little while back about postponing vectorization until
>         the linking stage. It seems to me that there could be an
>         advantage to postponing (some) passes until the linking stage,
>         where the entire graph is available. In general, what do people
>         think about the idea of a different flow of LTO where more
>         passes are postponed until the linking stage?
>
>         Daniel Stewart
>
>         --
>
>         Qualcomm Innovation Center, Inc. is a member of Code Aurora
>         Forum, hosted by The Linux Foundation
>
>



