[LLVMdev] Postponing more passes in LTO
Nick Lewycky
nicholas at mxc.ca
Wed Jan 7 22:58:48 PST 2015
Sean Silva wrote:
> Also, a couple notes about the use of standard deviation (and general
> stuff about benchmarking):
>
> 1. It is *vital* to keep in mind the distinction between a "sample
> statistic" and a "population statistic". A sample statistic is a
> statistic derived from the dataset of measured values, while a
> population statistic is a mathematical function of the underlying
> probability distribution of the phenomenon (some abstract "population"
> of possibilities) being sampled.
>
> 2. As you take more measurements, the standard deviation sample
> statistic converges to the standard deviation population statistic
> regardless of the underlying population distribution.
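>
> For instance, a minimal numpy sketch (synthetic data, purely
> illustrative) of that convergence:
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)
>     # Exponential distribution: its population standard deviation
>     # equals the scale parameter, 1.0 here.
>     for n in (10, 1000, 100000):
>         sample = rng.exponential(scale=1.0, size=n)
>         print(n, sample.std(ddof=1))  # converges to 1.0 as n grows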
>
> 3. *Every* probability distribution has a defined standard deviation (it
> is just a mathematical function of the distribution). For a Gaussian
> distribution, the standard deviation (in conjunction with the mean)
> completely describes the probability distribution. This is emphatically
> *not* the case for all probability distributions [1]. In order to derive
> insight from a standard deviation sample statistic, you *must* a priori
> have a good reason to believe that the underlying probability
> distribution's standard deviation provides useful information about the
> actual distribution.
>
> 4. Any probability distribution can be made to have any standard
> deviation without changing its distribution shape (technically, at most
> a linear transformation of the coordinate is needed). To pick a random
> example, for a highly bimodal distribution the standard deviation
> doesn't really give you any useful information about the overall shape
> (and it doesn't exactly correspond to the spacing between the modes
> either).
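>
> As a made-up illustration: two tight clusters of run times around
> 1.0s and 1.5s yield a mean of ~1.25 and a standard deviation of
> ~0.25, i.e. "1.25 +/- 0.25" describes times the program essentially
> never exhibits:
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)
>     # 500 runs near 1.0s and 500 near 1.5s: strongly bimodal.
>     times = np.concatenate([rng.normal(1.0, 0.01, 500),
>                             rng.normal(1.5, 0.01, 500)])
>     print(times.mean(), times.std())  # ~1.25, ~0.25; the shape is invisible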
>
> 5. The standard deviation is nothing more than a root-mean-square
> average (of the deviations from the mean). Basically, it is a very
> popular statistic because there is an
> *extremely powerful* way to model phenomena for which the
> root-mean-square average is the actual meaningful value that ties into
> the model. The theory I'm talking about is used extensively in almost
> every science and engineering discipline *outside of* software
> engineering (and the purely digital part of hardware) [2]. Since this
> modeling technique is rarely directly relevant in software, there is
> little reason to assume that the standard deviation is a good statistic
> (there may be a good reason to, but I haven't seen one).
>
> Okay, so I've cast doubt on the use of standard deviation (and actually
> an argument could be made against the mean too for similar reasons). On
> a more constructive note, what *is* a good small set of parameters to
> summarize measurements of program run time? One interesting model that
> was brought to my attention by Nick Lewycky is based on the following
> observation: most "perturbations" of the run time of a fixed program
> binary on a fixed chip and fixed OS are due to externalities that
> contend for (or otherwise deprive the program of) execution resources
> and therefore almost always make it *slower*.
>
> The model is then that there is a "true" execution time for the program,
> which is a completely deterministic function of the program, chip, and
> OS; intuitively, this is the time observed when the program is able to
> execute by itself: e.g. on a single execution unit of the CPU, with no
> interrupts knocking it off the CPU, and no other threads using any of the
> execution resources of this CPU. On top of this, we assume that there is
> a set of perturbations that contend for execution resources and slow the
> program down.
>
> For this model, it makes sense to approximate the "true" value by the
> minimum of the recorded values, since the perturbations only ever add
> time. If we want to use two numbers to summarize the data, it probably
> makes sense to look at a way of characterizing the extent of the
> variation from this "true" value. In the absence of proper
> characterization of the perturbations (such as the distribution of when
> they occur and how much effect they have when they occur), one simple
> solution is the maximum. Min and max might be *too* sensitive to the
> perturbations (especially disk cache of the program itself on the first
> run), so a pair of percentiles might be a bit more reliable, such as the
> 10th and 90th percentiles.
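>
> As a concrete sketch (with synthetic timings standing in for real
> measurements), summarizing a run under this model might look like:
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)
>     # Synthetic data: a 2.0s "true" time plus non-negative perturbations.
>     runs = 2.0 + rng.exponential(scale=0.05, size=100)
>     print("est. true time:", runs.min())  # perturbations only add time
>     print("10th/90th percentile:", np.percentile(runs, [10, 90]))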
>
> Next time I'm measuring program run time, I'm going to try out this
> model. It should be possible to look at a higher-degree-of-freedom
> representation of the dataset, like a histogram of the distribution, in
> order to evaluate how well this two-parameter characterization of the
> data works compared with the typical mean and stddev. Also, ultimately
> we only need as much precision here as is necessary for what we are
> doing with the summarized data, which typically is to alert us to a
> significant change. On a sufficiently quiet machine, the variance of the
> measurements might be significantly smaller than the thresholds that we
> consider "significant" so that the mean, median, max, min, 10%-ile etc.
> are all so close that it does not matter which we use, and we can
> summarize the data with just a single number, which can be any one of
> them (and just trigger an error if total variation exceeds a set amount).
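>
> A sketch of that single-number summary (the 1% noise_limit is a
> made-up knob, not a recommendation):
>
>     import numpy as np
>
>     def summarize(runs, noise_limit=0.01):
>         runs = np.asarray(runs)
>         spread = (runs.max() - runs.min()) / runs.min()
>         if spread > noise_limit:  # total variation exceeds the set amount
>             raise RuntimeError("noisy measurement: spread=%.1f%%"
>                                % (100 * spread))
>         return np.median(runs)  # any one summary statistic would do here
>
>     # e.g. summarize([2.003, 2.001, 2.002]) -> 2.002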
>
> However, the noisier the machine (and hence our data), the more
> important it is to properly model and analyze things to avoid coming to
> a faulty conclusion (note: reliable conclusions can be made from data
> with large amounts of noise as long as there is a good model which
> allows isolating the data we are interested in from the noise [3]).
>
> Nick, I forget from when we talked, but did you guys ever settle on a
> model like this for your internal benchmarking?
We have not. Our current system (very quiesced machines) works well
enough for our current purposes.
Nick
> [1] If you are familiar with Fourier theory, using only two statistics
> to describe a probability distribution is sort of like using only two
> Fourier components to describe a function. In order to conclude
> *anything* you must a priori know that the other components are not
> important!
>
> [2] The modeling technique I'm talking about is decomposition of
> square-integrable function spaces into an orthonormal basis (this is by
> no means the most general way of describing the idea). This is a
> far-reaching concept and pops up under many names. Things like
> "frequency domain", "Fourier transform", "Laplace transform", and
> "spectrum" are among the most elementary. More advanced ones are
> "wavelets", "Hilbert space", "eigenfunctions of a linear operator". A
> smattering of use cases: determining the shape of the electron orbitals
> of hydrogen, designing the fuel control system of a jet engine to make
> sure the right amount of fuel is being provided at any given moment,
> designing the circuits sitting at the end of a communication line (or a
> cell phone antenna) that have to detect incoming signals with the least
> chance of error, analyzing the structure of molecules based on X-ray
> diffraction, determining the chemical composition of the atmosphere of a
> planet orbiting another star.
>
> [3] For example if you know that there is Gaussian noise (even a very
> large amount) on top of an underlying value, then the underlying value
> is just the mean of the population (which will be a Gaussian
> distribution) and can be reliably determined from the mean sample statistic.
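>
> A quick check of this with synthetic data, just to illustrate:
>
>     import numpy as np
>
>     rng = np.random.default_rng(0)
>     true_value = 3.0
>     noisy = true_value + rng.normal(0.0, 10.0, 100000)  # huge Gaussian noise
>     print(noisy.mean())  # ~3.0, despite noise 3x larger than the value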
>
>
> On Thu, Dec 18, 2014 at 4:10 AM, Sean Silva <chisophugis at gmail.com> wrote:
>
> In the future could you please do some sort of visualization of your
> data, or at least provide the raw data in a machine-readable format
> so that others can do so?
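>
> Even a simple histogram goes a long way; a minimal matplotlib
> sketch (the file name "timings.txt" is just a placeholder for
> wherever the raw numbers live):
>
>     import numpy as np
>     import matplotlib.pyplot as plt
>
>     runs = np.loadtxt("timings.txt")  # one raw measurement per line
>     plt.hist(runs, bins=50)
>     plt.xlabel("run time (s)")
>     plt.ylabel("count")
>     plt.savefig("timings-hist.png")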
>
> It is incredibly easy to come to incorrect conclusions when looking
> at lists of numbers because at any given moment you have a very
> localized view of the dataset and are prone to locally pattern-match
> and form a selection bias that corrupts your ability to make a
> proper decision in the context of the entire dataset. Even if you go
> on to look at the rest of the data, this selection bias limits your
> ability to come to a correct "global" conclusion.
>
> Appropriate reliable summary statistics can also help, but are not
> a panacea. In using, say, 2 summary statistics (e.g. mean and standard
> deviation), one is discarding a large number of degrees of freedom
> from the dataset. This is fine if you have good reason to believe
> that these 2 degrees of freedom adequately explain the underlying
> dataset (e.g. there is a sound theoretical description of the
> phenomenon being measured that suggests it should follow a Gaussian
> distribution; hence mean and stddev completely characterize the
> distribution). However, in the world of programs and compiler
> optimizations, there is very rarely a good reason to believe that
> any particular dataset (e.g. benchmark results for SPEC for a
> particular optimization) is explained by a handful of common summary
> statistics, and so looking only at summary statistics can often
> conceal important insights into the data (or even be actively
> misleading). This is especially true when looking across different
> programs (I die a little inside every time I see someone cite a SPEC
> geomean).
>
> In compilers we are usually more interested in actually discovering
> *which parameters* are responsible for variation, rather than
> increasing confidence in the values of an a priori known set of
> parameters. E.g. if you are measuring the time it takes a laser beam
> to bounce off the moon and come back to you (in order to measure the
> distance of the moon) you have an a priori known set of parameters
> that well-characterize the data you obtain, based on your knowledge
> of the measurement apparatus, atmospheric dispersion, the fact that
> you know the moon is moving in an orbit, etc. You can perform
> repeated measurements with the apparatus to narrow in on the values
> of the parameters. In compilers, we rarely have such a simple and
> small set of parameters that are known to adequately characterize
> the data we are trying to understand; when investigating an
> optimization's results, we are almost always investigating a
> situation that would resemble (in the moon-bounce analogy) an
> unknown variation that turns out to be due to whether the lab
> assistant is leaning against the apparatus or not. You're not going
> to find out that the lab assistant's pose is at fault by looking at
> your "repeatedly do them to increase confidence in the values"
> measurements (e.g. the actual moon-bounce measurements; or looking
> at the average time for a particular SPEC benchmark to run); you
> find it by getting up and going to the room with the apparatus and
> investigating all manner of things until you narrow in on the lab
> assistant's pose (usually this takes the form of having to dig
> around in assembly, extract kernels, time particular sub-things,
> profile things, look at how the code changes throughout the
> optimization pipeline, instrument things, etc.; there are tons of
> places for the root cause to hide).
>
> If you have to remember something from that last paragraph, remember
> that not everything boils down to clicking "run" and getting a timing
> for SPEC. Often you need to take some time to narrow in on the damn lab
> assistant. Sometimes just the timing of a particular benchmark leads
> to a "lab assistant" situation (although hopefully this doesn't
> happen too often; it does happen though: e.g. I have been in a
> situation where a benchmark surprisingly goes 50% faster on about
> 1/10 runs). When working across different programs, you are almost
> always in a "lab assistant" situation.
>
> -- Sean Silva
>
> On Mon, Dec 15, 2014 at 11:27 AM, Daniel Stewart
> <stewartd at codeaurora.org> wrote:
>
> I have done some preliminary investigation into postponing some
> of the passes to see what the resulting performance impact would
> be. This is a fairly crude attempt at moving passes around to
> see if there is any potential benefit. I have attached the patch
> I used to do the tests, in case anyone is interested.
>
> Briefly, the patch allows two different flows, selected with
> either the -lto-new or the -lto-new2 flag. In the first case, the
> vectorization passes are postponed from the end of the
> populateModulePassManager() function to midway through
> addLTOOptimizationPasses(). In the second case, essentially the
> entire populateModulePassManager() pipeline is deferred until
> link time.
>
> I ran spec2000/2006 on an ARM platform (Nexus 4), comparing 4
> configurations (O3, O3 LTO, O3 LTO new, O3 LTO new 2). I have
> attached a PDF presenting the results from the test. The first 4
> columns have the spec result (ratio) for the 4 different
> configurations. The second set of columns gives each result
> divided by the maximum across the 4 configurations. I used this
> last one to see how well or poorly a particular configuration
> fared in comparison to the others.
>
> In general, there appears to be some benefit to be gained in a
> couple of the benchmarks (spec2000/art, spec2006/milc) by
> postponing vectorization.
>
> I just wanted to present this information to the community to
> see if there is interest in pursuing the idea of postponing passes.
>
> Daniel
>
> *From:* llvmdev-bounces at cs.uiuc.edu *On Behalf Of* Daniel Stewart
> *Sent:* Wednesday, September 17, 2014 9:46 AM
> *To:* llvmdev at cs.uiuc.edu
> *Subject:* [LLVMdev] Postponing more passes in LTO
>
> Looking at the existing flow of passes for LTO, it appears that
> almost all passes are run on a per-file basis, before the call to
> the gold linker. I’m looking to get people’s feedback on whether
> there would be an advantage to waiting to run a number of these
> passes until the linking stage. For example, I believe I saw a
> post a little while back about postponing vectorization until
> the linking stage. It seems to me that there could be an
> advantage to postponing (some) passes until the linking stage,
> where the entire graph is available. In general, what do people
> think about the idea of a different flow of LTO where more
> passes are postponed until the linking stage?
>
> Daniel Stewart
>
> --
>
> Qualcomm Innovation Center, Inc. is a member of Code Aurora
> Forum, hosted by The Linux Foundation