[LLVMdev] Postponing more passes in LTO
Sean Silva
chisophugis at gmail.com
Fri Dec 19 14:03:08 PST 2014
On Thu, Dec 18, 2014 at 2:16 PM, Sean Silva <chisophugis at gmail.com> wrote:
> Also, a couple notes about the use of standard deviation (and general
> stuff about benchmarking):
>
> 1. It is *vital* to keep in mind the distinction between a "sample
> statistic" and a "population statistic". A sample statistic is a statistic
> derived from the dataset of measured values, while a population statistic
> is a mathematical function of the underlying probability distribution of
> the phenomenon (some abstract "population" of possibilities) being sampled.
>
> 2. As you take more measurements, the standard deviation sample statistic
> converges to the standard deviation population statistic regardless of the
> underlying population distribution.
>
> 3. *Every* probability distribution has a defined standard deviation (it
> is just a mathematical function of the distribution). For a Gaussian
> distribution, the standard deviation (in conjunction with the mean)
> completely describes the probability distribution. This is emphatically
> *not* the case for all probability distributions [1]. In order to derive
> insight from a standard deviation sample statistic, you *must* a priori
> have a good reason to believe that the underlying probability
> distribution's standard deviation provides useful information about the
> actual distribution.
>
> 4. Any probability distribution can be made to have any standard deviation
> without changing its distribution shape (technically, at most a linear
> transformation of the coordinate is needed). To pick a random example, for
> a highly bimodal distribution the standard deviation doesn't really give
> you any useful information about the overall shape (and it doesn't exactly
> correspond between the spacing between the modes either).
>
> 5. The standard deviation is nothing more than a root-mean-square average.
> Basically, it is a very popular statistic because there is an *extremely
> powerful* way to model phenomena for which the root-mean-square average is
> the actual meaningful value that ties into the model. The theory I'm
> talking about is used extensively in almost every science and engineering
> discipline *outside of* software engineering (and the purely digital part
> of hardware) [2]. Since this modeling technique is rarely directly relevant
> in software, there is little reason to assume that the standard deviation
> is a good statistic (there may be a good reason to, but I haven't seen one).
>
> Okay, so I've cast doubt on the use of standard deviation (and actually an
> argument could be made against the mean too for similar reasons). On a more
> constructive note, what *is* a good small set of parameters to summarize
> measurements of program run time? One interesting model that was brought to
> my attention by Nick Lewycky is based on the following observation: most
> "perturbations" of the run time of a fixed program binary on a fixed chip
> and fixed OS are due to externalities that contend for (or otherwise
> deprive the program of) execution resources and therefore almost always
> make it *slower*.
>
> The model is then that there is a "true" execution time for the program,
> which is a completely deterministic function of the program, chip, and OS;
> intuitively, this is time occurs when the program is e.g. able to execute
> by itself, on a single execution unit of the CPU, with no interrupts
> knocking it off CPU, and no other threads using any of the execution
> resources of this CPU. On top of this, we assume that there is a set of
> perturbations that contend for execution resources and slow the program
> down.
>
> For this model, it makes sense to approximate the "true" value by the
> maximum of the recorded values.
>
D'oh! Here I meant "maximum performance", which would be the *minimum* time!
-- Sean Silva
> If we want to use two numbers to summarize the data, it probably makes
> sense to look at a way of characterizing the extent of the variation from
> this "true" value. In the absence of proper characterization of the
> perturbations (such as the distribution of when they occur and how much
> effect they have when they occur), one simple solution is the minimum. Max
> and min might be *too* sensitive to the perturbations (especially disk
> cache of the program itself on the first run), so a pair of percentiles
> might be a bit more reliable, such as 10th percentile and 90th percentile.
>
> Next time I'm measuring program run time, I'm going to try out this model.
> It should be possible to look at a higher-degree-of-freedom representation
> of the dataset, like a histogram of the distribution, in order to evaluate
> how well this two-parameter characterization of the data works compared
> with the typical mean and stddev. Also, ultimately we only need as much
> precision here as is necessary for what we are doing with the summarized
> data, which typically is to alert us of a significant change. On a
> sufficiently quiet machine, the variance of the measurements might be
> significantly smaller than the thresholds that we consider "significant" so
> that the mean, median, max, min, 10%-ile etc. are all so close that it does
> not matter which we use, and we can summarize the data with just a single
> number, which can be any one of them (and just trigger an error if total
> variation exceeds a set amount).
>
> However, the noisier the machine (and hence our data), the more important
> is is to properly model and analyze things to avoid coming to a faulty
> conclusion (note: reliable conclusions can be made from data with large
> amounts of noise as long as there is a good model which allows isolating
> the data we are interested in from the noise [3]).
>
> Nick, I forget from when we talked, but did you guys ever settle on a
> model like this for your internal benchmarking?
>
>
> [1] If you are familiar with Fourier theory, using only two statistics to
> describe a probability distribution is sort of like using only two Fourier
> components to describe a function. In order to conclude *anything* you must
> a priori know that the other components are not important!
>
> [2] The modeling technique I'm talking about is decomposition of
> square-integrable function spaces into an orthonormal basis (this is by no
> means the most general way of describing the idea). This is a far-reaching
> concept and pops up under many names. Things like "frequency domain",
> "Fourier transform", "Laplace transform", and "spectrum" are among the most
> elementary. More advanced ones are "wavelets", "Hilbert space",
> "eigenfunctions of a linear operator". A smattering of use cases:
> determining the shape of the electron orbitals of hydrogen, designing the
> fuel control system of a jet engine to make sure the right amount of fuel
> is being provided at any given moment, designing the circuits sitting at
> the end of a communication line (or a cell phone antenna) that have to
> detect incoming signals with the least chance of error, analyzing the
> structure of molecules based on X-ray diffraction, determining the chemical
> composition of the atmosphere of a planet orbiting another star.
>
> [3] For example if you know that there is Gaussian noise (even a very
> large amount) on top of an underlying value, then the underlying value is
> just the mean of the population (which will be a Gaussian distribution) and
> can be reliably determined from the mean sample statistic.
>
>
> On Thu, Dec 18, 2014 at 4:10 AM, Sean Silva <chisophugis at gmail.com> wrote:
>>
>> In the future could you please do some sort of visualization of your
>> data, or at least provide the raw data in a machine-readable format so that
>> others can do so?
>>
>> It is incredibly easy to come to incorrect conclusions when looking at
>> lists of numbers because at any given moment you have a very localized view
>> of the dataset and are prone to locally pattern-match and form a selection
>> bias that corrupts your ability to make a proper decision in the context of
>> the entire dataset. Even if you go on to look at the rest of the data, this
>> selection bias limits your ability to come to a correct "global" conclusion.
>>
>> Appropriate reliable summary statistics can also help, but are not
>> panacea. In using, say, 2 summary statistics (e.g. mean and standard
>> deviation), one is discarding a large number of degrees of freedom from the
>> dataset. This is fine if you have good reason to believe that these 2
>> degrees of freedom adequately explain the underlying dataset (e.g. there is
>> a sound theoretical description of the phenomenon being measured that
>> suggests it should follow a Gaussian distribution; hence mean and stddev
>> completely characterize the distribution). However, in the world of
>> programs and compiler optimizations, there is very rarely a good reason to
>> believe that any particular dataset (e.g. benchmark results for SPEC for a
>> particular optimization) is explained by a handful of common summary
>> statistics, and so looking only at summary statistics can often conceal
>> important insights into the data (or even be actively misleading). This is
>> especially true when looking across different programs (I die a little
>> inside every time I see someone cite a SPEC geomean).
>>
>> In compilers we are usually more interested in actually discovering
>> *which parameters* are responsible for variation, rather than increasing
>> confidence in the values of an a priori set of known parameters. E.g. if
>> you are measuring the time it takes a laser beam to bounce off the moon and
>> come back to you (in order to measure the distance of the moon) you have an
>> a priori known set of parameters that well-characterize the data you
>> obtain, based on your knowledge of the measurement apparatus, atmospheric
>> dispersion, the fact that you know the moon is moving in an orbit, etc. You
>> can perform repeated measurements with the apparatus to narrow in on the
>> values of the parameters. In compilers, we rarely have such a simple and
>> small set of parameters that are known to adequately characterize the data
>> we are trying to understand; when investigating an optimization's results,
>> we are almost always investigating a situation that would resemble (in the
>> moon-bounce analogy) an unknown variation that turns out to be due to
>> whether the lab assistant is leaning against the apparatus or not. You're
>> not going to find out that the lab assistant's pose is at fault by looking
>> at your "repeatedly do them to increase confidence in the values"
>> measurements (e.g. the actual moon-bounce measurements; or looking at the
>> average time for a particular SPEC benchmark to run); you find it by
>> getting up and going to the room with the apparatus and investigating all
>> manner of things until you narrow in on the lab assistant's pose (usually
>> this takes the form of having to dig around in assembly, extract kernels,
>> time particular sub-things, profile things, look at what how the code
>> changes throughout the optimization pipeline, instrument things, etc.;
>> there are tons of places for the root cause to hide).
>>
>> If you have to remember something from that last paragraph, remember that
>> not everything boils down to click "run" and get a timing for SPEC. Often
>> you need to take some time to narrow in on the damn lab assistant.
>> Sometimes just the timing of a particular benchmark leads to a "lab
>> assistant" situation (although hopefully this doesn't happen too often; it
>> does happen though: e.g. I have been in a situation where a benchmark
>> surprisingly goes 50% faster on about 1/10 runs). When working across
>> different programs, you are almost always in a "lab assistant" situation.
>>
>> -- Sean Silva
>>
>> On Mon, Dec 15, 2014 at 11:27 AM, Daniel Stewart <stewartd at codeaurora.org
>> > wrote:
>>
>>> I have done some preliminary investigation into postponing some of the
>>> passes to see what the resulting performance impact would be. This is a
>>> fairly crude attempt at moving passes around to see if there is any
>>> potential benefit. I have attached the patch I used to do the tests, in
>>> case anyone is interested.
>>>
>>>
>>>
>>> Briefly, the patch allows two different flows, with either a flag of
>>> –lto-new or –lto-new2. In the first case, the vectorization passes are
>>> postponed from the end of populateModulePassManager() function to midway
>>> through the addLTOOptimizationPasses(). In the second case, essentially the
>>> entire populateModulePassManager() function is deferred until link time.
>>>
>>>
>>>
>>> I ran spec2000/2006 on an ARM platform (Nexus 4), comparing 4
>>> configurations (O3, O3 LTO, O3 LTO new, O3 LTO new 2). I have attached a
>>> PDF presenting the results from the test. The first 4 columns have the spec
>>> result (ratio) for the 4 different configurations. The second set of
>>> columns are the respective test / max(result of 4 configurations). I used
>>> this last one to see how well/poor a particular configuration was in
>>> comparison to other configurations.
>>>
>>>
>>>
>>> In general, there appears to be some benefit to be gained in a couple of
>>> the benchmarks (spec2000/art, spec2006/milc) by postponing vectorization.
>>>
>>>
>>>
>>> I just wanted to present this information to the community to see if
>>> there is interest in pursuing the idea of postponing passes.
>>>
>>>
>>>
>>> Daniel
>>>
>>>
>>>
>>> *From:* llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu]
>>> *On Behalf Of *Daniel Stewart
>>> *Sent:* Wednesday, September 17, 2014 9:46 AM
>>> *To:* llvmdev at cs.uiuc.edu
>>> *Subject:* [LLVMdev] Postponing more passes in LTO
>>>
>>>
>>>
>>> Looking at the existing flow of passes for LTO, it appears that most all
>>> passes are run on a per file basis, before the call to the gold linker. I’m
>>> looking to get people’s feedback on whether there would be an advantage to
>>> waiting to run a number of these passes until the linking stage. For
>>> example, I believe I saw a post a little while back about postponing
>>> vectorization until the linking stage. It seems to me that there could be
>>> an advantage to postponing (some) passes until the linking stage, where the
>>> entire graph is available. In general, what do people think about the idea
>>> of a different flow of LTO where more passes are postponed until the
>>> linking stage?
>>>
>>>
>>>
>>> Daniel Stewart
>>>
>>>
>>>
>>> --
>>>
>>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>>> hosted by The Linux Foundation
>>>
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20141219/f8fb021d/attachment.html>
More information about the llvm-dev
mailing list