[LLVMdev] Postponing more passes in LTO

Thu Jan 8 06:51:42 PST 2015

In the context I think we’re most interested in, we’re trying to measure the quality of the code being generated.  As you mention here, it is very target-dependent and depends largely on instruction set features, cache sizes, and many other details of the implementation.

The only way I have been able to think about getting extremely repeatable results is to run the generated code on an emulator and get a synthetic performance measurement of some kind – the number of ticks on the emulator clock or something like that.  (I know with multiple threads that’s too simplistic, and it may not be the only measurement we’re interested in.)

This idea has obvious flaws – emulation is usually slow, and the accuracy of the emulation is critical if you want the numbers to mean something in the real world, and you’d have to have good emulators for all of the targets you really care about. I googled a bit and couldn’t find very much where people are using emulators for performance testing so maybe this idea just doesn’t work.

For one code generator I have worked on we ended up doing our own synthetic performance analysis based on analysis of the IR that we have at the backend, and then instrumenting our output so that at runtime we accumulate the “cost” of the code as it executes.  It has flaws but it’s repeatable and your test harness can say “this code looks worse – someone needs to look at it.”

Otherwise I agree that you need a good statistical model and a very controlled hardware/software environment to do meaningful performance testing.  It’s a hard problem.

From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Chandler Carruth
Sent: Thursday, January 08, 2015 3:55 AM
To: Sean Silva
Cc: llvmdev at cs.uiuc.edu
Subject: Re: [LLVMdev] Postponing more passes in LTO

On Thu, Dec 18, 2014 at 2:16 PM, Sean Silva <chisophugis at gmail.com<mailto:chisophugis at gmail.com>> wrote:
Okay, so I've cast doubt on the use of standard deviation (and actually an argument could be made against the mean too for similar reasons). On a more constructive note, what *is* a good small set of parameters to summarize measurements of program run time? One interesting model that was brought to my attention by Nick Lewycky is based on the following observation: most "perturbations" of the run time of a fixed program binary on a fixed chip and fixed OS are due to externalities that contend for (or otherwise deprive the program of) execution resources and therefore almost always make it *slower*.

The model is then that there is a "true" execution time for the program, which is a completely deterministic function of the program, chip, and OS; intuitively, this is time occurs when the program is e.g. able to execute by itself, on a single execution unit of the CPU, with no interrupts knocking it off CPU, and no other threads using any of the execution resources of this CPU. On top of this, we assume that there is a set of perturbations that contend for execution resources and slow the program down.

For this model, it makes sense to approximate the "true" value by the maximum of the recorded values. If we want to use two numbers to summarize the data, it probably makes sense to look at a way of characterizing the extent of the variation from this "true" value. In the absence of proper characterization of the perturbations (such as the distribution of when they occur and how much effect they have when they occur), one simple solution is the minimum. Max and min might be *too* sensitive to the perturbations (especially disk cache of the program itself on the first run), so a pair of percentiles might be a bit more reliable, such as 10th percentile and 90th percentile.

I used to subscribe to this thinking. I no longer do, and I think it has serious problems. Let me try to explain with a hopefully entertaining analogy. =D

Back in physics class we would analyze the behavior of basic principles of physics as they would act upon a brave cadre of scientifically savvy penguins sliding down slopes under the effect of gravity. Naturally, these slopes were very slick ice, and the physics problems would be solved "neglecting friction"[1]. That made the analysis much more tractable and predictable. It also makes it really inaccurate in the real world.

Now, I'm not actually deriding this approach to physics. The results from neglecting friction are *very* informative as to the basic principles at work. In a very similar way, I don't think the model of a "true" execution time is very informative, but much more about the principles at work than about the actual performance of the software.

The reality I have observed these days, especially in larger software systems, is that there is inherently no deterministic and representative execution to rely on for finding "true" execution time. And too many of the random fluctuations are actually by-products of the program itself. Some examples:

- If the program runs uninterrupted on a single core, it may cause that core to be throttled due to the thermals generated. This can have the effect of lowering the overall performance more than what you would see if the program ran *just* under that level without repeated drops in frequency.

- Different optimizations can change the thermal profile by using different execution units. This is especially true for vector units.

- If you fully leverage the AVX unit of some intel chips, my understanding is that regardless of the specific thermals, those units run at a lower clock!

- Naturally, things like ASLR or other sources of correct but non-deterministic layout of data in memory will cause fluctuations in the rate at which cache misses occur. This is probably the most common cause of non-external fluctuations in performance. Different generated code, register spill frequencies, etc., can dramatically change these cache collision rates.

Now, with all of these factors, we have to assume that in real world scenarios, the programs will not always run at the maximal speed. Instead, there is, much as you describe in your email, some distribution of performance that will be observed in practice. The interesting question is -- can generating different code from the compiler shift the distribution in a meaningful way that is not correlated with the fastest measurement (or the N fastest measurement). And I think the above examples are all places where this can happen. =/

As a consequence, I am increasingly in favor of using very simple but aggregated statistics for measuring performance and acknowledging the lack of precision when it is present. We can still very often measure precisely those localized improvements provided by a specific change even when very small. We just won't see a reflection from them in many cases when measuring large systems that have a high noise floor. Hopefully, we can at least observe good trends in the large system to confirm we are making progress, and at times we can make large improvements that actually rise above the noise floor.

Anyways, my somewhat rambling 2 cents on this. Sorry for the not entirely structured email. =]
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20150108/ef369b93/attachment.html>