[LLVMdev] Why is the default LNT aggregation function min instead of mean

Thu Jan 16 23:58:38 PST 2014

If you have a timer granularity of 0.004s and you want to identify small (1%) changes, you’ll probably need benchmarks that run for at least 0.8s.
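
As a rough sketch of that arithmetic (Python, not part of LNT): the function name and the factor of two ticks per detectable change are assumptions, chosen only because they reproduce the 0.8s figure. The message does not say how the figure was derived.

```python
def min_benchmark_seconds(tick_s, rel_change, ticks_per_change=2):
    """Shortest runtime at which a relative change of `rel_change`
    spans `ticks_per_change` timer ticks of length `tick_s` seconds.

    ticks_per_change=2 is an assumption, not something stated in the thread.
    """
    return ticks_per_change * tick_s / rel_change


if __name__ == "__main__":
    # 0.004s granularity, 1% change: prints a value of about 0.8 seconds
    print(min_benchmark_seconds(0.004, 0.01))
```

With ticks_per_change=1 the same formula gives about 0.4s, which would be the bare minimum for a 1% change to move the reading by a single tick.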

On Jan 16, 2014, at 6:21 PM, Chandler Carruth <chandlerc at google.com> wrote:

> 
> On Thu, Jan 16, 2014 at 6:09 PM, David Blaikie <dblaikie at gmail.com> wrote:
> On Thu, Jan 16, 2014 at 5:32 PM, Tobias Grosser <tobias at grosser.es> wrote:
> On 01/17/2014 02:17 AM, David Blaikie wrote:
> Right - you usually won't see a normal distribution in the noise of test
> results. You'll see results clustered around the lower bound with a long
> tail of slower and slower results. Depending on how many samples you do it
> might be appropriate to take the mean of the best 3, for example - but the
> general approach of taking the fastest N does have some basis in any case.
> 
> Not necessarily the right answer, the only right answer, etc.
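
To make the candidate aggregation functions concrete, here is a small Python sketch comparing min, median, mean, and "mean of the best N". The function name `aggregate` and the sample values are hypothetical; this is not LNT's implementation.

```python
import statistics


def aggregate(samples, how="min", n=3):
    """Reduce repeated timings of one test to a single number."""
    ordered = sorted(samples)
    if how == "min":
        return ordered[0]
    if how == "median":
        return statistics.median(ordered)
    if how == "mean":
        return statistics.mean(ordered)
    if how == "best_n":
        # mean of the n fastest runs
        return statistics.mean(ordered[:n])
    raise ValueError("unknown aggregation: " + how)


# Hypothetical timings in seconds: clustered near the lower bound,
# with a long tail of slower runs.
samples = [1.00, 1.01, 1.00, 1.02, 1.35, 1.01, 1.80]

if __name__ == "__main__":
    for how in ("min", "best_n", "median", "mean"):
        print(how, aggregate(samples, how))
```

On data shaped like this the mean is pulled up by the slow tail, while min, best-N, and median stay close to the cluster.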
> 
> Interesting. In fact I had the very same thoughts at the beginning.
> 
> However, when looking at my test results the common pattern looks like this example:
> 
> http://llvm.org/perf/db_default/v4/nts/graph?show_all_points=yes&moving_window_size=10&plot.0=34.95.3&submit=Update
> 
> The run-time of a test case is very consistently one of several fixed values. The distribution of the different times is very consistent and seems to form, in fact, something like a normal distribution (more in the center, less at the border).
> 
> The explanation I have here is that the machine by itself is in fact not very noisy. Instead, changes in the execution context (e.g. due to allocation of memory at a different location) influence the performance. If we, by luck, have a run where all 'choices' have been optimal, we get the minimal run time. However, in the case of several independent factors, it is more likely that we get a non-optimal configuration that yields a value in the middle. Consequently, the minimum seems to be a non-optimal choice here.
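
The "several independent factors" explanation can be illustrated with a small simulation. This is only a sketch under assumed numbers: six independent binary factors, each going the slow way with probability 0.5 and costing a fixed 4ms penalty. None of these values comes from the actual test data.

```python
import random
from collections import Counter


def simulate(runs=10000, factors=6, p_bad=0.5, seed=0):
    """Count how many runs had k 'bad' choices, for k = 0..factors.

    Each factor is an independent coin flip (an assumption for illustration).
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        bad = sum(1 for _ in range(factors) if rng.random() < p_bad)
        counts[bad] += 1
    return counts


if __name__ == "__main__":
    base_ms, penalty_ms = 1000, 4  # hypothetical base time and per-factor cost
    counts = simulate()
    for bad in sorted(counts):
        print(base_ms + bad * penalty_ms, "ms:", counts[bad], "runs")
```

Under these assumptions the run times fall on a handful of fixed values, most runs land in the middle, and the all-optimal minimum is rare.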
> 
> I understand that there may be some 'real' noise values, but as the median does not seem to be affected very much by 'extremal' values, I have the feeling it should be reasonably robust to such noise.
> 
> Have you seen examples where the median value gives a wrong impression
> regarding performance?
> 
> I have - and I've also seen the kind of results you're seeing too. One of the issues here is the quantization of results due to very short tests and not very granular timing. This is perhaps the only reason your results even /have/ a median. With finer-grained timing and longer tests I expect you'd see fewer results with exactly the same time. Yes, you might be in a situation where the exact runtimes repeat because very short tests are wholly scheduled in one way or another, but in that case you'll get wide, solid swings depending on that scheduling behavior, which is also unhelpful.
> 
> It's one of the reasons I gave up on trying to do timing on Linux - I couldn't get a machine quiet enough to look real. Though in the long run I still did tend to get results for many tests that were clustered around a minimum with outliers going upwards...
> 
> I'm perhaps rambling a bit here, and I'm by no means an authority on this subject (I tried and failed - gave up & worked on other things instead), but so long as the data is that noisy and quantized, I'm not sure how useful it'll be, nor whether it's the best data on which to figure out data processing. Maybe I'm wrong; perhaps this is as good as that data can get and we do need an answer to how to handle it.
> 
> To jump into this thread midway, I just wanted to point out that this kind of step-function in timings is almost *always* a sign of an extremely coarse timer. If we can't do better than .4ms (guessing from the graph) of resolution in the timer, we're not going to be able to reasonably measure changes in the CPU-execution times of these tests.
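
A minimal sketch of how a coarse timer produces that step pattern. It assumes the timer simply truncates to a multiple of its tick; the run times are made up, and the 4000us tick is the 0.004s figure from the top of this message, used only for illustration.

```python
import random


def quantize_us(true_us, tick_us):
    """Reading from a timer that truncates to a multiple of tick_us."""
    return (true_us // tick_us) * tick_us


rng = random.Random(1)
# Hypothetical 'true' run times in microseconds: 50ms +/- 3ms of jitter.
true_runs = [50_000 + rng.randint(-3_000, 3_000) for _ in range(1000)]

coarse = {quantize_us(t, 4_000) for t in true_runs}  # 4ms tick
fine = {quantize_us(t, 1) for t in true_runs}        # 1us tick

if __name__ == "__main__":
    print("distinct readings with a 4ms tick:", sorted(coarse))
    print("distinct readings with a 1us tick:", len(fine))
```

With the coarse tick, a thousand different run times collapse onto at most three readings, so any change smaller than a tick is invisible unless it happens to cross a tick boundary.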
> 
> I would really like to see us move toward counting cycles of an un-throttled processor, and then normalizing that into seconds. If we can't get very accurate (tick granularity) timings, I don't think we can draw reasonable conclusions without *very* long test runs.
> 
> I've long wanted the LNT test suite to run (on Linux at least) under 'perf stat' or some other external measurement tool that has fine-grained and accurate timing information available, in addition to cycle counts and other things that are even more resilient to context switches, CPU migrations, etc.
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
