[LLVMdev] RFC:LNT Improvements

Tobias Grosser tobias at grosser.es
Tue Apr 29 23:50:07 PDT 2014


Hi Yi Kong,

thanks for working on this. I think there is a lot we can improve here. 
I copied Mingxing Tan, who has worked on a couple of patches in this 
area before, and Chris, who is maintaining LNT.

On 30/04/2014 00:49, Yi Kong wrote:
> Dear all,
>
> Following the Benchmarking BOF from 2013 US dev meeting, I’d like to propose some improvements to the LNT performance tracking software.
>
> The most significant issue with the current implementation is that the report is filled with extremely noisy values. Hence it is hard to notice performance improvements or regressions.

Right.

> After investigating LNT and the LLVM test suite, I propose the following methods. I've also attached prototype patches for each method.
> - Increase the execution time of the benchmarks so they run long enough to avoid noisy results
>          Currently there are two options to run benchmarks, namely small and large problem size. I propose adding a third option: adaptive. In adaptive mode, benchmarks scale the problem size according to a pre-measured system performance value so that the running time is kept at around 10 seconds, the sweet spot between time and accuracy. The downside is that correctness for some benchmarks cannot be measured. The solution is to measure correctness on a separate board with the small problem size.
>          LNT: [PATCH 2/3] Add options to run test-suite in adaptive mode
>          Test suite: [PATCH 1/2] Add support for adaptive problem size
>                          [PATCH 2/2] A subset of test suite programs modified for adaptive

I think it will be easier to review such patches one by one on the 
commit mailing lists, especially as this one is a little larger.

In general, I see such changes as a second step. First, we want to have 
a system in place that allows us to reliably detect whether a benchmark 
is noisy; second, we want to increase the number of benchmarks that are 
not noisy and whose results we can use.
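
Just to check that I understand the adaptive mode correctly: the core 
of the scaling would be something like the sketch below. All names and 
the linear runtime model are my guesses, not code from your patches.

  # Hypothetical sketch of adaptive problem-size scaling.
  TARGET_SECONDS = 10.0  # the proposed sweet spot between time/accuracy

  def adaptive_problem_size(base_size, base_runtime,
                            ref_perf, machine_perf):
      """Scale the problem size so the benchmark runs ~TARGET_SECONDS.

      base_size:    problem size measured on the reference machine
      base_runtime: runtime (seconds) of base_size on that machine
      ref_perf:     performance score of the reference machine
      machine_perf: pre-measured performance score of this machine
      """
      # Assume runtime scales inversely with the performance score ...
      est_runtime = base_runtime * ref_perf / machine_perf
      # ... and roughly linearly with the problem size.
      return max(1, int(base_size * TARGET_SECONDS / est_runtime))

A linear runtime model is of course wrong for many benchmarks, which is 
presumably why each test-suite program needs individual modifications.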

> - Show and graph total compile time
>          There is no obvious way to scale up the compile time of individual benchmarks, so the total time is the best thing we can do to minimize error.
>          LNT: [PATCH 1/3] Add Total to run view and graph plot

I did not see the effect of these changes in your images, and I 
honestly do not fully understand what you are doing. What is the total 
compile time? Don't we already show the compile time in the run view? 
How is the total time different from this compile time?

Maybe you can answer this in a separate patch email.

> - Only show performance changes with high confidence in the summary report
>          To investigate the correlation between a program's run time and its variance, I ran Dhrystone at different problem sizes multiple times. The results show that some fluctuation is expected and that shorter tests have much greater variance. By modelling the run time as normally distributed, we can calculate the minimal difference required for statistical significance. Using this knowledge, we can hide results with a low confidence level from the summary report. They are still available, marked in colour, in the detailed report for those interested.
>          LNT: [PATCH 3/3] Ignore tests with very short run time

I think this is the most important point, and the one we should address 
first. In fact, I would prefer to go even further and actually compute 
the confidence, and to make the confidence we require an option. This 
would allow us to understand both how stable/noisy a machine is and how 
well the other changes you propose work in practice.
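
To make this concrete, the minimal significant difference under the 
normal-distribution model could be computed roughly as below. This is 
my reading of the proposal, not code from the patch, and the function 
name is made up.

  # Sketch: smallest difference of means that is significant at the
  # requested confidence level, assuming normally distributed run
  # times (two-sample z-test).
  import math
  from scipy.stats import norm

  def min_significant_diff(std_a, n_a, std_b, n_b, confidence=0.95):
      z = norm.ppf(1.0 - (1.0 - confidence) / 2.0)  # two-sided quantile
      return z * math.sqrt(std_a**2 / n_a + std_b**2 / n_b)

  # A result would then be hidden from the summary report whenever
  # abs(mean_a - mean_b) is below this threshold.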

We had a longer discussion here on llvmdev named 'Questions about 
results reliability in LNT infrastructure'. Anton suggested doing the 
following:

1. Get 5-10 samples per run
2. Do the Wilcoxon/Mann-Whitney test

I already set up -O3 buildbots that provide 10 runs per commit, and the 
noise for them is very low:

http://llvm.org/perf/db_default/v4/nts/25151?num_comparison_runs=10&test_filter=&test_min_value_filter=&aggregation_fn=median&compare_to=25149&submit=Update

If you are interested in performance data to test your changes, you can 
extract the results from the LLVM buildmaster at:

http://lab.llvm.org:8011/builders/polly-perf-O3/builds/2942/steps/lnt.nightly-test/logs/report.json

with 2942 being one of the latest successful builds. By going backwards
or forwards you should get other builds if they have been successful.
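
Pulling the samples out of such a report should be a few lines of 
Python. The sketch below assumes the usual LNT submission layout, i.e. 
a "Tests" list whose entries carry "Name" and "Data" (the samples); 
adjust the keys if the report looks different.

  import json
  import urllib.request  # urllib2 on Python 2

  URL = ("http://lab.llvm.org:8011/builders/polly-perf-O3/builds/2942/"
         "steps/lnt.nightly-test/logs/report.json")

  def load_samples(url=URL):
      """Map each test name to its list of samples."""
      report = json.load(urllib.request.urlopen(url))
      return {test["Name"]: test["Data"] for test in report["Tests"]}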

There should be a standard function for the Wilcoxon/Mann-Whitney test 
in scipy, so in case you are interested, adding these reliability 
numbers as a first step seems to be a simple and purely beneficial 
commit.
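
Indeed, scipy.stats.mannwhitneyu looks like the right function (there 
is also scipy.stats.wilcoxon for the paired variant). A reliability 
check for a single test could then be as small as this sketch:

  from scipy.stats import mannwhitneyu

  def significant_change(samples_before, samples_after, alpha=0.05):
      """True if the two sample sets differ at significance level
      alpha, i.e. the difference is unlikely to be pure noise."""
      _, p = mannwhitneyu(samples_before, samples_after,
                          alternative='two-sided')
      return p < alpha

With the 10 samples per run from the buildbots above, such a test 
should already have reasonable power to separate real changes from 
noise.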

> - Make sure the board has low background noise
>          Perform a system performance benchmark before each run and compare the value with the reference (obtained during machine set-up). If the percentage difference is too large, abort or defer the run. In the prototype this feature is implemented as a Bash script and is not integrated into LNT. I will rewrite it in Python.
>          LNT: benchmark.sh

I am a little sceptical about this. Machines should generally not be 
noisy. However, if for some reason there is noise on the machine, it is 
as likely to appear during this pre-run noise test as during the actual 
benchmark runs, maybe during both, but maybe also only during the 
benchmark. So I am afraid we might often run into the situation where 
this check says OK but the later benchmark run still suffers from noise.

I would probably prefer to first make the reliability reporting from 
the previous point work well; then we can see for each test/benchmark 
whether noise was involved or not.
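
For what it is worth, if the pre-run check stays around, the Python 
rewrite could follow the rough shape below. The calibration loop, the 
names and the threshold are all made up for illustration.

  import time

  REFERENCE_SCORE = 1234.5  # measured once during machine set-up
  MAX_DEVIATION = 0.05      # defer the run if we deviate more than 5%

  def calibration_score(iterations=10**7):
      """Trivial CPU-bound loop standing in for the real
      system-performance benchmark."""
      start = time.time()
      x = 0
      for i in range(iterations):
          x += i * i
      return iterations / (time.time() - start)

  def machine_is_quiet():
      score = calibration_score()
      return (abs(score - REFERENCE_SCORE) / REFERENCE_SCORE
              <= MAX_DEVIATION)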

All the best,
Tobias


