[PATCH] [LNT] Use Mann-Whitney U test to identify changes

Tobias Grosser tobias at grosser.es
Fri May 2 06:36:11 PDT 2014


On 02/05/2014 14:59, Yi Kong wrote:
> On 02/05/14 11:02, Tobias Grosser wrote:
>> On 02/05/2014 11:16, Yi Kong wrote:
>>> Hi Tobias,
>>>
>>> On 02/05/14 08:38, Tobias Grosser wrote:
>>>> On 01/05/2014 23:27, Yi Kong wrote:
>>>>> This patch adds Mann-Whitney U tests to identify changes, as
>>>>> suggested by Tobias and Anton. The user is able to configure the
>>>>> desired confidence level.
>>>>
>>>> Hi Yi Kong,
>>>>
>>>> thanks for this nice patch. I looked into it briefly by setting up an
>>>> LNT server and adding a couple of the -O3 nightly test results to it.
>>>> It seems that, at least with the default 0.5 confidence level, this
>>>> does not reduce the noise at all. Just switching the aggregation
>>>> function from minimum to mean helps a lot more here (any idea why?).
>>>
>>> The median is far less affected by variance than the minimum. The
>>> minimum may even be an outlier.
>>
>> I previously suggested changing this, but Chris had concerns. If you
>> also believe it would be good to show the median by default, maybe it
>> is worth resubmitting a patch?
>
> I'm not sure what concern Chris has.

There is an older thread, "Why is the default LNT aggregation function 
min instead of mean", where Chris and David explain why they chose 'min'.

I just looked into this again using your new patch, and it seems that 
switching from min to mean very reliably eliminates all runs with p <= 
0.99 from the Run-over-* changes.
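
Just to make the min/mean/median point concrete, here is a tiny sketch 
(not LNT code; the timing numbers are made up) of how the three 
aggregation functions react when a single sample in a run is noisy:

    # A small illustration (not LNT code): how min, mean and median react
    # when one of several otherwise stable timing samples is noisy.
    import statistics

    stable = [10.0, 10.1, 9.9, 10.0, 10.1]
    noisy = [10.0, 10.1, 9.9, 10.0, 7.5]   # one unusually fast sample

    for name, samples in [("stable", stable), ("noisy", noisy)]:
        print("%-6s min=%.2f mean=%.2f median=%.2f"
              % (name, min(samples), statistics.mean(samples),
                 statistics.median(samples)))

    # The min jumps from 9.90 to 7.50, the mean moves from 10.02 to 9.50,
    # and the median stays at 10.00, so comparisons based on min are by far
    # the most sensitive to a single outlier.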

>>>> Did you play with the confidence level and get an idea which level
>>>> would be useful? My very brief experiments showed that a value of
>>>> 0.999 or even 0.9999 is something that gets us below the noise level.
>>>> I verified this by looking at subsequent runs where the commits
>>>> themselves really were just documentation commits. Those commits
>>>> should not cause any real performance changes. Even with those high
>>>> confidence requirements, certain performance regressions such as
>>>> r205965 can still be spotted. For me, this is already useful, as we
>>>> can ask for extremely low-noise answers, which will help us at least
>>>> catch the very clear performance regressions. (Compared to today,
>>>> where even those are hidden in the reporting noise.)
>>>
>>> I've been experimenting with the same dataset as yours. It seems 0.9
>>> eliminates some of the noise, but not well enough. 0.999 produces very
>>> nice results, but:
>>>   >>> scipy.stats.mannwhitneyu([1,1,1,1,1],[2,2,2,2,1])
>>>   (2.5, 0.0099822266526080217)
>>> That's only 0.99! Anything greater than 0.9 will cause too many false
>>> negatives.
>>
>> We are running 10 runs, no?
>>
>>   >>> scipy.stats.mannwhitneyu(
>>   ...     [1,1,1,1,1,1,1,1,1,1], [2,2,2,2,2,2,2,2,2,1])
>>   (5.0, 4.8302595041055999e-05)
>>
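
To get a feel for how the achievable p-value depends on the number of 
samples, one can run something like the following (just a sketch; 
depending on the scipy version, mannwhitneyu reports a one-sided or a 
two-sided p-value by default, so the exact numbers will differ from the 
ones quoted above):

    # Sketch: how the smallest p-value mannwhitneyu can report shrinks as
    # the number of samples per run grows. Depending on the scipy version
    # the default p-value is one-sided or two-sided, so the exact numbers
    # will not match the ones quoted above.
    from scipy.stats import mannwhitneyu

    for n in (3, 5, 10, 20):
        before = [1.0] * n                  # stable "before" samples
        after = [2.0] * (n - 1) + [1.0]     # clearly slower, one old-looking sample
        u, p = mannwhitneyu(before, after)
        print("n=%2d  U=%5.1f  p=%.2e  confidence=%.5f" % (n, u, p, 1.0 - p))

    # With only a handful of samples per run the test cannot produce a very
    # small p-value, so a 0.999 confidence requirement would suppress almost
    # every real change; with 10 or more samples it becomes reachable.
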
>>>> I would like to play with this a little bit more. Do you think it is
>>>> possible to print the p value in the Run-Over-Run Changes Details?
>>>
>>> That shouldn't be too difficult. I could implement it if you want.
>>
>> Cool.
>
> Patches are attached. I've also adjusted the default p to a more
> sensible value (0.9).

Very nice. I think they are already very useful by themselves and should 
help us get a better idea of how reliable our results are.
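
For reference, the kind of check such a patch boils down to is roughly 
the following (this is not the actual LNT code; the function and 
parameter names here are made up):

    # Rough sketch, not the actual LNT implementation.
    from scipy.stats import mannwhitneyu

    def is_significant_change(prev_samples, cur_samples, confidence_level=0.9):
        """Flag a change only when the Mann-Whitney U test is confident
        enough, and return the p-value so it can be shown in the report."""
        u, p = mannwhitneyu(prev_samples, cur_samples)
        return p < (1.0 - confidence_level), p

    changed, p = is_significant_change([10.0, 10.1, 9.9, 10.0, 10.1],
                                       [11.2, 11.3, 11.1, 11.2, 11.4])
    print("significant:", changed, "p-value:", p)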

>>> Alternatively you can just print them from the console, which only
>>> takes a one-line change.
>>
>> I think having them in the webinterface is more convenient. Especially
>> if we want to look at only the changes that are reported as changing
>> performance.
>>
>>>> Also, it may make sense to investigate this on another machine. I use
>>>> 5 identical but different machines. It may be interesting to see if
>>>> runs on the same machine are more reliable and could get away with a
>>>> lower confidence interval. Did you do any experiments? Maybe with a
>>>> higher number of runs, say 20?
>>>
>>> For now I don't have another machine to test on, my desktop is far too
>>> noisy. I've been trying to set up an ARM board.
>>
>> I have one machine identical to the automatic builders, but not
>> connected to any builder. If you send me your SSH public key, I can
>> give you access. We could use this to experiment with a higher number
>> of runs, binding benchmarks to fixed CPUs, using 'perf stat' to count
>> performance events, ...
>>
>> I am very interested in making these results as robust as possible. If
>> necessary, I would rather choose a smaller set of benchmarks and run
>> them more often, rather than having to deal with this ongoing noise.
>>
>
> I think hand-picking a smaller set of benchmarks is the best idea, since
> most of the tests we are running are not useful as compiler benchmarks,
> and it would make running several iterations possible on ARM boards. As
> for 'perf stat', I'm concerned about the portability of LNT, as it is
> exclusive to Linux.
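
For the record, the kind of experiment I had in mind is roughly the 
following (Linux-only, which is exactly the portability concern; the CPU 
number and the benchmark path are placeholders):

    # Rough sketch: pin a benchmark to a fixed CPU and let 'perf stat'
    # repeat it and report counter statistics.
    import subprocess

    cmd = ["taskset", "-c", "2",          # bind the run to CPU 2
           "perf", "stat", "-r", "10",    # repeat 10 times, report mean/stddev
           "./some-benchmark"]            # placeholder benchmark binary
    subprocess.run(cmd, check=True)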

I think several people are interested in such a list. For systems where 
we generally do not have enough compute power, just picking some obvious 
benchmarks is probably the easiest. In general, I would be OK with 
reducing the list of benchmarks we run, but I would feel a lot more 
confident if we based this decision on some actual reasoning rather than 
only on how well known certain benchmarks are. Maybe the reliability 
numbers we get here could help.
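
To make that a bit more concrete, such a ranking could be as simple as 
the following sketch (benchmark names and numbers are made up):

    # Hypothetical sketch: rank benchmarks by how noisy their timings are
    # across repeated runs of the same revision.
    import statistics

    samples_per_benchmark = {
        "bench_a": [10.0, 10.1, 10.0, 9.9, 10.1],
        "bench_b": [4.0, 5.2, 3.6, 4.9, 4.3],
    }

    def relative_spread(samples):
        # standard deviation as a fraction of the mean
        return statistics.pstdev(samples) / statistics.mean(samples)

    for name, samples in sorted(samples_per_benchmark.items(),
                                key=lambda item: relative_spread(item[1])):
        print("%-8s relative spread %4.1f%%"
              % (name, 100 * relative_spread(samples)))

    # Benchmarks at the bottom of the list are too noisy to give a reliable
    # signal and would be candidates for dropping or for more repetitions.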

Cheers,
Tobias


