[PATCH] [LNT] Use Mann-Whitney U test to identify changes

Tobias Grosser tobias at grosser.es
Fri May 2 03:02:23 PDT 2014


On 02/05/2014 11:16, Yi Kong wrote:
> Hi Tobias,
>
> On 02/05/14 08:38, Tobias Grosser wrote:
>> On 01/05/2014 23:27, Yi Kong wrote:
>>> This patch adds Mann-Whitney U tests to identify changes, as
>>> suggested by Tobias and Anton. The user can configure the desired
>>> confidence level.
>>
>> Hi Yi Kong,
>>
>> thanks for this nice patch. I looked into it briefly by setting up an
>> LNT server and adding a couple of the -O3 nightly test results to it. It
>> seems that, at least with the default 0.5 confidence level, this does not
>> reduce the noise at all. Just switching the aggregation function from
>> minimum to mean helps here a lot more (any idea why?).
>
> The median is far less affected by variance than the minimum; the minimum
> may even be an outlier.

I previously suggested changing this, but Chris had concerns. If you 
also believe it would be good to show the median by default, maybe it is 
worth resubmitting a patch?
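
To make that point concrete with made-up numbers (purely illustrative, not 
from an LNT run): a single anomalously fast sample becomes the reported value 
when aggregating with the minimum, while the median of the same samples barely 
notices it.

import statistics

# Made-up timings with one anomalously fast outlier.
timings = [1.02, 1.01, 1.03, 1.02, 0.71, 1.01, 1.02, 1.03, 1.01, 1.02]

print(min(timings))                  # 0.71 -- reports the outlier itself
print(statistics.median(timings))    # 1.02 -- robust to the outlier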

>> Did you play with
>> the confidence level and get an idea of which level would be useful?
>> My very brief experiments showed that a value of 0.999 or even 0.9999 is
>> something that gets us below the noise level. I verified this by looking
>> at subsequent runs where the commits themselves really were just
>> documentation commits. Those commits should not show any performance
>> changes. Even
>> with those high confidence requirements, certain performance regressions
>> such as r205965 can still be spotted. For me, this is already useful as
>> we can really ask for extremely low noise answers,
>> which will help to at least catch the very clear performance
>> regressions. (Compared to today, where even those are hidden in the
>> reporting noise.)
>
> I've been experimenting with the same dataset as yours. It seems 0.9
> eliminates some noise, but it is not good enough. 0.999 produces
> very nice results, but:
>  >>> scipy.stats.mannwhitneyu([1,1,1,1,1],[2,2,2,2,1])
> (2.5, 0.0099822266526080217)
> That's only 0.99! Anything greater than 0.99 will cause too many false
> negatives.

We are running 10 runs, no?

 >>> scipy.stats.mannwhitneyu([1,1,1,1,1,1,1,1,1,1], [2,2,2,2,2,2,2,2,2,1])
 (5.0, 4.8302595041055999e-05)
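
As a side note, here is a quick sketch (made-up, idealized timings, not part 
of the patch) of how the attainable p-value depends on the number of runs per 
side. With scipy's mannwhitneyu, even perfectly separated groups cannot get 
below p = 0.001 with 5 runs per side, but they can with 10 or 20:

from scipy import stats

# Idealized, perfectly separated "before"/"after" timings (made up); the
# printed p-values depend on the scipy version, but the trend is the same:
# more runs per side allow smaller p-values, i.e. higher confidence levels.
for n in (5, 10, 20):
    before = [1.00 + 0.001 * i for i in range(n)]  # fast timings
    after  = [2.00 + 0.001 * i for i in range(n)]  # clearly slower timings
    u, p = stats.mannwhitneyu(before, after)
    print(n, p)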

>> I would like to play with this a little bit more. Do you think it is
>> possible to print the p value in the Run-Over-Run Changes Details?
>
> That shouldn't be too difficult. I could implement it if you want.

Cool.

> Alternatively, you can just print them from the console, which only takes
> a one-line change.

I think having them in the web interface is more convenient, especially 
if we want to look only at the changes that are reported as changing 
performance.
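
For reference, a generic sketch of what I have in mind (this is not LNT's 
actual reporting code; the benchmark name and sample values below are made 
up): compute a per-benchmark p-value from the raw samples and print it next 
to the run-over-run comparison.

from scipy import stats

# Hypothetical stand-in for whatever the analysis layer provides per
# benchmark: (previous run samples, current run samples). Not LNT code.
samples = {
    "SingleSource/Benchmarks/Foo": (
        [1.01, 1.02, 1.00, 1.03, 1.01],   # previous run's timings
        [1.11, 1.12, 1.10, 1.13, 1.11],   # current run's timings
    ),
}

for name, (prev, curr) in samples.items():
    u, p = stats.mannwhitneyu(prev, curr)
    print("%s: p = %.4g" % (name, p))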

>> Also, it may make sense to investigate this on another machine. I use 5
>> separate machines with identical hardware. It may be interesting to see
>> if runs on the same machine are more reliable and could get away with a
>> lower confidence level. Did you do any experiments? Maybe with a higher
>> run count, e.g. 20?
>
> For now I don't have another machine to test on; my desktop is far too
> noisy. I've been trying to set up an ARM board.

I have one machine identical to the automatic builders, but not 
connected to any builder. If you send me your SSH public key, I can give 
you access. We could use this to experiment with a higher number of
runs, binding benchmarks to fixed CPUs, using 'perf stat' to count
performance events, ...
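
For instance, something along these lines (just a rough sketch of the kind of 
experiment I mean, not an LNT feature; the CPU number and benchmark binary are 
placeholders):

import subprocess

# Rough sketch, not an LNT feature: pin the benchmark to one CPU with taskset
# so scheduler migrations add less noise, and let `perf stat` count hardware
# events instead of (or in addition to) wall-clock time.
cmd = [
    "taskset", "-c", "2",                         # bind to CPU 2
    "perf", "stat", "-e", "cycles,instructions",  # events to count
    "./benchmark",                                # placeholder binary
]
proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
_, err = proc.communicate()
print(err.decode())                               # perf stat reports on stderr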

I am very interested in making these results as robust as possible. If 
necessary, I would rather choose a smaller set of benchmarks and run 
them more often than keep dealing with this ongoing noise.

Tobias


