[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Sun Jun 30 09:19:41 PDT 2013

On 06/30/2013 02:14 AM, Anton Korobeynikov wrote:
> Hi Tobi,
>
> First of all, all this is http://llvm.org/bugs/show_bug.cgi?id=1367 :)
>
>> The statistical test ministat is performing seems simple and pretty
>> standard. Is there any reason we could not do something similar? Or are we
>> doing it already and it just does not work as expected?

> The main problem with this sort of test is that we cannot trust it unless:
> 1. The data has a normal distribution
> 2. The sample size is large (say, > 50)
>
> Here we have only 3 points and, no, I won't trust ministat's
> t-test and normal-approximation-based confidence bounds. They are *too
> short* (i.e. the real confidence level is not 99.5% but actually
> 40-50%, for example).

Hi Anton,

I trust your knowledge of statistics, but I am wondering why ministat
(and its t-test) is promoted as a statistically sound tool for
benchmark results. Is using the t-test for benchmark results a bad
idea in general? Would ministat be a better tool if it implemented
the Wilcoxon/Mann-Whitney test?

> I'd ask for:
>
> 1. Increasing sample size to at least 5-10
> 2. Do the Wilcoxon/Mann-Whitney test

Reading about the Wilcoxon/Mann-Whitney test, it seems to be a more
robust test that frees us from the assumption that the data is normally
distributed. As its implementation also does not look overly
complicated, it may be a good choice.
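
To make this concrete, here is a minimal sketch of such a test in plain
Python. It is not LNT code; the function names are made up for
illustration. It assumes sample counts small enough that every split of
the pooled runs can be enumerated, and it computes an exact permutation
p-value instead of using the normal approximation:

```python
from itertools import combinations


def rank_data(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_exact(a, b):
    """Return (U statistic of sample a, exact two-sided p-value).

    Hypothetical helper: enumerates every way to pick len(a) runs out of
    the pooled runs, so it is only practical for small samples.
    """
    n, m = len(a), len(b)
    ranks = rank_data(list(a) + list(b))

    def u_stat(indices):
        return sum(ranks[i] for i in indices) - n * (n + 1) / 2.0

    observed = u_stat(range(n))
    mean_u = n * m / 2.0
    deviation = abs(observed - mean_u)

    total = 0
    extreme = 0
    for indices in combinations(range(n + m), n):
        total += 1
        # Count splits at least as far from the null mean as the observed one.
        if abs(u_stat(indices) - mean_u) >= deviation - 1e-9:
            extreme += 1
    return observed, extreme / total
```

One thing this shows: with only 3 runs per side there are just 20
possible splits, so even perfectly separated samples cannot get below a
two-sided p-value of 2/20 = 0.1. With 5 runs per side the minimum drops
to 2/252, which fits the request for 5-10 samples.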

Regarding the number of samples: I think the most important point is
that we get some measure of confidence by which we can sort our
results and that we make it visible in the UI. For different use cases
we can adapt the number of samples based on the required confidence and
the amount of noise and missed regressions we can accept. This may also
be a great use for the adaptive sampling that Chris suggested.
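
As a rough illustration of what I mean by adapting the number of
samples, here is a sketch in plain Python. Nothing here is existing LNT
API: run_base and run_new are hypothetical callables that each perform
one benchmark run and return a time, and alpha, min_n and max_n are
made-up knobs. It keeps adding runs until an exact rank-sum test reports
a difference or the sample budget is used up:

```python
from itertools import combinations


def exact_rank_p(a, b):
    """Exact two-sided permutation p-value for the rank sum of sample a."""
    pooled = list(a) + list(b)
    s = sorted(pooled)
    # Midrank: average of the 1-based positions a value occupies when sorted.
    rank = {v: (s.index(v) + 1 + len(s) - s[::-1].index(v)) / 2.0
            for v in set(s)}
    r = [rank[v] for v in pooled]
    n = len(a)
    expected = n * (len(pooled) + 1) / 2.0
    deviation = abs(sum(r[:n]) - expected)
    sums = [sum(r[i] for i in idx)
            for idx in combinations(range(len(r)), n)]
    return sum(abs(x - expected) >= deviation - 1e-9 for x in sums) / len(sums)


def adaptive_compare(run_base, run_new, alpha=0.05, min_n=3, max_n=10):
    """Sample both sides until significant or max_n runs per side are used.

    Returns (significant, p_value, runs_per_side).
    """
    a = [run_base() for _ in range(min_n)]
    b = [run_new() for _ in range(min_n)]
    while True:
        p = exact_rank_p(a, b)
        if p < alpha or len(a) >= max_n:
            return p < alpha, p, len(a)
        # Not conclusive yet: spend one more run on each side.
        a.append(run_base())
        b.append(run_new())
```

The returned p-value could be what we sort by and show in the UI. Note
that testing again after every extra run raises the false-positive rate
above alpha, so a real implementation would probably want to correct
for that.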

Is there anything stopping us from implementing such a test and exposing 
its results in the UI?

Cheers,
Tobi