[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Tobias Grosser tobias at grosser.es
Sat Jun 29 19:10:53 PDT 2013


On 06/28/2013 01:19 PM, Renato Golin wrote:
> On 28 June 2013 19:45, Chris Matthews <chris.matthews at apple.com>
> wrote:
>
>> Given this tradeoff I think we want to tend towards false positives
>> (over false negatives) strictly as a matter of compiler quality.
>>
>
> False hits are not binary, but (at least) two-dimensional. You can't
> say it's better to have any amount of false positives than any amount
> of false negatives (pretty much like the NSA spying on *everybody* to
> avoid *any* false negative). Nor can you say that N false positives
> weigh the same as N false negatives, because an individual false hit
> can be huge in itself, or not.
>
> What we have today is a huge amount of false positives and very few
> (or no) false negatives. But we fail to spot even the real positives
> that this amount of noise would still let us see, because people
> don't normally look at the regressions. If I had to skim through the
> regressions on every build, I'd do nothing else.
>
> Given the proportion, I'd rather accept a few small false negatives
> and considerably reduce the number of false positives with a hammer
> approach, and only later try to nail down the options and do some
> fine-tuning, than do the fine-tuning now while nobody cares about
> any result because none of them are trustworthy.
>
>
>> That said, I'd never object to a professional's opinion on this
>> problem!
>
> Absolutely! And David can help you a lot there. But I wouldn't try
> to get it perfect before we get it acceptable.

Wow. Thanks a lot for the insights into what LNT is currently doing and
what people are planning for the future. It seems there is a lot of
interesting stuff on the way.

I agree with Renato that the major problem at the moment is not that we
miss regressions because we fail to detect them, but that we miss them
because nobody looks at the results due to the large amount of noise.

To make this more concrete, I want to point you to the experiments that
Star Tan has run. He hosted his LNT results here [1]. One of the top
changes in the report is a 150% compile-time increase for
SingleSource/UnitTests/2003-07-10-SignConversions.c.

Looking at the raw data for the two runs, we get:

~$ cat /tmp/data-before
0.0120
0.0080
0.0200

~$ cat /tmp/data-after
0.0200
0.0240
0.0200

It seems there is a lot of noise involved. Still, LNT reports this
result without recognizing that the measurements for this benchmark are
unreliable.

In contrast, the ministat [2] tool is perfectly capable of understanding
that those results are insufficient to prove any statistical difference
at 90% confidence.

=======================================================================
$ ./src/ministat -c 90 /tmp/data-before /tmp/data-after
x /tmp/data-before
+ /tmp/data-after
+-----------------------------------------------+
|                                   +           |
|  x          x                     *          +|
||____________M___A______________|_|M___A_____| |
+-----------------------------------------------+
     N           Min           Max        Median           Avg        Stddev
x   3         0.008          0.02         0.012   0.013333333  0.0061101009
+   3          0.02         0.024          0.02   0.021333333  0.0023094011
No difference proven at 90.0% confidence
=======================================================================
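For what it's worth, the numbers ministat prints can be cross-checked
with a few lines of plain Python. This is just a sketch of the same
pooled-variance Student's t-test, not code from LNT or ministat:

=======================================================================
import math

before = [0.0120, 0.0080, 0.0200]
after  = [0.0200, 0.0240, 0.0200]

def mean(xs):
    return sum(xs) / len(xs)

def stddev(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

# Pooled standard deviation over both samples, as ministat computes it.
n1, n2 = len(before), len(after)
pooled = math.sqrt(((n1 - 1) * stddev(before) ** 2 +
                    (n2 - 1) * stddev(after) ** 2) / (n1 + n2 - 2))

# The t statistic for the difference of the two means.
t = (mean(after) - mean(before)) / (pooled * math.sqrt(1.0/n1 + 1.0/n2))

print("stddev %.10f -> %.10f" % (stddev(before), stddev(after)))
print("t = %.3f" % t)
=======================================================================

This reproduces the standard deviations from the table above, and the
resulting t value of about 2.121 stays below the critical value of
2.132 (90% confidence, 4 degrees of freedom), which is exactly why
ministat proves no difference here.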

Running ministat on the results reported for
MultiSource/Benchmarks/7zip/7zip-benchmark, we can prove a difference
even at 99.5% confidence:

=======================================================================
$ ./src/ministat -c 99.5 /tmp/data2-before /tmp/data2-after
x /tmp/data2-before
+ /tmp/data2-after
+---------------------------------------------------------+
|    x                                               +    |
|x   x                                               +   +|
||__AM|                                              M_A_||
+---------------------------------------------------------+
     N           Min           Max        Median           Avg        Stddev
x   3        45.084        45.344        45.336     45.254667    0.14785579
+   3        48.152         48.36        48.152     48.221333    0.12008886
Difference at 99.5% confidence
	2.96667 +/- 0.788842
	6.55549% +/- 1.74312%
	(Student's t, pooled s = 0.13469)
=======================================================================

The statistical test ministat is performing seems simple and pretty 
standard. Is there any reason we could not do something similar? Or are 
we doing it already and it just does not work as expected?
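Just to sketch what I have in mind (SciPy is only used here because
scipy.stats.ttest_ind happens to implement the same pooled-variance
t-test; significant_change() is a hypothetical helper, not an existing
LNT function):

=======================================================================
from scipy import stats

def significant_change(before, after, confidence=90.0):
    # Pooled-variance Student's t-test, the same test ministat runs.
    t, p = stats.ttest_ind(before, after, equal_var=True)
    return p < 1.0 - confidence / 100.0

# The compile-time samples above are filtered out at 90% confidence...
print(significant_change([0.0120, 0.0080, 0.0200],
                         [0.0200, 0.0240, 0.0200]))           # False
# ...while the 7zip samples pass even at 99.5% confidence.
print(significant_change([45.0840, 45.3440, 45.3360],
                         [48.1520, 48.3600, 48.1520], 99.5))  # True
=======================================================================

A report that only lists changes for which significant_change() holds
would already remove most of the noise discussed above.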

Filtering and sorting the results by confidence seems very interesting
to me. In fact, I would much rather look first at the performance
changes reported with 99.5% confidence than at the ones that could not
even be proven at 90% confidence.
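
To illustrate the ordering with the two sample pairs from this mail
(again just a sketch, not existing LNT code):

=======================================================================
from scipy import stats

# The two before/after sample pairs shown above.
results = {
    "SingleSource/UnitTests/2003-07-10-SignConversions.c":
        ([0.0120, 0.0080, 0.0200], [0.0200, 0.0240, 0.0200]),
    "MultiSource/Benchmarks/7zip/7zip-benchmark":
        ([45.0840, 45.3440, 45.3360], [48.1520, 48.3600, 48.1520]),
}

def pvalue(pair):
    before, after = pair
    return stats.ttest_ind(before, after, equal_var=True)[1]

# Sorting by ascending p-value puts the statistically strongest
# changes first; the noisy entries sink to the bottom of the report.
for name in sorted(results, key=lambda n: pvalue(results[n])):
    print("p = %.5f  %s" % (pvalue(results[name]), name))
=======================================================================

With this ordering, the 7zip change (p ~ 0.00001) would show up at the
top, while the noisy compile-time change (p ~ 0.1) would drop to the
bottom.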

Cheers,
Tobias

[1] http://188.40.87.11:8000/db_default/v4/nts/3
[2] https://github.com/codahale/ministat
