[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Mon Jul 1 22:11:46 PDT 2013

On 07/01/2013 09:41 AM, Renato Golin wrote:
> On 1 July 2013 02:02, Chris Matthews <chris.matthews at apple.com> wrote:
>
>> One thing that LNT is doing to help “smooth” the results for you is by
>> presenting the min of the data at a particular revision, which (hopefully)
>> is approximating the actual runtime without noise.
>>
>
> That's an interesting idea, as you said, if you run multiple times on every
> revision.
>
> On ARM, every run takes *at least* 1h, other architectures might be a lot
> worse. It'd be very important on those architectures if you could extract
> point information from group data, and min doesn't fit in that model. You
> could take min from a group of runs, but again, that's no different than
> moving averages. Though, "moving mins" might make more sense than "moving
> averages" for the reasons you exposed.

I get your point. On the other hand, it may be worth first getting
statistically reliable, noise-free numbers at a lower resolution in
terms of commits. Given those reliable numbers, we can then work on
improving the resolution (without introducing noise). Also, multiple
runs per revision should be easy to parallelize across different
machines, so confidence in the results seems to be a problem that can
be solved with additional hardware.
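
As a rough sketch of the two ideas discussed above (min over several
runs of one revision, and Renato's "moving min" across revisions): the
function names and the data layout are hypothetical and are not LNT's
API.

```python
from collections import deque


def min_per_revision(samples):
    """Collapse several runs of each revision to their minimum.

    `samples` is an assumed layout: revision -> list of runtimes (seconds).
    """
    return {rev: min(times) for rev, times in samples.items()}


def moving_min(values, window):
    """For each position, the minimum over the last `window` values."""
    recent = deque(maxlen=window)  # old values drop out automatically
    out = []
    for v in values:
        recent.append(v)
        out.append(min(recent))
    return out
```

The first function needs several runs per revision; the second trades
resolution in commits for fewer runs, which is the trade-off described
above.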

> Also, on tests that take as long as noise to run (0.010s or less on A15),
> the minimum is not relevant, since runtime will flatten everything under
> 0.010 onto 0.010, making your test always report 0.010, even when there are
> regressions.
>
> I really cannot see how you can statistically enhance data in a scenario
> where the measuring rod is larger than the signal. We need to change the
> wannabe-benchmarks to behave like proper benchmarks, and move everything
> else into "Applications" for correctness and specifically NOT time them.
> Less is more.

There is no question that we cannot improve the existing data, but it
would be great to at least reliably detect that some data is just plain
noise.
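
One simple way such a check could look, as a sketch only: the 0.010s
default is the timer granularity Renato mentions for the A15, and both
function names are made up for illustration.

```python
def is_below_timer_resolution(samples, timer_resolution=0.010):
    """True if the fastest run is no longer than the assumed timer
    granularity, i.e. the test is too short to be timed meaningfully."""
    return min(samples) <= timer_resolution


def ranges_overlap(before, after):
    """True if the runtime ranges of two revisions overlap, so a
    difference between them cannot be told apart from run-to-run noise."""
    return min(before) <= max(after) and min(after) <= max(before)
```

A result would be reported as a change only if the test is above the
timer resolution and the two ranges do not overlap. This is a crude
range test, not a proper statistical one.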

>> That works well with a lot of samples per revision, but not for across
>> revisions, where we really need the smoothing.   One way to explore this is
>> to turn
>>
>
> I was really looking forward to hearing the end of that sentence... ;)
>
>
>
>> We also lack any way to coordinate or annotate regressions, that is a whole
>> separate problem though.
>>
>
> Yup. I'm having visions of tag clouds, bugzilla integration, cross
> architectural regression detection, etc. But I'll ignore that for now,
> let's solve one big problem at a time. ;)

Yes, there is a lot of stuff that would really help.

Tobi