[LLVMdev] [LNT] Question about result reliability in the LNT infrastructure

Tobias Grosser tobias at grosser.es
Thu Jun 27 09:05:33 PDT 2013


On 06/23/2013 11:12 PM, Star Tan wrote:
> Hi all,
>
>
> When we compare two test runs, each consisting of three samples, how does LNT show whether the comparison is reliable?
>
>
> I have seen that the function get_value_status in reporting/analysis.py uses a very simple algorithm to infer the data status. For example, if abs(self.delta) <= (self.stddev * confidence_interval), the status is set to UNCHANGED. However, this is clearly not enough. Suppose both self.delta (e.g. 60%) and self.stddev (e.g. 50%) are huge, but self.delta is just slightly larger than the threshold: LNT will then report a huge performance improvement to readers without accounting for the huge stddev. One way to address this would be to normalize the performance improvement by the stddev, but I am not sure whether this has been implemented in LNT.
>
>
> Could anyone suggest how I can find out whether the test results in LNT are reliable? Specifically, how can I get a performance improvement/regression normalized by the standard error?
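>
> For illustration, a hypothetical helper along these lines (a sketch,
> not actual LNT code) is what I mean by normalizing: scale the delta by
> the standard error of the difference of means, so that a large but
> noisy delta is not reported as significant:
>
>     import math
>     import statistics
>
>     def scaled_delta(old_samples, new_samples):
>         # Welch-style statistic: difference of the means divided by
>         # its standard error. A value near zero means the change is
>         # indistinguishable from noise; a larger magnitude means a
>         # more reliable change.
>         n1, n2 = len(old_samples), len(new_samples)
>         m1 = statistics.mean(old_samples)
>         m2 = statistics.mean(new_samples)
>         v1 = statistics.variance(old_samples)  # sample variance
>         v2 = statistics.variance(new_samples)
>         se = math.sqrt(v1 / n1 + v2 / n2)
>         if se == 0:
>             return 0.0 if m2 == m1 else math.inf
>         return (m2 - m1) / se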

Hi Daniel, Michael, Paul,

do you happen to have any insights on this? Basically, the stddev shown
when a run is compared to a previous run does not seem to be a useful
measure of the reliability of the reported results. We are looking for a
good way/value to express the reliability of individual results in the
UI. In your experience, what is a good measure of the reliability of
test results?
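
For what it is worth, with only three samples per run even a plain
Welch's t-test would at least make the uncertainty explicit. A minimal
sketch, assuming scipy is available (the sample values are made up, and
this is not what LNT currently does):

    from scipy import stats

    old = [1.00, 1.60, 0.95]  # three samples from the baseline run
    new = [0.70, 1.10, 0.65]  # three samples from the current run

    # Welch's t-test does not assume equal variances between the runs.
    t, p = stats.ttest_ind(new, old, equal_var=False)
    print("t = %.2f, p = %.3f" % (t, p))

With samples this noisy the p-value stays large (around 0.2 here), even
though the means differ by roughly 30% -- which is exactly the kind of
signal the UI could surface instead of the raw delta.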

Thanks,
Tobias


