[llvm-commits] [zorg] r146460 - /zorg/trunk/lnt/lnt/db/runinfo.py

Sebastian Pop spop at codeaurora.org
Thu Dec 29 12:06:09 PST 2011


Hi Daniel,

I just started reading your changes to zorg: Tobi pointed me to your
recent patches.

On Mon, Dec 12, 2011 at 6:59 PM, Daniel Dunbar <daniel at zuster.org> wrote:
> Author: ddunbar
> Date: Mon Dec 12 18:59:05 2011
> New Revision: 146460
>
> URL: http://llvm.org/viewvc/llvm-project?rev=146460&view=rev
> Log:
> lnt: Take two small steps to reduce higher-than-manageable number of significant
> changes in reports...
>
>  - First, when using an estimated standard deviation, only treat the change as
>   significant if it is above the threshold from the estimated mean. I *think*
>   this is somewhat statistically sound, based on how we do the estimation.
>
>  - Second, don't report any changes with delta's under 0.01 in ignore_small
>   mode. This obviously has no mathematical basis, but appears to be useful in
>   practice.
>
> Modified:
>    zorg/trunk/lnt/lnt/db/runinfo.py
>
> Modified: zorg/trunk/lnt/lnt/db/runinfo.py
> URL: http://llvm.org/viewvc/llvm-project/zorg/trunk/lnt/lnt/db/runinfo.py?rev=146460&r1=146459&r2=146460&view=diff
> ==============================================================================
> --- zorg/trunk/lnt/lnt/db/runinfo.py (original)
> +++ zorg/trunk/lnt/lnt/db/runinfo.py Mon Dec 12 18:59:05 2011
> @@ -10,7 +10,8 @@
>
>  class ComparisonResult:
>     def __init__(self, cur_value, prev_value, delta, pct_delta, stddev, MAD,
> -                 cur_failed, prev_failed, samples):
> +                 cur_failed, prev_failed, samples, stddev_mean = None,
> +                 stddev_is_estimated = False):
>         self.current = cur_value
>         self.previous = prev_value
>         self.delta = delta
> @@ -20,6 +21,8 @@
>         self.failed = cur_failed
>         self.prev_failed = prev_failed
>         self.samples = samples
> +        self.stddev_mean = stddev_mean
> +        self.stddev_is_estimated = stddev_is_estimated
>
>     def get_samples(self):
>         return self.samples
> @@ -65,10 +68,27 @@
>         if ignore_small and abs(self.pct_delta) < .01:
>             return UNCHANGED_PASS
>
> +        # Always ignore changes with small deltas. There is no mathematical
> +        # basis for this, it should be obviated by appropriate statistical
> +        # checks, but practical evidence indicates what we currently have isn't
> +        # good enough (for reasons I do not yet understand).
> +        if ignore_small and abs(self.delta) < .01:
> +            return UNCHANGED_PASS

I guess I am OK with this smoothing "hack" to filter out tests that do
not run long enough.  I see that you are using this computation:

        # Compute the comparison status for the test value.
        delta = run_value - prev_value

and so I assume that the values are in seconds.  I would say that
differences of less than 0.01 seconds are "unnoticeable", except for
testcases that run for less than 1 second: for those, a 0.01s
difference is larger than what the previous check would already discard:

         if ignore_small and abs(self.pct_delta) < .01:
             return UNCHANGED_PASS
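
To make this concrete with made-up numbers: for a test running around
half a second, a change that the existing relative check would keep is
still thrown away by the new absolute check:

prev_value = 0.5    # previous run time, in seconds (made-up value)
run_value  = 0.508  # current run time, in seconds (made-up value)

delta = run_value - prev_value   # 0.008 s
pct_delta = delta / prev_value   # 1.6%

print(abs(pct_delta) < .01)      # False: the relative check keeps it
print(abs(delta) < .01)          # True:  the absolute check discards it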

> +
>         # If we have a comparison window, then measure using a symmetic
>         # confidence interval.
>         if self.stddev is not None:
> -            if abs(self.delta) > self.stddev * confidence_interval:
> +            is_significant = abs(self.delta) > (self.stddev *
> +                                                confidence_interval)
> +
> +            # If the stddev is estimated, then it is also only significant if
> +            # the delta from the estimate mean is above the confidence interval.
> +            if self.stddev_is_estimated:
> +                is_significant &= (abs(self.current - self.stddev_mean) >
> +                                   self.stddev * confidence_interval)

I think that using this threshold is fine.  From what I see, you are
computing the Manhattan distance from the current value of a test to
the estimated mean and comparing it against the standard deviation
scaled by the magic constant.  (By the way, I think that 2.576 is a
reasonable default; I was using 2.2 in
http://repo.or.cz/w/gcc-perf-regression-tester.git/blob/34640748810602004e265ad6927b095155ca9772:/analyze-core.R
)
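
In code, this is how I read the check (a standalone sketch, not the
actual runinfo.py code; the names are mine):

def is_significant(current, estimated_mean, stddev, confidence_interval=2.576):
    # Absolute ("Manhattan") distance of the current value from the
    # estimated mean, compared against the scaled noise level.
    # 2.576 roughly corresponds to a 99% two-sided interval if the
    # noise is normally distributed.
    return abs(current - estimated_mean) > stddev * confidence_interval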

One small problem here is that the stddev itself is computed as a
Euclidean (root-mean-square) distance:

def standard_deviation(l):
    m = mean(l)
    means_sqrd = sum([(v - m)**2 for v in l]) / len(l)
    rms = math.sqrt(means_sqrd)
    return rms

So what do you think about changing the is_significant test to also use
the Euclidean distance?
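
Something along these lines is what I have in mind (just a sketch,
assuming the current run's samples are available, as get_samples()
above suggests):

import math

def is_significant_euclidean(samples, estimated_mean, stddev,
                             confidence_interval=2.576):
    # RMS (Euclidean) distance of the current samples from the
    # estimated mean, compared against the scaled stddev, so that both
    # sides of the comparison use the same kind of distance.
    rms = math.sqrt(sum((v - estimated_mean)**2 for v in samples)
                    / len(samples))
    return rms > stddev * confidence_interval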

I would have to understand the code better, but you can probably tell
me: what is the size of the window of past results that you use to
compute the mean and the noise level?

Based on the size of this window, I can see another potential problem:
if you are computing the mean and stddev over all the past results
that you have, then, given that there have been several improvements
and degradations in the compiler's performance over that history,
those speed-ups and slow-downs end up folded into both the mean and
the stddev.

The way I solved this problem is with a sliding window of (say, a
magic number of) 10 measurements.  If you compute the mean and stddev
over this sliding window, you get a temporal series of mean and stddev
values that is less prone to averaging the compiler's speed-ups and
slow-downs together.  You then compute a second mean and stddev over
this new temporal series and use those as your mean and stddev.  It is
a bit like computing a second derivative.
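
In code, the scheme is roughly this (a sketch of my approach, not
something that exists in lnt; the exact way of combining the two
series is my reading of the description above):

import math

def mean(l):
    return sum(l) / len(l)

def stddev(l):
    m = mean(l)
    return math.sqrt(sum((v - m)**2 for v in l) / len(l))

def smoothed_mean_and_stddev(history, window=10):
    # Slide a window of `window` measurements over the history,
    # building a temporal series of per-window means and stddevs.
    window_means = [mean(history[i:i + window])
                    for i in range(len(history) - window + 1)]
    window_stddevs = [stddev(history[i:i + window])
                      for i in range(len(history) - window + 1)]
    # Second-level smoothing: take the mean of each series and use
    # those as the reference mean and noise level.
    return mean(window_means), mean(window_stddevs)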

Sebastian
--
Qualcomm Innovation Center, Inc is a member of Code Aurora Forum



