[www] r176209 - Add LNT statistics project

David Blaikie dblaikie at gmail.com
Thu Feb 28 11:35:13 PST 2013


Sorry - mashed send by accident... continuing.

On Thu, Feb 28, 2013 at 11:28 AM, David Blaikie <dblaikie at gmail.com> wrote:
> On Thu, Feb 28, 2013 at 11:08 AM, Renato Golin <renato.golin at linaro.org> wrote:
>> On 28 February 2013 18:53, David Blaikie <dblaikie at gmail.com> wrote:
>>>
>>> I'm really confused by what you're saying/getting at.
>>
>>
>> We both are... ;)
>>
>>
>>> Are you suggesting that these are acceptable tests but that my
>>> proposal to add "execute multiple times until desired confidence is
>>> achieved" on top of them would be problematic? Or are you suggesting
>>> that these are examples of bad ideas that are similar/equivalent to my
>>> idea?
>>
>>
>> Linpack and Dhrystone are bad examples of compiler tests (IMHO), since
>> they were developed as platform tests (see below).
>>
>>
>>> I don't think they're equivalent & I think they're a bad idea as a test
>>> suite.
>>
>>
>> I agree the tests we currently run could be much better executed.
>>
>>
>>> Benchmarks should be deterministic in their behavior (including
>>> execution time/space), not based on clock/wall time (measuring the
>>> execution time of an annealing algorithm that spends N seconds
>>> annealing is useless, for example... because it's always going to
>>> take N seconds).
>>
>>
>> Yes, you got me wrong, that would be silly. ;)
>>
>>
>>> 3 seconds +/- 0.25 of a second" (or something like that - of course we
>>> could keep all the actual run times in the report & summarize them in
>>> different ways for different clients).
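
For illustration, a rough Python sketch of that kind of summary (the sample
values, the summarize helper, and the 95% level are all made up here - this
isn't anything LNT does today, and a t-interval would arguably be more
appropriate than a normal approximation for a handful of runs):

import statistics

def summarize(samples, confidence=0.95):
    """Summarize raw run times as mean +/- the half-width of a
    normal-approximation confidence interval (e.g. 3.0 +/- 0.14 seconds)."""
    mean = statistics.mean(samples)
    if len(samples) < 2:
        return mean, float("inf")  # can't estimate spread from a single run
    stdev = statistics.stdev(samples)
    # Two-sided z-quantile, ~1.96 for a 95% interval.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, z * stdev / len(samples) ** 0.5

# e.g. summarize([2.9, 3.1, 3.0, 2.8, 3.2]) -> (3.0, ~0.14)

The raw times would of course still go into the report as-is.
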
>>
>>
>> I'd prefer to only output raw numbers and leave the presentation /
>> crunching logic to the infrastructure.
>>
>>
>>> It could even go a step further and use the previous runs to decide
>>> how much confidence was required for the current run (e.g. if
>>> yesterday's run was 3+/-0.25 and after a few runs today the test
>>> reached 5+/-1 we could stop running even though the confidence
>>> interval was wider - because we're already way outside the bounds
>>> based on yesterday's performance). This is probably overkill but just
>>> an example.
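
As a toy illustration of that early-exit check (again just a sketch - the
function name and interval arithmetic are made up, not existing LNT code):

def clearly_regressed(today_mean, today_half_width,
                      yesterday_mean, yesterday_half_width):
    """True if today's interval so far (e.g. 5 +/- 1) lies entirely above
    yesterday's (e.g. 3 +/- 0.25): we're already well outside yesterday's
    bounds, so we can stop early even though today's interval is still
    comparatively wide."""
    return (today_mean - today_half_width) > (yesterday_mean + yesterday_half_width)

# The example above: yesterday 3 +/- 0.25, today 5 +/- 1 after a few runs.
assert clearly_regressed(5.0, 1.0, 3.0, 0.25)
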
>>
>>
>> Not all tests have a clear notion of confidence. Changing benchmarks to add
>> the idea of confidence requires target-specific behaviour (see Livermore
>> Loops), and that's not a good way to spot compiler regressions (IMHO),
>> though such a test can still be a valid platform benchmark.
>>
>> My main point is the difference between platform benchmarks
>> (CPU+MEM+OS+software+etc.) and benchmarks that simply measure compiler
>> optimization and code quality.
>>
>> The first category allows heuristics and dynamic testing (confidence
>> checks, etc.); the second doesn't.
>>
>> Things like micro-benchmarks or vectorization tests can suffer a lot from a
>> change in a single instruction, and if you add heuristics, you can eliminate
>> the problem before even running the test (by detecting the slowness and
>> trying something different, for example), so we'd never spot the regression
>> that will show up in user code (where no such heuristics are applied).
>
> I'm still confused as to which things you're talking about. My
> suggestion is that we can get higher confidence in the performance of
> tests by running them multiple times, and that we can reduce the cost
> of that increased confidence by choosing the number of runs
> dynamically - which matters given how long these runs take and how
> important it is to get finer granularity with the same hardware
> resources we have.
>
>> Not to mention that simply increasing the number of cycles by a fixed amount
>> is far simpler to do on a batch of hundreds of completely different
>> benchmarks. :D
>
> Right - and any test like Livermore that uses heuristics & timing to
> drive its own behavior is a problem even for the "run a fixed
> multiple" idea (& our buildbot configuration currently runs 3 times - but
> it runs the entire suite 3 times one after the other, rather than
> running each test 3 times in a row - so you'll still not get terribly
> stable numbers because each run is cold (but, yes, this might be more
> realistic - but noisy results aren't actionable anyway)).

So my idea of "run a
dynamic number of times until we get some confidence" applies to all
the cases your "run a fixed number of times" does & has nothing to do
with the problems of Livermore, etc. Those problems exist for both
ideas, which is why I'm confused about you bringing them up.

So we already have a way to run a fixed multiple - but this is
expensive to do for large tests that already have low variance &
insufficient for small tests with very high variance (at least on
machines that aren't as quiet as the Mac minis). Something dynamic
could get us high confidence without completely blowing out test-suite
execution time by running all the tests many more times than
necessary just because one test needed that many runs.
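
To be explicit about what I mean by "something dynamic", here's a rough
per-test rerun loop. All of this - the thresholds, the run_test hook, and
the stopping rule - is hypothetical, not existing LNT or test-suite
behaviour, just a sketch of the shape of it:

import statistics

def measure(run_test, rel_half_width=0.05, confidence=0.95,
            min_runs=3, max_runs=10):
    """Re-run one benchmark until the confidence interval on its mean
    time is tighter than rel_half_width * mean, or max_runs is hit.

    run_test is assumed to be a callable that executes the benchmark
    once and returns its wall time in seconds (a hypothetical hook).
    """
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    samples = []
    while len(samples) < max_runs:
        samples.append(run_test())
        if len(samples) < min_runs:
            continue
        mean = statistics.mean(samples)
        half_width = z * statistics.stdev(samples) / len(samples) ** 0.5
        if half_width <= rel_half_width * mean:
            break  # tight enough - stop early on stable tests
    return samples  # report raw times; summarizing stays server-side

Quiet, long-running tests would stop after min_runs, while noisy
micro-benchmarks would get the extra runs they actually need.
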


