[LLVMdev] [LNT] Question about results reliability in LNT infrastructure

Renato Golin renato.golin at linaro.org
Mon Jul 1 01:17:16 PDT 2013


On 1 July 2013 06:51, James Courtier-Dutton <james.dutton at gmail.com> wrote:

> Another option is to take a deterministic approach to measurement. The
> code should execute the same CPU instructions every time it is run, so
> some method to measure just these instructions should be attempted. Maybe
> processing qemu logs when LLVM is run inside qemu might give a possible
> solution?
>

Hi James,

This looks simpler on paper than it is in practice.

First, no emulator will give you an accurate cycle count or an accurate
execution sequence, so it's virtually impossible (and practically
irrelevant) to benchmark on models.

Second, measuring the "relevant code" is what a benchmark is all about, but
instrumenting it (with emulators, profilers, etc.) to separate the relevant
from the irrelevant ends up making irrelevant what was not: the
instrumentation itself distorts the measurement.

A good benchmark can time just the relevant part and run it thousands or
millions of times to improve accuracy. Ours are not all good benchmarks;
most are not even benchmarks. We're timing the execution of whole programs,
so the measurement also includes OS context switches, CPU scheduling, disk
I/O and many other unpredictable (and unrelated) things.
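For instance, the kind of measurement I mean looks roughly like this (a
minimal sketch, not anything that exists in LNT or the test-suite; kernel()
and the repetition count are made up for illustration):

  #include <stdio.h>
  #include <time.h>

  /* Hypothetical stand-in for the hot path we actually care about. */
  static volatile unsigned long sink;
  static void kernel(void)
  {
      unsigned long acc = 0;
      for (unsigned long i = 0; i < 100000; ++i)
          acc += i * i;
      sink = acc;  /* keep the compiler from optimizing the work away */
  }

  int main(void)
  {
      const long reps = 10000;  /* enough repetitions to dwarf timer resolution */
      struct timespec t0, t1;

      clock_gettime(CLOCK_MONOTONIC, &t0);  /* time only the region of interest */
      for (long i = 0; i < reps; ++i)
          kernel();
      clock_gettime(CLOCK_MONOTONIC, &t1);

      double total = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
      printf("%.9f s per iteration\n", total / reps);
      return 0;
  }

The point is that the timer brackets only the kernel, not program start-up,
and the per-iteration figure is averaged over thousands of runs.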

The scientific approach is to run the test multiple times to improve
accuracy, but your accuracy will never be better than half of the minimum
measuring resolution. So, if we don't increase the run time enough to make
that resolution irrelevant, no amount of statistics will give you more
accuracy.
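To put rough numbers on it (hypothetical figures, just to show the
arithmetic): with a 0.01s measuring resolution, a benchmark that runs for
0.02s carries an uncertainty of about ±0.005s, i.e. roughly 25% of the
measurement; stretch the same workload to 20s (say, 1000 repetitions) and
that same ±0.005s becomes 0.025%, small enough to ignore.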

As an example, I ran "hello world" on my laptop and on my chromebook. My
laptop gives me a 0.001s run time, with the occasional 0.002s. My chromebook
is never below 0.010s, with 0.012s being the average. That's mostly
start-up, library loading and OS interruptions.

Some of the benchmarks (http://llvm.org/perf/db_default/v4/nts/12944) take
between 0.010s and 0.035s to run, which really means nothing at that level
of noise.

Anton, David and Chris are absolutely correct that smoothing the curve will
give no real insight into the quality of the results, though it will filter
out most false positives. But that's not enough, and neither is a decent
statistical analysis. We need benchmarks to be what they're supposed to be:
benchmarks.

If we know a test application is not suitable for benchmarking, we should
stop timing it. If we want to time an application, we should isolate the hot
paths, run them multiple times, and so on.

One of the original assumptions of the test-suite was to NOT change the
applications, because that would make it easier to just import a new version
if we ever did. I'm not sure the time saved is really paying off. It's my
opinion that we do need to change the applications, and we do need a
different approach to community benchmarks.

cheers,
--renato