<div dir="ltr">On 1 July 2013 06:51, James Courtier-Dutton <span dir="ltr"><<a href="mailto:james.dutton@gmail.com" target="_blank">james.dutton@gmail.com</a>></span> wrote:<br><div class="gmail_extra"><div class="gmail_quote">


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div><p><span style="color:rgb(34,34,34)">Another option is to take a deterministic approach to measurement. The code should execute the same CPU instructions every time it is run, so some method to measure just these instructions should be attempted. Maybe processing qemu logs when llvm is run inside qemu might give a possible solution?</span></p>


</div></blockquote><div></div></div><br></div><div class="gmail_extra">Hi James,</div><div class="gmail_extra"><br></div><div class="gmail_extra">This looks simpler on paper than it is in practice.</div><div class="gmail_extra"><br></div><div class="gmail_extra">


First, no emulator will give you an accurate cycle count or an accurate execution sequence, so it's virtually impossible (and practically irrelevant) to benchmark on models.</div><div class="gmail_extra"><br></div><div class="gmail_extra">


Second, measuring "relevant code" is what a benchmark is all about, but instrumenting it (emulators, profilers, etc.) to separate the relevant from the rest perturbs the very thing being measured, making irrelevant what was not.</div><div class="gmail_extra"><br></div>


<div class="gmail_extra">A good benchmark can time just the relevant part and run it thousands or millions of times to improve accuracy. Ours are not all good benchmarks; most are not even benchmarks. We're timing the execution of whole programs, taking into account OS context switches, CPU schedulers, disk I/O and many other unpredictable (and unrelated) things.</div>
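<div class="gmail_extra"><br></div><div class="gmail_extra">A minimal sketch of what I mean by timing just the relevant part, in Python purely for illustration. Here hot_path, inner_reps and outer_runs are hypothetical names of mine, not anything in the test-suite, and keeping the best batch is just one possible choice (a mean or median would be equally easy to swap in):</div>

```python
import time


def hot_path(n=1000):
    # Hypothetical stand-in for the "relevant part" of a benchmark.
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_hot_path(func, inner_reps=1000, outer_runs=5):
    """Return the best per-call time in seconds over several batches."""
    best = float("inf")
    for _ in range(outer_runs):
        # Only the repeated calls sit between the two clock reads.
        start = time.perf_counter()
        for _ in range(inner_reps):
            func()
        elapsed = time.perf_counter() - start
        # Spread the batch time over the calls made in that batch.
        best = min(best, elapsed / inner_reps)
    return best


if __name__ == "__main__":
    print(f"best per-call time: {time_hot_path(hot_path):.9f}s")
```

<div class="gmail_extra">The only point is that the clock wraps many calls of the hot path, so process start-up falls outside the timed region and the timer's granularity is spread across inner_reps calls.</div>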


<div class="gmail_extra"><br></div><div class="gmail_extra">The scientific approach is to run multiple times and improve accuracy, but your accuracy will never be better than half of the minimum measuring distance (the smallest interval the timer can resolve). So, if we don't increase the run time enough to make the minimum measuring distance irrelevant, no amount of statistics will give us more accuracy.</div>
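<div class="gmail_extra"><br></div><div class="gmail_extra">To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It assumes a timer that reports in 0.001s steps (my assumption for illustration), and relative_resolution_error is a made-up helper. The worst-case error is half a step, so how much it matters depends entirely on how long the run is:</div>

```python
def relative_resolution_error(run_time_s, resolution_s):
    # Worst-case error is half a timer step, relative to the run time.
    return (resolution_s / 2) / run_time_s


if __name__ == "__main__":
    # Assumed timer step of 0.001s; the run times are illustrative only.
    for run_time in (0.001, 0.010, 1.0, 10.0):
        err = relative_resolution_error(run_time, resolution_s=0.001)
        print(f"run of {run_time:.3f}s: worst-case error {err:.2%}")
```

<div class="gmail_extra">With those assumed numbers, a 0.001s run can be off by 50%, while a 1s run is off by at most 0.05%.</div>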

<div class="gmail_extra"><br></div><div class="gmail_extra">As an example, I ran "hello world" on my laptop and on my chromebook. My laptop gives me a 0.001s run time with the occasional 0.002s. My chromebook is never less than 0.010s, with 0.012s being the average. That is mostly start-up, library and OS interruption time.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">Some of the benchmarks (<a href="http://llvm.org/perf/db_default/v4/nts/12944">http://llvm.org/perf/db_default/v4/nts/12944</a>) take between 0.010s and 0.035s to run, which really means nothing at that level of noise.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">Anton, David and Chris are absolutely correct that smoothing the curve will give no real insight into the quality of the results, though it will filter out most false positives. But that's not enough, and neither is a decent statistical analysis. We need benchmarks to be what they're supposed to be: benchmarks.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">If we know a test application is not suitable for benchmarking, stop timing it. If we want to time an application, isolate the hot paths, run them multiple times, etc. </div>

<div class="gmail_extra"><br></div><div class="gmail_extra">One of the original assumptions of the test-suite was to NOT change the applications, because it would be easier to just add a new version, if we ever did. I'm not sure that the time saved is really paying off. It's my opinion that we do need to change the applications and we do need a different approach to community benchmarks.</div>

<div class="gmail_extra"><br></div><div class="gmail_extra">cheers,</div><div class="gmail_extra">--renato</div></div>