<div dir="ltr"><br><div class="gmail_extra"><br><div class="gmail_quote">On Wed, May 30, 2018 at 4:07 AM, mbraun via llvm-dev <span dir="ltr"><<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Not going into all the detail, but from my side the big question is whether the benchmark's inner loop is small/fine grained enough that stabilization with google benchmark doesn't lead to dozens of seconds benchmark runtimes. Given that you typically see thousands or millions of invocations for small functions...<br></blockquote><div><br></div>Google Benchmark runs the kernel at most 1e9 times, and stops earlier once the CPU time reaches the minimum time or the wall-clock time reaches 5x the minimum time. The default minimum time is 0.5s, but we can change it using "MinTime(X)" or "--benchmark_min_time=X". </div><div class="gmail_quote">So the <span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">stabilization of small/fine-grained kernels with Google Benchmark should not be a problem.</span><br> <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<div><div class="gmail-h5"><br>

> On May 29, 2018, at 2:06 PM, Michael Kruse via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>> wrote:<br>

> <br>

> Thanks for your remarks.<br>

> <br>

> 2018-05-27 5:19 GMT-05:00 Dean Michael Berris <<a href="mailto:dean.berris@gmail.com">dean.berris@gmail.com</a>>:<br>

>> I think you might run into artificial overhead here if you’re not careful. In particular you might run into:<br>

>> <br>

>> - Missed in-lining opportunity in the benchmark. If you expect the kernels to be potentially inlined, this might be a problem.<br>

> <br>

> For the kind of benchmarks we have in mind, one function call overhead<br>

> is not significant, nor would we expect compilers to inline them in<br>

> the applications that use them.<br>

> Inlining may even be counterproductive: After inlining, the compiler might see from<br>

> the array initialization code that elements are initialized to 0 (or<br>

> some other deterministic value) and use that knowledge while<br>

> optimizing the kernel.<br>

> We might prevent such things by annotating the kernel with<br>

> __attribute__((noinline)).<br>

> <br>

> <br>

>> - The link order might cause interference depending on the linker being used.<br>

>> <br>

>> - If you’re doing LTO then that would add an additional wrinkle.<br>

>> <br>

>> They’re not show-stoppers, but these are some of the things to look out for and consider.<br>

> <br>

> I'd consider the benchmark to be specific to a compiler+linker<br>

> combination. As long as we can measure the kernel in isolation (and<br>

> consider cold/warm caches), it should be fine.<br>

> I'd switch off LTO here since any code that could be inlined into the<br>

> kernel should already be in its translation unit.<br>

> <br>

> <br>

>> <br>

>>> - Instruct the driver to run the kernel with a small problem size and<br>

>>> check the correctness.<br>

>> <br>

>> In practice, what I’ve seen is mixing unit tests which perform correctness checks (using Google Test/Mock) and then co-locating the benchmarks in the same file. This way you can choose to run just the tests or the benchmarks in the same compilation mode. I’m not sure whether there’s already a copy of the Google Test/Mock libraries in the test-suite, but I’d think those shouldn’t be too hard (nor controversial) to add.<br>

> <br>

> Google Test is already part of LLVM. Since the test-suite already has<br>

> a dependency on LLVM (e.g. for llvm-lit, itself already supporting<br>

> Google Test), we could just use that one.<br>

> I don't know yet how one would combine them to run the same code.<br>

> <br>

> <br>

>>> - Instructs Google Benchmark to run the kernel to get a reliable<br>

>>> average execution time of the kernel (without the input data<br>

>>> initialization)<br>

>> <br>

>> There’s ways to write the benchmarks so that you only measure a small part of the actual benchmark. The manuals will be really helpful in pointing out how to do that.<br>

>> <br>

>> <a href="https://github.com/google/benchmark#passing-arguments" rel="noreferrer" target="_blank">https://github.com/google/<wbr>benchmark#passing-arguments</a><br>

>> <br>

>> In particular, you can pause the timing when you’re doing the data initialisation and then resume just before you run the kernel.<br>

> <br>

> Sounds great.<br>

> <br>

> <br>

>>> - LNT's --exec-multisample does not need to run the benchmarks<br>

>>> multiple times, as Google Benchmark already did so.<br>

>> <br>

>> I thought recent patches already do some of this. Hal would know.<br>

> <br>

> I haven't found any special handling for --exec-multisample in MicroBenchmarks.<br>

<br>

</div></div>This LNT option means running the whole benchmarking multiple times. It's more or less a loop around the whole benchmark run: This even means we compile the source code multiple times.<br><br>

FWIW I'd really like to see an alternative option to compile once and then run each benchmark executable multiple times before moving to the next executable; but given that we don't have that today I'm just not using the multisample option personally as it's not better than submitting multiple benchmarking jobs anyway...<br></blockquote><div> </div>I think it's better to use Google Benchmark's "--benchmark_repetitions=n" option than "--exec-multisample=n": with multisample, the data initialization is executed again along with the main kernel, whereas benchmark_repetitions runs only the main kernel n times. Re-running the data initialization only adds unwanted execution time.<br><br>Also, the Google Benchmark library reports the result of each run along with the mean, stddev and median of all runs. This can be useful if there is a tool that can parse the stdout and write a summary to a sheet.<br><br>One thing that I would like to have is a tool/API that can automatically verify the output against a reference output, like lit does.</div><div class="gmail_quote"><br></div><div class="gmail_quote"><br><div><span style="color:rgb(0,0,0)">Regards,</span><br></div></div><div class="gmail_signature"><div dir="ltr"><div dir="ltr"><font style="background-color:rgb(255,255,255)" color="#000000">Pankaj Kukreja</font><div><font style="background-color:rgb(255,255,255)" color="#000000">Computer Science Department</font></div><div><font style="background-color:rgb(255,255,255)" color="#000000">IIT Hyderabad</font></div></div></div></div>
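<div>P.S. To make the stopping rule from the top of this mail concrete, here is a rough sketch of how I understand it. This is not Google Benchmark's actual code: "ShouldStop" and "kMaxIterations" are hypothetical names I made up for illustration, and treating the thresholds as inclusive (>=) is my assumption.</div>

```cpp
#include <cstdint>

// Hypothetical helper, NOT part of the Google Benchmark API. It only restates
// the stopping rule described in this mail: stop after at most 1e9 iterations,
// or once CPU time reaches min_time, or once wall-clock time reaches 5x min_time.
constexpr std::int64_t kMaxIterations = 1000000000;  // 1e9 iteration cap

inline bool ShouldStop(std::int64_t iterations, double cpu_seconds,
                       double wall_seconds, double min_time = 0.5) {
  return iterations >= kMaxIterations      // iteration cap reached
         || cpu_seconds >= min_time        // enough CPU time accumulated
         || wall_seconds >= 5.0 * min_time;  // wall clock is 5x the minimum time
}
```

<div>With the default minimum time of 0.5s, a kernel that has only accumulated 0.1s of CPU time keeps running, but it is cut off once 2.5s of wall-clock time have passed, so even a very small kernel should not lead to dozens of seconds of runtime.</div>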

</div></div>