[llvm-dev] Questions About LLVM Test Suite: Time Units, Re-running benchmarks

Michael Kruse via llvm-dev llvm-dev at lists.llvm.org
Sun Jul 18 20:57:31 PDT 2021


On Sun, Jul 18, 2021 at 11:14 AM Stefanos Baziotis via
llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Now, to the questions. First, there doesn't seem to be a common time unit for
> "exec_time" among the different tests. For instance, the SingleSource/ tests seem
> to use seconds while MicroBenchmarks seem to use μs. So, we can't reliably judge
> changes. I do get that micro-benchmarks are different in nature from
> Single/MultiSource benchmarks, so maybe one should focus on one or the other
> depending on what they're interested in.

Usually one does not compare executions of the entire test-suite, but
looks for which programs have regressed. In this scenario only relative
changes per program matter, so μs are only compared to μs and
seconds only to seconds.
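
If you want to automate that per-program comparison yourself (the
test-suite also ships utils/compare.py for this), a minimal sketch could
look like the one below. It assumes the JSON written by `llvm-lit -o` has
a top-level "tests" list whose entries carry a "metrics" dict containing
"exec_time"; adjust the key names if your version emits something else.

  import json
  import sys

  def load(path):
      # Map program name -> exec_time (in whatever unit the test reports).
      with open(path) as f:
          data = json.load(f)
      return {t["name"]: t["metrics"]["exec_time"]
              for t in data["tests"]
              if "exec_time" in t.get("metrics", {})}

  base = load(sys.argv[1])   # e.g. baseline.json
  new = load(sys.argv[2])    # e.g. patched.json

  # Relative change per program, so μs is only ever compared to μs.
  for name in sorted(base.keys() & new.keys()):
      if base[name] > 0:
          change = (new[name] - base[name]) / base[name] * 100.0
          print(f"{change:+7.2f}%  {name}")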


> In any case, it would at least be great if the JSON data contained the time unit per test,
> but that does not happen either.

What do you mean? Don't you get the exec_time per program?


> Do you think that the lack of time unit info is a problem? If so, would you prefer
> the solution of adding the time unit to the JSON, or do you want to propose an alternative?

You could also normalize the time unit that is emitted to the JSON, e.g. to s or ms.
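
As a rough sketch (not an official tool), one could post-process the
results file and, under the assumption stated above that only
MicroBenchmarks report μs, convert those entries to seconds:

  import json

  with open("out.json") as f:
      results = json.load(f)

  for test in results["tests"]:
      metrics = test.get("metrics", {})
      # Assumption: MicroBenchmarks report μs, everything else seconds.
      if "exec_time" in metrics and "MicroBenchmarks" in test["name"]:
          metrics["exec_time"] /= 1e6   # μs -> s

  with open("out.normalized.json", "w") as f:
      json.dump(results, f, indent=2)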

>
> The second question has to do with re-running the benchmarks: I do
> cmake + make + llvm-lit -v -j 1 -o out.json .
> but if I try to run the llvm-lit step a second time, it just does/shows nothing.
> Is there any reason that the benchmarks can't be run a second time? Could I somehow run them again?

Running the programs a second time has worked for me in the past.
Remember to change the output to another file, or the previous .json
will be overwritten.


> Lastly, slightly off-topic, but while we're on the subject of benchmarking,
> do you think it's reliable to run with -j <number of cores>? I'm a little bit afraid of
> the shared caches (because misses should be counted in the CPU time, which
> is what is measured in "exec_time" AFAIU)
> and any potential multi-threading that the tests may use.

It depends. You can run in parallel, but then you should increase the
number of samples (executions) appropriately to counter the increased
noise. Depending on how many cores your system has, it might not be
worth it; instead, try to make the system as deterministic as
possible (single thread, thread affinity, avoid background processes,
use perf instead of timeit, avoid context switches, etc.). To avoid
the systematic bias of the same cache-sensitive programs always running
in parallel with each other, use the --shuffle option.
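
As an illustration only (this is what I would try, not an official
recipe): on Linux one could pin the run to a fixed set of cores and
collect several shuffled samples, each written to its own results file,
e.g. with a small wrapper like this:

  import os
  import subprocess

  # Pin this process (and its children, i.e. llvm-lit and the benchmarks)
  # to four dedicated cores. Linux-only call.
  os.sched_setaffinity(0, {2, 3, 4, 5})

  # Collect several samples to average out noise from the parallel run;
  # --shuffle randomizes the order so the same cache-sensitive programs
  # don't always run next to each other. Assumes llvm-lit is on PATH and
  # the current directory is the test-suite build directory.
  for sample in range(5):
      subprocess.run(
          ["llvm-lit", "-j", "4", "--shuffle",
           "-o", f"results.sample{sample}.json", "."],
          check=True,
      )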

Michael

