[llvm-dev] Noisy benchmark results?

Mon Feb 27 11:42:13 PST 2017

In addition to all the good points given in this thread:

- Nowadays I'd recommend using 'lnt runtest test-suite' instead of 'nt' to use the cmake/lit based variant.
- Alternatively, if you just need an A/B comparison, run the benchmarks directly as described in http://www.llvm.org/docs/TestSuiteMakefileGuide.html#running-the-test-suite-via-cmake and compare the results with test-suite/utils/compare.py
- Use --benchmarking-only (lnt) / -DTEST_SUITE_BENCHMARKING_ONLY=ON (cmake) to exclude a number of tests that are useless for performance testing (like all the unit tests in there)
- I created a blacklist of benchmarks that are noisy for my target by rerunning the test-suite a few times with the same compiler. I can feed this blacklist to `utils/compare.py --filter-blacklist`
- While we are on the topic, I recommend this talk from last year's dev meeting to dampen the expectation that every good compiler transformation must lead to better (or at least neutral) performance: https://www.youtube.com/watch?v=IX16gcX4vDQ&t=24s I think one lesson we should draw from this is that we can use benchmarking as an indicator of problems, but there is no way around manually checking the assembly differences for the cases where we measured different performance.
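
The blacklist idea from the list above (rerun with the same compiler, then mark the benchmarks whose times move around) could be sketched roughly like this in Python. This is only an illustration under assumptions: the function name, the 2% threshold, the benchmark names and the timing numbers are all made up, and how you collect per-benchmark times from your runs is left out. Printing one name per line is also an assumption about what a blacklist file should look like, so check what compare.py actually expects.

```python
import statistics


def noisy_benchmarks(runs, threshold=0.02):
    """Return benchmark names whose run-to-run spread exceeds threshold.

    runs maps a benchmark name to a list of execution times (seconds)
    measured in repeated runs with the SAME compiler. The spread is the
    sample standard deviation divided by the mean.
    """
    noisy = []
    for name, times in sorted(runs.items()):
        if len(times) < 2:
            continue  # a single sample says nothing about spread
        mean = statistics.mean(times)
        if mean == 0:
            continue  # avoid dividing by zero for degenerate timings
        if statistics.stdev(times) / mean > threshold:
            noisy.append(name)
    return noisy


# Hypothetical timings from three runs with an identical compiler.
runs = {
    "bench/stable": [1.00, 1.01, 1.00],
    "bench/noisy": [1.00, 1.30, 0.85],
}

# Print the noisy names, one per line, for use as a blacklist.
print("\n".join(noisy_benchmarks(runs)))
```

The threshold is a judgment call: a benchmark that varies by more than the size of the effect you hope to measure cannot confirm or refute that effect, so pick it relative to the changes you care about.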

- Matthias

> On Feb 27, 2017, at 12:46 AM, Mikael Holmén via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Hi,
> 
> I'm trying to run the benchmark suite:
> http://llvm.org/docs/TestingGuide.html#test-suite-quickstart
> 
> I'm doing it the lnt way, as described at:
> http://llvm.org/docs/lnt/quickstart.html
> 
> I don't know what to expect, but the results seem to be quite noisy and unstable. E.g., I've done two runs on two different commits that only differ by a space in CODE_OWNERS.txt on my 12 core ubuntu 14.04 machine with:
> 
> lnt runtest nt --sandbox SANDBOX --cc <path-to-my-clang> --test-suite /data/repo/test-suite -j 8
> 
> And then I get the following top execution time regressions:
> http://i.imgur.com/sv1xzlK.png
> 
> The numbers bounce around a lot if I do more runs.
> 
> Given the amount of noise I see here, I don't know how to sort out significant regressions if I actually make a real change in the compiler.
> 
> Are the above results expected?
> 
> How to use this?
> 
> 
> As a bonus question, if I instead run the benchmarks with an added -m32:
> lnt runtest nt --sandbox SANDBOX --cflag=-m32 --cc <path-to-my-clang> --test-suite /data/repo/test-suite -j 8
> 
> I get three failures:
> 
> --- Tested: 2465 tests --
> FAIL: MultiSource/Applications/ClamAV/clamscan.compile_time (1 of 2465)
> FAIL: MultiSource/Applications/ClamAV/clamscan.execution_time (494 of 2465)
> FAIL: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.execution_time (495 of 2465)
> 
> Is this known/expected, or am I doing something stupid?
> 
> Thanks,
> Mikael
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev