[llvm-dev] Floating point variance in the test suite

Kaylor, Andrew via llvm-dev llvm-dev at lists.llvm.org
Thu Jun 24 11:13:05 PDT 2021


> If you truly want to benchmark LLVM, you should really be running specific benchmarks in specific ways and looking very carefully at the results, not relying on the test-suite.

This gets at my questions about which benchmarks are important and who considers them to be important. I expect a lot of us have non-public testing going on for the benchmarks that we consider to be critical. I see the test suite benchmarks as more of a guard rail to catch changes that degrade performance early and in a way that is convenient for other community members to address. So, to me, the benchmarks don’t have to be perfect measures. On the other hand, if we just disable things like fast-math and FMA, the benchmarks won’t tell us anything at all about the impact of changes touching those optimizations.

> What we want is to make sure the program doesn't generate garbage, but garbage means different things for different tests, and having an external tool that knows what each of the tests think is garbage is not practical.

Yes, I agree. Your example in Bugzilla of NEON versus VFP instructions brings up another issue. If I run a test with value-changing optimizations enabled, small variations are acceptable, but if I run the same test with “precise” floating point options, I shouldn’t see any differences from the expected results (depending, of course, on library implementations).

So, I think we need a way for each test to indicate whether it can be run in value-unsafe modes, to set different tolerances for different modes, and to be built to run differently in different modes. For example, if I’m running the Blur test in a value-safe mode, there’s no need to perform an internal comparison, and a hashed output comparison can be used. If I’m running with fp-contract=on or fast-math, I’d want an internal value check, but those modes might have different tolerances. Finally, I might want a way to run the test as a benchmark with either fp-contract=on or fast-math without any check of the results, in order to get better performance data.
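
To make that concrete, here is a minimal sketch of how a test's source could select its verification strategy per mode. The FP_MODE_* macros, the helper's name, and the tolerance values are all hypothetical, just one possible shape for what the harness could define, not existing test-suite infrastructure:

    /* Sketch only: the FP_MODE_* macros are hypothetical names the build
       system would define based on the FP flags used for the run. */
    #include <math.h>
    #include <stddef.h>

    static int check_result(const double *out, const double *ref, size_t n)
    {
    #if defined(FP_MODE_PRECISE)
      /* Precise mode: results must match bit for bit (or hash the output). */
      for (size_t i = 0; i < n; ++i)
        if (out[i] != ref[i])
          return 0;
    #elif defined(FP_MODE_FAST) || defined(FP_MODE_CONTRACT)
      /* Value-changing modes: allow a mode-specific relative tolerance. */
    # if defined(FP_MODE_FAST)
      const double tol = 1e-4;   /* looser bound for fast-math */
    # else
      const double tol = 1e-6;   /* tighter bound for fp-contract=on */
    # endif
      for (size_t i = 0; i < n; ++i)
        if (fabs(out[i] - ref[i]) > tol * fmax(fabs(ref[i]), 1.0))
          return 0;
    #else
      /* Pure benchmark run: skip verification to keep the timing clean. */
      (void)out; (void)ref; (void)n;
    #endif
      return 1;
    }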

As for updating the tests, I’m going to bring up test ownership again because I don’t know what constitutes acceptable variation for any given test. I could take a guess at it, but if I get it wrong, my wrong guess becomes semi-enshrined in the test suite and may not be noticed by people who would know better.

For the blur example, the FMA is happening on this line:

          sum_in_current_frame += (inputImage[i + k][j + l] *
                                   gaussianFilter[k + offset][l + offset]);

That’s an accumulated result inside four nested loops. It looks like in practice the differently rounded results with FMA must be getting averaged out most of the time, which makes sense assuming a relatively consistent magnitude of values, but I’d have to study the algorithm to understand exactly what’s happening and how to check the results reliably for a range of inputs. I think that’s too much to expect from someone who is just making some optimization change that triggers a failure in the test.
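
For anyone who wants to see the rounding effect in isolation, here is a tiny standalone example (not taken from the Blur test, just an illustration) showing that a fused multiply-add rounds once where the separate multiply and add round twice; compiling with -ffp-contract=off keeps the compiler from fusing the "separate" version itself:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
      /* Values chosen so that a*b is not exactly representable. */
      double a = 1.0 + 0x1p-27;
      double b = 1.0 + 0x1p-27;
      double c = -1.0;

      double p = a * b;               /* rounded once here...           */
      double separate = p + c;        /* ...and rounded again here      */
      double fused = fma(a, b, c);    /* a*b + c with a single rounding */

      printf("separate = %.17g\n", separate);
      printf("fused    = %.17g\n", fused);
      return 0;
    }

In a long accumulation like the one above, each iteration can pick up that kind of one-ULP difference, and whether the differences cancel or build up depends on the data.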

In the case that led me to start the discussion this week, Melanie was just making the behavior of clang match its documentation. She didn’t even change any optimizations. The failures that were exposed would always have happened if certain compilation options were used. Naturally, she just wanted to not turn any buildbots red. Then I started looking at the failing tests and ended up opening this can of worms.

-Andy

From: Renato Golin <rengolin at gmail.com>
Sent: Thursday, June 24, 2021 1:06 PM
To: Kaylor, Andrew <andrew.kaylor at intel.com>
Cc: llvm-dev at lists.llvm.org; Michael Kruse <llvmdev at meinersbur.de>; amykibm at gmail.com; Hubert Tong <hubert.reinterpretcast at gmail.com>
Subject: Re: [llvm-dev] Floating point variance in the test suite

Hi Andrew,

Sorry, I didn't see this before. My reply on Bugzilla didn't take the contents here into account, so it's probably moot.

On Thu, 24 Jun 2021 at 17:22, Kaylor, Andrew <andrew.kaylor at intel.com> wrote:

I don't agree that the result doesn't matter for benchmarks. It seems that the benchmarks are some of the best tests we have for exercising optimizations like this, and if the result is wrong by a wide enough margin, that could indicate a problem. But I understand Renato’s point that the performance measurement is the primary purpose of the benchmarks, and some numeric differences should be acceptable.

Yes, that's the point I was trying to make. You can't run a benchmark without understanding what it does and what the results mean. Small variations can be fine in one benchmark and totally unacceptable in others. However, what we have in the test-suite are benchmarks-turned-tests and tests-turned-benchmarks, where small differences in the output matter a lot less, but it still matters if the output is totally different (e.g. error messages, NaNs). My comment applied just to the subset we have in the test-suite, not to benchmarks in general.

If you truly want to benchmark LLVM, you should really be running specific benchmarks in specific ways and looking very carefully at the results, not relying on the test-suite.


In the previous discussion of this issue, Sebastian Pop proposed having the program run twice -- once with "precise" FP results, and once with the optimizations being tested. For the Blur test, the floating point results are only intermediate and the final (printed) results are a matrix of 8-bit integers. I’m not sure what would constitute an acceptable result for this program. For any given value, an off-by-one result seems acceptable, but if there are too many off-by-one values, that would probably indicate a problem. In the Polybench tests, Sebastian modified the tests to do a comparison within the test itself. I don’t know if that’s practical for Blur or if it would be better to have two runs and use a custom comparison tool.

Given the point above about the difference between benchmarks and test-suite benchmarks, I think having comparisons inside the program itself is probably the best way forward. I should have mentioned that in my list, as I've done that in the test-suite too.

The main problem with that, for benchmarks, is that they can add substantial runtime and change the profile of the test. But that can be easily fixed by iterating a few more times on the kernel (from the ground state).

What we want is to make sure the program doesn't generate garbage, but garbage means different things for different tests, and having an external tool that knows what each of the tests think is garbage is not practical.

The way I see it, there are only three types of comparison:
 * Text comparison, for tests that must be identical on every platform.
 * Hash comparison, for those above where the output is too big.
 * FP-comparison, for those where the text and integers must be identical but the FP numbers can vary a bit.

The weird behaviour of fpcmp looking at hashes and comparing the numbers in them is a bug, IMO. As is comparing integers and allowing wiggle room.

Using fpcmp for comparing text is fine, because what it does with text and integers should be exactly what diff does, and if the text contains FP output, that output can also change depending on precision, which is mostly fine.

To me, the path forward is to fix the tests that break with one of the alternatives above, and make sure fpcmp doesn't identify hex, octal, binary or integers as floating-point, and treat them all as text.
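
As a rough sketch of what that token classification could look like (this is not fpcmp's actual code, and the tolerance policy is just an example):

    #include <math.h>
    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Treat a token as floating-point only if it is a full FP literal
       containing a '.' or an exponent; hex/octal/binary prefixes and
       plain integers fall through to exact text comparison. */
    static bool is_fp_token(const char *s)
    {
      if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X' ||
                          s[1] == 'b' || s[1] == 'B'))
        return false;
      if (!strpbrk(s, ".eE"))
        return false;
      char *end;
      strtod(s, &end);
      return end != s && *end == '\0';
    }

    static bool tokens_match(const char *a, const char *b, double rel_tol)
    {
      if (is_fp_token(a) && is_fp_token(b)) {
        double x = strtod(a, NULL), y = strtod(b, NULL);
        /* Relative comparison with a floor so values near zero don't
           force an impossibly tight bound. */
        return fabs(x - y) <= rel_tol * fmax(fmax(fabs(x), fabs(y)), 1.0);
      }
      /* Everything else -- text, integers, hex -- must match exactly. */
      return strcmp(a, b) == 0;
    }

The key property is just that nothing without explicit FP syntax ever gets wiggle room.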

For the Blur test, a quick comparison between the two matrices inside the program (with appropriate wiggle room) would suffice.
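
Something along these lines, for example (the off-by-one threshold and the 1% cap are placeholder numbers, not values anyone has validated for Blur):

    #include <stdlib.h>

    /* Compare the precise and optimized 8-bit output images: reject any
       difference larger than one, and reject widespread off-by-one drift. */
    static int images_match(const unsigned char *precise,
                            const unsigned char *optimized, size_t n)
    {
      size_t off_by_one = 0;
      for (size_t i = 0; i < n; ++i) {
        int diff = abs((int)precise[i] - (int)optimized[i]);
        if (diff > 1)
          return 0;
        off_by_one += (diff == 1);
      }
      return off_by_one <= n / 100;   /* tolerate isolated rounding flips */
    }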