[PATCH] D25277: [test-suite] Run FP tests twice with -ffp-contract=off/on

Hal Finkel via llvm-commits <llvm-commits at lists.llvm.org>
Thu Oct 6 08:43:49 PDT 2016


hfinkel added a comment.

In https://reviews.llvm.org/D25277#563486, @rengolin wrote:

> > With this patch all the tests specifying an FP_TOLERANCE or FP_ABSTOLERANCE will be run in two stages:
> > 
> > the first step runs with the extra flags "-ffp-contract=off -fno-fast-math", recording the full output that will be checked against in the second step;
> >  the second step runs with the default flags specified by the user, and checks against the output recorded in the first step, using the FP_TOLERANCE specified by the test.
>
> This doesn't make sense. If there's an FP bug, it can manifest on both sides, meaning we'd only be catching contract problems, not FP problems.
>
> The proposed way was to have the exact same reference output and compare with a higher FP tolerance when running with fp-contract=on.
>
> > There are at least two main problems with this approach:
>
>
>
> > some of the current golden files in the test-suite require testing with FP_TOLERANCE even for stage1
>
> Of course, as expected.
>
> > the build directory goes from 300M to 1.2G due to the extra reference outputs recorded by stage1, and for at least 2 tests (polybench/bicg and polybench/mvt, disabled in the current patch) fpcmp takes more than 10 minutes to completely parse the outputs of stages 1 and 2. Running the test-suite on small devices will cost 1G more in transfer over the network.
>
> This is not acceptable.


We should look at making the polybench tests run on smaller data sets for these kinds of comparisons. As I recall, we're running into this problem because the current polybench setups are designed to run in a regime where the tests are memory-bandwidth bound. That is where polyhedral transformations are often profitable, which is good, but it is not as useful for this kind of regression testing.

We can keep large polybench tests; they're useful for performance testing of higher-order transformations. They should, however, hash their own output so that we have an easy comparison. We can do full dumps on smaller-sized runs. As I've suggested previously, we should save the output of the IEEE-strict run, compare it to the hashed output, then compile and run in the default mode, and compare that to the IEEE-strict output with a higher tolerance.
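[The tolerance comparison in the last step is what fpcmp does; as a rough sketch of that kind of check (the function name and interface here are illustrative, not fpcmp's real ones), a value passes if either the absolute or the relative error is within its bound:]

```python
def within_tolerance(ref, candidate, abs_tol=0.0, rel_tol=0.0):
    """Accept candidate if it is within the absolute tolerance of ref,
    or within the relative tolerance when ref is nonzero."""
    diff = abs(ref - candidate)
    if diff <= abs_tol:
        return True
    if ref != 0.0 and diff / abs(ref) <= rel_tol:
        return True
    return False

# IEEE-strict result vs. the same computation with contraction enabled:
strict = 0.0
contracted = 5.551115123125783e-17
print(within_tolerance(strict, contracted, abs_tol=1e-12))  # True
print(within_tolerance(strict, contracted))                 # False
```

[With zero tolerances this degenerates to an exact comparison, which is what the IEEE-strict run would be checked with; the default-mode run would pass the higher tolerances.]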

> 
> 
>> The current implementation has another smaller technical issue:
>>  lit does not allow specifying dependencies between tests.
>>  When running "lit -j8" one can end up running stage 2 before stage 1.
> 
> Ouch, that's horrible!

I think that running all of the configurations of a benchmark should happen as one "test", in the lit sense; i.e., the test file that is generated should have multiple RUN lines.

> 
> 
>> The solution I was considering is to run step 1 as a PREPARE: step,
>>  though that would fail to add the exit code in the output,
>>  and that would end up mismatching the hash.
> 
> That's way too complex. Why not have two different tolerances for each run?

No, we can have multiple RUN lines.
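[A sketch of what such a generated test file might contain; the RUN-line substitutions (%clang, %cflags, %abstol) are illustrative, not the test-suite's actual lit configuration:]

```
// RUN: %clang %cflags -ffp-contract=off -fno-fast-math %s -o %t.strict
// RUN: %t.strict > %t.strict.out
// RUN: %clang %cflags %s -o %t.default
// RUN: %t.default > %t.default.out
// RUN: fpcmp -a %abstol %t.strict.out %t.default.out
```

[Since lit executes the RUN lines of a single test in order, this also sidesteps the stage-ordering problem under "lit -j8": the strict run is guaranteed to finish before the default run is compared against it.]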


https://reviews.llvm.org/D25277




