[llvm-dev] test-suite: a new proposal for how to move forward to make "test-suite" more automatic, more flexible, and more maintainable, especially WRT reference outputs

Kristof Beyls via llvm-dev llvm-dev at lists.llvm.org
Thu Oct 6 02:02:04 PDT 2016


Hi Abe,

My 2 cents:
I have been using the test-suite mainly in benchmarking mode as a convenient way to track performance changes in top-of-trunk.
I've observed that some of the programs (IIRC, especially the ones in SingleSource/Benchmarks/Polybench/) produce a lot of output (megabytes).
This caused a lot of noise in performance measurements, as the execution time was dominated by printing out the data, rather than the actual useful computations. Renato removed the worst noise in http://reviews.llvm.org/D10991.

That experience made me think that for the programs in the test-suite, ideally they should print out only a small amount of output to be checked.
For example, individual programs that output a lot of data could be adapted to print only a summary/aggregate of the data, chosen so that it is likely to change when a miscomputation happens.
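
As a rough illustration of the summary/aggregate idea, here is a minimal sketch. It is written in Python for brevity even though the benchmark programs themselves are C, and the function name `summarize` and the weighting scheme are assumptions made for this example, not anything taken from the test-suite.

```python
def summarize(values):
    """Reduce a large result array to two numbers that are cheap to print."""
    total = 0.0
    weighted = 0.0
    for i, v in enumerate(values):
        total += v
        # Position-dependent weight, so that misplaced elements are also
        # likely to change the summary, not only wrong values.
        weighted += (i % 7 + 1) * v
    return total, weighted


# Hypothetical usage: print one short line instead of the whole array.
# total, weighted = summarize(result_array)
# print(f"{total:.6e} {weighted:.6e}")
```

A summary like this cannot catch every miscomputation (errors that cancel out go unnoticed), so the choice of aggregate would need some care per program.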

If we could go in that direction, I don't see much need for storing hashes or even compressed output as reference data.
I think that needing compressed reference data may make the test-suite ever so slightly harder to set up: another dependency on an external tool. Not that I can imagine that having a dependency on e.g. gzip would be problematic on any platform.

Anyway, I thought I'd just share my opinion that, ideally, the programs in the test-suite would produce only small outputs, to avoid noisy benchmark results. If that is a direction we could go in, there may not be much need for storing hashes or compressed reference output.

Thanks,

Kristof

On 6 Oct 2016, at 00:29, Abe Skolnik via llvm-dev <llvm-dev at lists.llvm.org> wrote:

Dear all,

Today I had an idea that might satisfy all the needs for improvement we currently have "on the plate" WRT two things: the in-repo sizes of reference outputs, and FP optimizations, i.e. how to allow them while still supporting test programs in "test-suite" whose output[s] depend upon FP computations [and for which relatively small changes in FP accuracy, whether up/more-accurate or down/less-accurate, change the actual observed output].



For non-FP-dependent, fully-deterministic programs, we can choose the shortest [in # of bytes as reported by "ls"] of the following:

 * hash
 * compressed output
 * raw output

[in increasing order of "likely" size]

... or we can establish some minimum differentiating factors, e.g. "compressed output must be at least 2x smaller than raw output, otherwise stick to raw output" and "hash must be at least 10x smaller than compressed output, otherwise stick to compressed output".  If needed/{strongly desired}, the rules can even be a little more complicated than that, e.g. "compressed output must be at least 2x smaller than raw output OR at least 4096 bytes smaller than raw output, otherwise stick to raw output".
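
The thresholds above can be sketched as a small selection function. This is only an illustration under stated assumptions: the function name `choose_reference_form`, the use of gzip and SHA-256, and the parameter names are all hypothetical; only the 2x, 4096-byte, and 10x figures come from the text above.

```python
import gzip
import hashlib


def choose_reference_form(raw: bytes,
                          compress_ratio: float = 2.0,
                          compress_min_saving: int = 4096,
                          hash_ratio: float = 10.0) -> str:
    """Return "raw", "compressed", or "hash" for a deterministic output."""
    compressed = gzip.compress(raw)                            # assumption: gzip
    digest = hashlib.sha256(raw).hexdigest().encode("ascii")   # 64 bytes

    # "at least 2x smaller than raw OR at least 4096 bytes smaller than raw"
    compression_pays_off = (
        len(compressed) * compress_ratio <= len(raw)
        or len(raw) - len(compressed) >= compress_min_saving
    )
    if not compression_pays_off:
        return "raw"

    # "hash must be at least 10x smaller than compressed output"
    if len(digest) * hash_ratio <= len(compressed):
        return "hash"
    return "compressed"
```

With these defaults a tiny output stays raw, a highly repetitive one is stored compressed, and a large output that still compresses to more than 640 bytes is reduced to a hash.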



For programs that _are_ FP-dependent, not-fully-deterministic, or both, I propose that we choose only from the set {compressed output, raw output} because:

 1) small-enough variation in the result is expected, normal, and tolerated

and

 2) since this way the raw reference output will be available at the "lit"-running host [after decompression, if needed],
    the "fpcmp" program will be able to be told how much tolerance to allow for each run.
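
To make point 2 concrete, here is a minimal sketch of a tolerance-aware comparison against a raw reference output. It is not the actual "fpcmp" implementation or its command-line interface; the function name `outputs_match` and the regular expression are assumptions made for illustration.

```python
import math
import re

# Matches simple decimal numbers such as 3, -1.5, .25, or 6.02e23.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def outputs_match(actual: str, reference: str,
                  rel_tol: float = 0.0, abs_tol: float = 0.0) -> bool:
    """Compare two outputs, allowing numeric fields to differ slightly."""
    # The text between the numbers must be identical; this also guarantees
    # that both outputs contain the same count of numeric fields.
    if _NUMBER.split(actual) != _NUMBER.split(reference):
        return False
    pairs = zip(_NUMBER.findall(actual), _NUMBER.findall(reference))
    return all(
        math.isclose(float(a), float(r), rel_tol=rel_tol, abs_tol=abs_tol)
        for a, r in pairs
    )
```

A comparison like this needs the reference values themselves, which is why a hash alone cannot serve as the reference for these tests.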

If we only choose from the set {compressed ref. output, raw ref. output} for these tests, then it should be relatively easy to run some tests with output-changing FP optimizations enabled, since those runs won't depend on the {no-output-changing-FP-optimizations} build having run first.  Although Hal's suggestion to have the {no-output-changing-FP-optimizations} build produce the output that will be analyzed by the {output-changing FP optimizations enabled} builds is an excellent suggestion, it seems that implementing it in the context of "lit" is considerably more difficult than we had hoped.  If anybody reading this knows how to make "lit" start one test only after another one has finished, please chime in.


If compressed ref. outputs are accepted by the community, then please let me know which of the following formats it would be acceptable to depend on being able to decompress:

 bz2
 gzip
 xz

I'm perfectly willing to write [a] wrapper[s] that will probe the system for programs that can decompress whichever formats we accept and will choose the best one.
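
Such a probing wrapper might look roughly like the sketch below. The names `CANDIDATES`, `available_decompressors`, and `decompress` are hypothetical, and the preference order is arbitrary, since "best" is not defined above; the only facts relied on are that the xz, bzip2, and gzip command-line tools all accept `-d` (decompress) and `-c` (write to stdout).

```python
import shutil
import subprocess

# (tool name, reference-file suffix), in an arbitrary order of preference.
CANDIDATES = [("xz", ".xz"), ("bzip2", ".bz2"), ("gzip", ".gz")]


def available_decompressors(which=shutil.which):
    """Return the candidate tools found on PATH, in preference order."""
    return [(tool, suffix) for tool, suffix in CANDIDATES
            if which(tool) is not None]


def decompress(path: str, tool: str) -> bytes:
    """Decompress a reference file to memory using the given tool."""
    result = subprocess.run([tool, "-d", "-c", path],
                            check=True, stdout=subprocess.PIPE)
    return result.stdout
```

The `which` parameter exists only so that the probe can be exercised without depending on what is installed on the host.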


Regards,

Abe