[cfe-dev] Rotten Green Tests project

David Blaikie via cfe-dev cfe-dev at lists.llvm.org
Fri Feb 26 12:23:20 PST 2021


Initial gut reaction: this is perhaps a big enough patch/divergence from
upstream gtest that it should go into upstream gtest first, and then maybe
we'd sync a more recent gtest into LLVM? Though I realize that's a bit of
a big undertaking (either/both of those steps). How does this compare to
the other local patches we carry against gtest?

On Fri, Feb 26, 2021 at 10:47 AM via cfe-dev <cfe-dev at lists.llvm.org> wrote:

> This note describes the first part of the Rotten Green Tests project.
>
> "Rotten Green Tests" is the title of a paper presented at the 2019
> International Conference on Software Engineering (ICSE).  Stripped to
> its essentials, the paper describes a method to identify defects or
> oversights in executable tests.  The method has two steps:
>
> (a) Statically identify all "test assertions" in the test program.
> (b) Dynamically determine whether these assertions are actually
> executed.
>
> A test assertion that has been coded but is not executed is termed a
> "rotten green" test, because it allows the test to be green (i.e.,
> pass) without actually enforcing the assertion.  In many cases it is
> not immediately obvious, just by reading the code, that the test has a
> problem; the Rotten Green Test method helps identify these.
>
> The paper describes using this method on projects coded in Pharo
> (which appears to be a Smalltalk descendant) and so the specific tools
> are obviously not applicable to a C++ project such as LLVM.  However,
> the concept can be easily transferred.
>
> I applied these ideas to the Clang and LLVM unittests, because these
> are all executable tests that use the googletest infrastructure.  In
> particular, all "test assertions" are easily identified because they
> make use of macros defined by googletest; by modifying these macros,
> it should be feasible to keep track of all assertions, and report
> whether they have been executed.
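>
> For context, a typical assertion site in a unittest looks something
> like this (a minimal, invented example, not taken from any particular
> LLVM unittest; computeAnswer and the test name are hypothetical):
>
>   #include "gtest/gtest.h"
>
>   static int computeAnswer() { return 42; }   // hypothetical helper
>
>   TEST(RGTDemoTest, TracksAssertions) {       // hypothetical test
>     int Value = computeAnswer();
>     EXPECT_EQ(Value, 42);    // a "test assertion" RGT can track
>     if (Value > 100)
>       ASSERT_TRUE(false);    // never executed, so RGT would flag it
>   }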
>
> The mildly gory details can be saved for the code review and of course
> an LLVM Dev Meeting talk, but the basic idea is: Each test-assertion
> macro will statically allocate a struct identifying the source
> location of the macro, and have an executable statement recording
> whether that assertion has been executed.  Then, when the test exits,
> we look for any of these records that haven't been executed, and
> report them.
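>
> A rough sketch of that idea (not the actual code in D97566; the names
> below are invented for illustration, and the real patch wires this
> into the gtest macros themselves rather than a wrapper macro):
>
>   struct RGTAssertionRecord {        // one record per assertion site
>     const char *File;
>     int Line;
>     bool Executed = false;
>   };
>
>   void registerAssertion(RGTAssertionRecord *R);  // assumed hook
>
>   #define RGT_EXPECT_TRUE(Cond)                                      \
>     do {                                                             \
>       static RGTAssertionRecord Rec{__FILE__, __LINE__};             \
>       static bool Registered = (registerAssertion(&Rec), true);      \
>       (void)Registered;                                              \
>       Rec.Executed = true;   /* mark this site as executed */        \
>       EXPECT_TRUE(Cond);                                             \
>     } while (0)
>
> At exit, a reporting hook walks the registered records and prints any
> whose Executed flag was never set.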
>
> I've gotten this to work in three environments so far:
> 1) Linux, with gcc as the build compiler
> 2) Linux, with clang as the build compiler
> 3) Windows, with MSVC as the build compiler
>
> The results are not identical across the three environments.  Besides
> the obvious case that some tests simply don't operate on both Linux
> and Windows, there are some subtleties that cause the infrastructure
> to work less well with gcc than with clang.
>
> The infrastructure depends on certain practices in coding the tests.
>
> First and foremost, it depends on tests being coded to use the
> googletest macros (EXPECT_* and ASSERT_*) to express individual test
> assertions.  This is generally true in the unittests, although not as
> universal as might be hoped; ClangRenameTests, for example, buries a
> handful of test assertions inside helper methods, which is a plausible
> coding tactic but makes the RGT infrastructure less useful (because
> many separate tests funnel through the same EXPECT/ASSERT macros, and
> so RGT can't discern whether any of those higher-level tests are
> rotten).
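>
> A simplified, invented illustration of that pattern (not the actual
> ClangRenameTests code; runRename is a hypothetical helper):
>
>   #include <string>
>   #include "gtest/gtest.h"
>
>   std::string runRename(const std::string &Code);  // hypothetical
>
>   // Many distinct tests funnel through this single EXPECT_EQ, so RGT
>   // sees only one assertion site and cannot tell the tests apart.
>   static void expectRename(const std::string &Before,
>                            const std::string &After) {
>     EXPECT_EQ(runRename(Before), After);
>   }
>
>   TEST(RenameTest, Locals) { expectRename("int x;", "int y;"); }
>   TEST(RenameTest, Fields) { expectRename("struct S { int x; };",
>                                           "struct S { int y; };"); }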
>
> Secondly, "source location" is constrained to filename and line number
> (__FILE__ and __LINE__), therefore we can have at most one assertion
> per source line.  This is generally not a problem, although I did need
> to recode one test that used macros to generate assertions (replacing
> it with a template).  In certain cases it also means gcc doesn't let
> us distinguish multiple assertions, mainly in nested macros, for an
> obscure reason.  But those situations are not very common.
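>
> To illustrate the one-assertion-per-line constraint (an invented
> example, not the test I actually recoded; isValid and getValue are
> hypothetical helpers): a macro that expands to several assertions
> gives them all the same __LINE__, the line of the invocation, so RGT
> cannot tell them apart:
>
>   #define CHECK_VALUE(V)     \
>     EXPECT_TRUE(isValid(V)); \
>     EXPECT_GT((V), 0)
>
>   // Both assertions are recorded at this one source line.
>   TEST(ValueTest, Basic) { CHECK_VALUE(getValue()); }
>
> Rewriting the macro as a template (or a plain function) gives each
> assertion its own source line in the function body:
>
>   template <typename T> void checkValue(T V) {
>     EXPECT_TRUE(isValid(V));  // distinct line
>     EXPECT_GT(V, 0);          // distinct line
>   }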
>
> There are a noticeable number of false positives, with two primary
> sources. One is that googletest has a way to mark a test as DISABLED;
> this test is still compiled, although never run, and all its
> assertions will therefore show up as rotten.  The other is due to the
> common LLVM practice of making environmental decisions at runtime
> rather than compile time; for example, using something like 'if
> (isWindows())' rather than an #ifdef.  I've recoded some of the easier
> cases to use #ifdef, in order to reduce the noise.
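>
> For example, a runtime check like this (an invented illustration;
> isWindowsHost and getSeparator are hypothetical helpers):
>
>   TEST(PathTest, Separator) {
>     if (isWindowsHost()) {              // decided at runtime
>       EXPECT_EQ(getSeparator(), '\\');  // rotten when run on Linux
>     } else {
>       EXPECT_EQ(getSeparator(), '/');   // rotten when run on Windows
>     }
>   }
>
> can be recoded so the assertion that can never execute on the current
> host is never compiled in:
>
>   TEST(PathTest, Separator) {
>   #ifdef _WIN32
>     EXPECT_EQ(getSeparator(), '\\');
>   #else
>     EXPECT_EQ(getSeparator(), '/');
>   #endif
>   }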
>
> Some of the noise appears to be irreducible, which means that if we
> don't want bots to be constantly red, we have to leave RGT reporting
> off by default.
>
> Well... actually... it is ON by default; however, I turn it off in
> lit.  So, if you run `check-llvm` or use `llvm-lit` to run unittests,
> they won't report rotten green tests.  However, if you run a program
> directly, it will report them (and cause the test program to exit with
> a failure status).  This seemed like a reasonable balance that would
> make RGT useful while developing a test, without interfering with
> automation.
>
>
> The overall results are quite satisfying; there are many true
> positives, generally representing coding errors within the tests.
> A half-dozen of the unittests have been fixed, with more to come, and
> the RGT patch itself is at: https://reviews.llvm.org/D97566
>
> Thanks,
> --paulr
> _______________________________________________
> cfe-dev mailing list
> cfe-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>