[llvm-dev] Rotten Green Tests project

via llvm-dev llvm-dev at lists.llvm.org
Fri Feb 26 10:47:03 PST 2021


This note describes the first part of the Rotten Green Tests project.

"Rotten Green Tests" is the title of a paper presented at the 2019
International Conference on Software Engineering (ICSE).  Stripped to
its essentials, the paper describes a method to identify defects or
oversights in executable tests.  The method has two steps:

(a) Statically identify all "test assertions" in the test program.
(b) Dynamically determine whether these assertions are actually
executed.

A test assertion that has been coded but is not executed is termed a
"rotten green" test, because it allows the test to be green (i.e.,
pass) without actually enforcing the assertion.  In many cases it is
not immediately obvious, just by reading the code, that the test has a
problem; the Rotten Green Test method helps identify these.
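
To make this concrete, here is a contrived googletest example (Widget
and makeWidgets are invented for illustration):

  #include "gtest/gtest.h"
  #include <vector>

  struct Widget { bool isValid() const { return true; } };
  std::vector<Widget> makeWidgets() { return {}; }  // oops: empty

  // Rotten green: the EXPECT_TRUE is compiled but never executed,
  // because the loop body never runs; the test still passes.
  TEST(WidgetTest, AllWidgetsValid) {
    for (const Widget &W : makeWidgets())
      EXPECT_TRUE(W.isValid());
  }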

The paper describes using this method on projects coded in Pharo
(which appears to be a Smalltalk descendant) and so the specific tools
are obviously not applicable to a C++ project such as LLVM.  However,
the concept can be easily transferred.

I applied these ideas to the Clang and LLVM unittests, because these
are all executable tests that use the googletest infrastructure.  In
particular, all "test assertions" are easily identified because they
make use of macros defined by googletest; by modifying these macros,
it should be feasible to keep track of all assertions, and report
whether they have been executed.

The mildly gory details can be saved for the code review and of course
an LLVM Dev Meeting talk, but the basic idea is: Each test-assertion
macro will statically allocate a struct identifying the source
location of the macro, and have an executable statement recording
whether that assertion has been executed.  Then, when the test exits,
we look for any of these records that haven't been executed, and
report them.
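
As a minimal sketch of the mechanism (ELF/Linux-specific, and not the
actual code in the patch; all names here are invented):

  #include <cstdio>

  struct RGTRecord {
    const char *File;
    int Line;
    bool Executed;
  };

  // Constant-initialized statics placed in a dedicated section can be
  // enumerated at exit via the linker-provided __start_/__stop_
  // symbols, without the code around them ever having to run.
  #define RGT_POINT()                                                  \
    do {                                                               \
      static RGTRecord Rec __attribute__((section("rgt_records"))) =   \
          {__FILE__, __LINE__, false};                                 \
      Rec.Executed = true;                                             \
    } while (0)

  extern "C" RGTRecord __start_rgt_records[], __stop_rgt_records[];

  // Called when the test program exits.
  void reportRottenAssertions() {
    for (RGTRecord *R = __start_rgt_records; R != __stop_rgt_records; ++R)
      if (!R->Executed)
        fprintf(stderr, "Rotten assertion at %s:%d\n", R->File, R->Line);
  }

In the actual patch, something with this effect is folded into each
EXPECT_*/ASSERT_* macro, and the exit-time scan produces the report.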

I've gotten this to work in three environments so far:
1) Linux, with gcc as the build compiler
2) Linux, with clang as the build compiler
3) Windows, with MSVC as the build compiler

The results are not identical across the three environments.  Besides
the obvious case that some tests simply don't operate on both Linux
and Windows, there are some subtleties that cause the infrastructure
to work less well with gcc than with clang.

The infrastructure depends on certain practices in coding the tests.

First and foremost, it depends on tests being coded to use the
googletest macros (EXPECT_* and ASSERT_*) to express individual test
assertions.  This is generally true in the unittests, although not as
universal as might be hoped; ClangRenameTests, for example, buries a
handful of test assertions inside helper methods, which is a plausible
coding tactic but makes the RGT infrastructure less useful (because
many separate tests funnel through the same EXPECT/ASSERT macros, and
so RGT can't discern whether any of those higher-level tests are
rotten).
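
Schematically, the funneling pattern looks like this (runRename is a
stand-in, not the actual helper):

  std::string runRename(const std::string &Code);  // stand-in

  // Every test funnels through this single EXPECT_EQ, so RGT records
  // one assertion site, executed by whichever caller runs; it cannot
  // tell whether each individual test exercised it.
  static void expectRenamed(const std::string &Before,
                            const std::string &After) {
    EXPECT_EQ(runRename(Before), After);
  }

  TEST(RenameTest, Field)  { expectRenamed("s.x", "s.y"); }
  TEST(RenameTest, Method) { expectRenamed("s.f()", "s.g()"); }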

Secondly, "source location" is constrained to filename and line number
(__FILE__ and __LINE__), so we can track at most one assertion per
source line.  This is generally not a problem, although I did need
to recode one test that used macros to generate assertions (replacing
it with a template).  In certain cases it also means gcc doesn't let
us distinguish multiple assertions, mainly in nested macros, for an
obscure reason.  But those situations are not very common.
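
A contrived illustration of the macro problem and the template fix
(not the actual test mentioned above):

  // Before: both EXPECTs inherit the __LINE__ of the invocation site,
  // so each expansion collapses into a single RGT record.
  #define CHECK_IN_RANGE(V) \
    EXPECT_GE(V, 0);        \
    EXPECT_LT(V, 100)

  // After: inside a function template, each assertion sits on its own
  // source line and keeps a distinct location.
  template <typename T> void checkInRange(T V) {
    EXPECT_GE(V, 0);
    EXPECT_LT(V, 100);
  }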

There are a noticeable number of false positives, with two primary
sources.  One is that googletest has a way to mark a test as DISABLED;
such a test is still compiled, although never run, and all its
assertions will therefore show up as rotten.  The other is due to the
common LLVM practice of making environmental decisions at runtime
rather than compile time; for example, using something like 'if
(isWindows())' rather than an #ifdef.  I've recoded some of the easier
cases to use #ifdef, in order to reduce the noise.
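
Schematically (isWindows and pathSeparator are invented stand-ins):

  // Runtime decision: on Linux the first EXPECT_EQ is compiled but
  // never executed, so RGT reports it as rotten.
  TEST(PathTest, SeparatorRuntime) {
    if (isWindows())
      EXPECT_EQ(pathSeparator(), '\\');
    else
      EXPECT_EQ(pathSeparator(), '/');
  }

  // Compile-time decision: the inapplicable EXPECT_EQ is never built,
  // so there is nothing for RGT to (mis)report.
  TEST(PathTest, SeparatorIfdef) {
  #ifdef _WIN32
    EXPECT_EQ(pathSeparator(), '\\');
  #else
    EXPECT_EQ(pathSeparator(), '/');
  #endif
  }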

Some of the noise appears to be irreducible, which means that if we
don't want bots to be constantly red, we have to have RGT reporting
off by default.

Well... actually... it is ON by default; however, I turn it off in
lit.  So, if you run `check-llvm` or use `llvm-lit` to run unittests,
they won't report rotten green tests.  However, if you run a program
directly, it will report them (and cause the test program to exit with
a failure status).  This seemed like a reasonable balance that would
make RGT useful while developing a test, without interfering with
automation.


The overall results are quite satisfying; there are many true
positives, generally representing coding errors within the tests.
A half-dozen of the unittests have been fixed, with more to come, and
the RGT patch itself is at: https://reviews.llvm.org/D97566

Thanks,
--paulr

