[lldb-dev] proposal for reworked flaky test category

Mon Oct 19 12:50:06 PDT 2015

Hi all,

I'd like unexpected successes (i.e. tests marked as expected failure that
in fact pass) to retain the actionable meaning that something is wrong.
The wrong part is that either (1) the test now passes consistently and the
author of the fix just missed updating the test definition (or perhaps was
unaware of the test), or (2) the test is not covering the condition it is
testing completely, and some change to the code just happened to make the
test pass (due to the test being not comprehensive enough).  Either of
those requires some sort of adjustment by the developers.

We have a category of test known as "flaky" or "flakey" (both are valid
spellings, for those who care:
http://www.merriam-webster.com/dictionary/flaky, although flaky is
considered the primary).  Flaky tests are tests that we can't get to pass
100% of the time.  This might be because it is extremely difficult to write
the test as such and deemed not worth the effort, or it is a condition that
is just not going to present itself successfully 100% of the time.  These
are tests we still want to exercise, but we don't want to have them start
generating test failures if they don't pass 100% of the time.  Currently
the flaky test mechanism requires a test to pass at least once in two runs.
That is okay for a test that exhibits a slim degree of flakiness.  For
others, two runs is not a large enough sample to elicit a successful
result.  Those
tests get marked as XFAIL, and generate a non-actionable "unexpected
success" result when they do happen to pass.
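
To illustrate why two runs can be too small a sample, here is a rough
sketch of the arithmetic.  It assumes each run of a flaky test passes
independently with the same probability, which real tests may not satisfy.
The function names are made up for illustration and are not part of the
test suite.

```python
import math


def pass_probability(p_single, runs):
    """Chance of at least one pass in `runs` independent runs.

    Assumption: every run passes with the same probability `p_single`.
    """
    return 1.0 - (1.0 - p_single) ** runs


def runs_needed(p_single, confidence):
    """Smallest run count whose pass_probability reaches `confidence`."""
    if not 0.0 < p_single < 1.0:
        raise ValueError("p_single must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))


if __name__ == "__main__":
    # A test that passes half the time still fails both of two runs
    # one time in four.
    print(pass_probability(0.5, 2))
    # Runs needed for the same test to pass at least once with 99% confidence.
    print(runs_needed(0.5, 0.99))
```

Under those assumptions, a test that passes only half the time needs
noticeably more than two runs before a pass is close to certain.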

GOAL

* Enhance expectedFlakey* test decorators.  Allow specification of the
maximum number of times a flaky test should be run in order to be expected
to pass at least once.  Call that MAX_RUNS.

* When running a flaky test, run it up to MAX_RUNS times.  The first
time it passes, mark it as a successful test completion.  The test event
system will be given the number of times it was run before passing.
Whether we consume this info or not is TBD (and falls into the purview of
the test results formatter).

* If the test does not pass within MAX_RUNS runs, mark it as a flaky fail.
For purposes of the standard output, this can look like FAIL: (flaky) or
something similar so fail scanners still see it.  (Note it's highly likely
I'll do the normal output counts with the TestResults formatter-based
output at the same time, so we get accurate test method counts and the
like).

* Flaky tests never generate a non-actionable "unexpected pass".  This
occurs because we no longer need to mark tests as XFAIL when they require
more than two runs to get a high degree of confidence in a passing test.

* Flaky tests get marked with a flaky category, so that test runners can
choose to skip flaky tests by skipping the category.  This may not be
necessary if tests don't take an excessively long time to get a passing
grade with a high degree of confidence.
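
The bullets above could be sketched roughly as follows.  This is a minimal
standalone Python 3 illustration, not the existing expectedFlakey*
implementation.  The names expected_flaky, FlakyFailure, on_result and the
categories attribute are all hypothetical, and a real version would also
need to re-run test setup and teardown between attempts.

```python
import functools


class FlakyFailure(AssertionError):
    """Hypothetical error: a flaky test failed on every allowed run."""


def expected_flaky(max_runs=2, on_result=None):
    """Hypothetical decorator: retry a test up to `max_runs` times.

    `on_result(name, status, runs)` stands in for the test event system;
    it is an assumption made for this sketch.
    """
    if max_runs < 1:
        raise ValueError("max_runs must be at least 1")

    def decorator(test_func):
        @functools.wraps(test_func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, max_runs + 1):
                try:
                    result = test_func(*args, **kwargs)
                except Exception as error:
                    # Remember the failure and try again.
                    last_error = error
                    continue
                # First pass wins; report how many runs it took.
                if on_result is not None:
                    on_result(test_func.__name__, "PASS", attempt)
                return result
            # Never passed: report a flaky fail, not an XFAIL.
            if on_result is not None:
                on_result(test_func.__name__, "FAIL: (flaky)", max_runs)
            raise FlakyFailure(
                "%s did not pass in %d runs" % (test_func.__name__, max_runs)
            ) from last_error

        # Tag the test so a runner could skip the whole category.
        wrapper.categories = ("flaky",)
        return wrapper

    return decorator


if __name__ == "__main__":
    attempts = {"count": 0}

    @expected_flaky(max_runs=5, on_result=print)
    def test_sometimes_passes():
        attempts["count"] += 1
        assert attempts["count"] >= 3, "not yet"

    test_sometimes_passes()
```

Because a test that exhausts its runs raises an ordinary failure, this
sketch never produces an "unexpected pass" result; whether the run count
passed to on_result gets consumed is left to the results formatter, as
noted above.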

Let me know what you all think.  Once we come up with something, I'll
implement it.

-- 
-Todd