[lldb-dev] proposal for reworked flaky test category

Todd Fiala via lldb-dev lldb-dev at lists.llvm.org
Tue Oct 20 15:46:45 PDT 2015


I'm not totally sure yet here.

Right now there is a generic category mechanism, but it can only be set
via a file in a directory, or overridden via the test case class method
getCategories().

I think I'd want a more general decorator that allows you to tag a method
itself with categories.

So, something like:

class TestRaise(lldbtest.TestBase):
     # or maybe better, since it uses a constant:
     # @lldbtest.category(lldbtest.CATEGORY_FLAKEY)
     @lldbtest.category("flakey")
     def test_something_that_does_not_always_pass(self):
          pass

Then each method carries its own collection of categories, and the membership
check can fall back to the class instance (to hook into the existing mechanism).
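
A minimal sketch of how that could work (nothing below is the actual
lldbtest API; both function names are illustrative):

def category(*names):
    """Attach category names to a test method as an attribute."""
    def decorator(func):
        func.categories = getattr(func, "categories", []) + list(names)
        return func
    return decorator

def categories_for(test_case, method_name):
    """Merge per-method categories with the class-level getCategories()."""
    method = getattr(test_case, method_name)
    per_method = getattr(method, "categories", [])
    per_class = getattr(test_case, "getCategories", lambda: [])()
    return set(per_method) | set(per_class)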

-Todd


On Mon, Oct 19, 2015 at 4:40 PM, Zachary Turner <zturner at google.com> wrote:

> Yea, I definitely agree with you there.
>
> Is this going to end up with an @expectedFlakeyWindows,
> @expectedFlakeyLinux, @expectedFlakeyDarwin, @expectedFlakeyAndroid,
> @expectedFlakeyFreeBSD?
>
> It's starting to get a little crazy; at some point I think we just need
> something that we can use like this:
>
> @test_status(status=flaky, host=[win, linux, android, darwin, bsd],
> target=[win, linux, android, darwin, bsd], compiler=[gcc, clang],
> debug_info=[dsym, dwarf, dwo])
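>
> In Python terms, roughly (every name here is illustrative, not an
> existing decorator):
>
>     def test_status(status, host=None, target=None, compiler=None,
>                     debug_info=None):
>         """Record expected-status metadata on the test method; the
>         runner would consult it when deciding how to report results."""
>         def decorator(func):
>             func.test_status = dict(status=status, host=host,
>                                     target=target, compiler=compiler,
>                                     debug_info=debug_info)
>             return func
>         return decorator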
>
> On Mon, Oct 19, 2015 at 4:35 PM Todd Fiala <todd.fiala at gmail.com> wrote:
>
>> My initial proposal was an attempt to avoid skipping them entirely on
>> our end while still getting them to generate actionable signals, without
>> conflating them with unexpected successes (which, semantically, they
>> absolutely are not).
>>
>> On Mon, Oct 19, 2015 at 4:33 PM, Todd Fiala <todd.fiala at gmail.com> wrote:
>>
>>> Nope, I have no issue with what you said.  We don't want to run them
>>> over here at all because we don't see enough useful info come out of them.
>>> You need time-series data for that to be somewhat useful, and even then it
>>> is only useful if you see a sharp change in it after a specific change.
>>>
>>> So I really don't want to be running flaky tests at all, as their signals
>>> are not useful on a per-run basis.
>>>
>>> On Mon, Oct 19, 2015 at 4:16 PM, Zachary Turner <zturner at google.com>
>>> wrote:
>>>
>>>> Don't get me wrong, I like the idea of running flakey tests a couple of
>>>> times and seeing if one passes (Chromium does this as well, so it's not
>>>> without precedent).  If I sounded harsh, it's because I *want* to be harsh
>>>> on flaky tests.  Flaky tests indicate literally the *worst* kind of bugs,
>>>> because you don't even know what kind of problems they're causing in the
>>>> wild, so by increasing the amount of pain they cause people (test suite
>>>> running longer, etc.), the hope is that it will motivate someone to fix them.
>>>>
>>>> On Mon, Oct 19, 2015 at 4:04 PM Todd Fiala <todd.fiala at gmail.com>
>>>> wrote:
>>>>
>>>>> Okay, so I'm not a fan of the flaky tests myself, nor of test suites
>>>>> taking longer to run than needed.
>>>>>
>>>>> Enrico is going to add a new 'flakey' category to the test
>>>>> categorization.
>>>>>
>>>>> Scratch all the other complexity I offered up.  What we're going to
>>>>> ask is: if a test is flakey, please add it to the 'flakey' category.  We
>>>>> won't do anything different with the category by default, so everyone will
>>>>> still get flakey tests running in the same manner they do now.  However, on
>>>>> our test runners, we will be disabling the category entirely using the
>>>>> skipCategories mechanism, since those tests are generating too much noise.
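>>>>>
>>>>> Concretely, on our runners that would look something like this (exact
>>>>> flag spelling from memory, so treat it as approximate):
>>>>>
>>>>>     python dotest.py --skip-category flakey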
>>>>>
>>>>> We may need to add a per-test-method category mechanism, since right
>>>>> now our only mechanisms for adding categories are to (1) place a dot-file
>>>>> in a directory to have everything in it tagged with a category, or (2)
>>>>> override the categorization via the TestCase getCategories() mechanism.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Mon, Oct 19, 2015 at 1:03 PM, Zachary Turner <zturner at google.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Oct 19, 2015 at 12:50 PM Todd Fiala via lldb-dev <
>>>>>> lldb-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I'd like unexpected successes (i.e. tests marked as expected
>>>>>>> failure that in fact pass) to retain the actionable meaning that something
>>>>>>> is wrong.  The wrong part is that either (1) the test now passes
>>>>>>> consistently and the author of the fix just missed updating the test
>>>>>>> definition (or perhaps was unaware of the test), or (2) the test is not
>>>>>>> completely covering the condition it is testing, and some change to the
>>>>>>> code just happened to make the test pass (due to the test not being
>>>>>>> comprehensive enough).  Either of those requires some sort of adjustment by
>>>>>>> the developers.
>>>>>>>
>>>>>> I'd add #3: the test is actually flaky but is tagged incorrectly.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> We have a category of tests known as "flaky" or "flakey" (both are
>>>>>>> valid spellings, for those who care:
>>>>>>> http://www.merriam-webster.com/dictionary/flaky, although flaky is
>>>>>>> considered the primary).  Flaky tests are tests that we can't get to pass
>>>>>>> 100% of the time.  This might be because it is extremely difficult to write
>>>>>>> the test reliably and doing so is deemed not worth the effort, or because the
>>>>>>> condition being tested just won't present itself successfully 100% of the time.
>>>>>>>
>>>>>> IMO if it's not worth the effort to write the test correctly, we
>>>>>> should delete the test.  Flaky is useful as a temporary status, but if
>>>>>> nobody ends up fixing the flakiness, I think the test should be deleted
>>>>>> (more reasons follow).
>>>>>>
>>>>>>
>>>>>>
>>>>>>> These are tests we still want to exercise, but we don't want to have
>>>>>>> them start generating test failures if they don't pass 100% of the time.
>>>>>>> Currently the flaky test mechanism requires a test to pass one out of two
>>>>>>> runs.  That is okay for a test that exhibits a slim degree of flakiness.
>>>>>>> For others, that is not a large enough sample of runs to elicit a
>>>>>>> successful result.  Those tests get marked as XFAIL, and generate a
>>>>>>> non-actionable "unexpected success" result when they do happen to pass.
>>>>>>>
>>>>>>> GOAL
>>>>>>>
>>>>>>> * Enhance the expectedFlakey* test decorators.  Allow specification of
>>>>>>> the maximum number of times a flaky test should be run in the expectation
>>>>>>> that it passes at least once.  Call that MAX_RUNS.
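>>>>>>>
>>>>>>> (So, something like @expectedFlakeyLinux(max_runs=5) -- the parameter
>>>>>>> name here is illustrative, not an existing one.)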
>>>>>>>
>>>>>> I think it's worth considering whether it's a good idea to include the
>>>>>> date at which a test was declared flakey.  After a certain amount of time
>>>>>> has passed, if it's still flakey it can be relegated to hard failures.  I
>>>>>> don't think flakey should be a permanent state.
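>>>>>>
>>>>>> As a sketch (the decorator parameters are made up for illustration):
>>>>>>
>>>>>>     @expectedFlakey(since="2015-10-19", max_age_days=90)
>>>>>>     def test_sometimes_times_out(self):
>>>>>>         ...
>>>>>>
>>>>>> where the decorator demotes the test to a hard failure once more than
>>>>>> max_age_days have elapsed since the declared date.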
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> * When running a flaky test, run it up to MAX_RUNS times.
>>>>>>> The first time it passes, mark it as a successful test completion.  The
>>>>>>> test event system will be given the number of times it was run before
>>>>>>> passing.  Whether we consume this info or not is TBD (and falls into the
>>>>>>> purview of the test results formatter).
>>>>>>>
>>>>>>
>>>>>>> * If the test does not pass within MAX_RUNS runs, mark it as a flaky
>>>>>>> failure.  For purposes of the standard output, this can look like "FAIL:
>>>>>>> (flaky)" or something similar, so fail scanners still see it.  (Note it's
>>>>>>> highly likely I'll do the normal output counts with the TestResults
>>>>>>> formatter-based output at the same time, so we get accurate test method
>>>>>>> counts and the like.)
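>>>>>>>
>>>>>>> A rough sketch of the retry logic (the reporting conventions are
>>>>>>> illustrative, not the actual dotest internals):
>>>>>>>
>>>>>>>     def run_flaky_test(run_once, max_runs):
>>>>>>>         """run_once() runs the test once and returns True on pass."""
>>>>>>>         for attempt in range(1, max_runs + 1):
>>>>>>>             if run_once():
>>>>>>>                 return ("pass", attempt)    # report runs-until-pass
>>>>>>>         return ("flaky-fail", max_runs)     # surfaces as FAIL: (flaky)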
>>>>>>>
>>>>>> The concern I have here (and the reason I would like to delete flakey
>>>>>> tests if the flakiness isn't removed after a certain amount of time) is
>>>>>> that some of our tests are slow.  Repeating them many times is going to
>>>>>> have an impact on how long the test suite takes to run.  Its run time has
>>>>>> already tripled over the past 3 weeks, and I think we need to be careful to
>>>>>> keep out things that have the potential to lead to significant slowness of
>>>>>> the test suite runner.
>>>>>>
>>>>>>


-- 
-Todd