[lldb-dev] proposal for reworked flaky test category

Zachary Turner via lldb-dev lldb-dev at lists.llvm.org
Mon Oct 19 16:40:22 PDT 2015


Yea, I definitely agree with you there.

Is this going to end up with an @expectedFlakeyWindows,
@expectedFlakeyLinux, @expectedFlakeyDarwin, @expectedFlakeyAndroid,
@expectedFlakeyFreeBSD?

It's starting to get a little crazy; at some point I think we just need
something we can use like this:

@test_status(status=flaky, host=[win, linux, android, darwin, bsd],
target=[win, linux, android, darwin, bsd], compiler=[gcc, clang],
debug_info=[dsym, dwarf, dwo])
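
A rough sketch of what a catch-all decorator like that could look like in
Python (everything here is illustrative -- the Config stand-in and the
attribute it sets are made up, not the actual lldb test suite API):

    from collections import namedtuple

    # Stand-in for whatever the harness knows about the current run; the
    # real test suite would supply these values.
    Config = namedtuple("Config", "host target compiler debug_info")
    CURRENT = Config(host="linux", target="linux", compiler="clang",
                     debug_info="dwarf")

    def test_status(status, host=None, target=None, compiler=None,
                    debug_info=None):
        """Apply `status` ('flaky', 'xfail', 'skip', ...) only when the
        current configuration matches every filter that was given."""
        def decorator(func):
            matches = all([
                host is None or CURRENT.host in host,
                target is None or CURRENT.target in target,
                compiler is None or CURRENT.compiler in compiler,
                debug_info is None or CURRENT.debug_info in debug_info,
            ])
            if matches:
                # The runner / results formatter would read this attribute
                # to decide how to categorize the outcome.
                func.__lldb_test_status__ = status
            return func
        return decorator

    class ExampleTest(object):
        @test_status(status="flaky", host=["win", "linux"],
                     compiler=["clang"], debug_info=["dwarf", "dwo"])
        def test_something(self):
            pass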

On Mon, Oct 19, 2015 at 4:35 PM Todd Fiala <todd.fiala at gmail.com> wrote:

> My initial proposal was an attempt to not skip running them entirely on
> our end while still getting them to generate actionable signals, without
> conflating them with unexpected successes (which, semantically, they
> absolutely are not).
>
> On Mon, Oct 19, 2015 at 4:33 PM, Todd Fiala <todd.fiala at gmail.com> wrote:
>
>> Nope, I have no issue with what you said.  We don't want to run them over
>> here at all because we don't see enough useful info come out of them.  You
>> need time-series data for that to be somewhat useful, and even then it is
>> only useful if you see a sharp change in it after a specific change.
>>
>> So I really don't want to be running flaky tests at all, as their signals
>> are not useful on a per-run basis.
>>
>> On Mon, Oct 19, 2015 at 4:16 PM, Zachary Turner <zturner at google.com>
>> wrote:
>>
>>> Don't get me wrong, I like the idea of running flakey tests a couple of
>>> times and seeing if one passes (Chromium does this as well, so it's not
>>> without precedent).  If I sounded harsh, it's because I *want* to be harsh
>>> on flaky tests.  Flaky tests indicate literally the *worst* kind of bugs,
>>> because you don't even know what kind of problems they're causing in the
>>> wild, so by increasing the amount of pain they cause people (the test
>>> suite running longer, etc.), the hope is that it will motivate someone to
>>> fix them.
>>>
>>> On Mon, Oct 19, 2015 at 4:04 PM Todd Fiala <todd.fiala at gmail.com> wrote:
>>>
>>>> Okay, so I'm not a fan of the flaky tests myself, nor of test suites
>>>> taking longer to run than needed.
>>>>
>>>> Enrico is going to add a new 'flakey' category to the test
>>>> categorization.
>>>>
>>>> Scratch all the other complexity I offered up.  What we're going to ask
>>>> is: if a test is flakey, please add it to the 'flakey' category.  We
>>>> won't do anything different with the category by default, so everyone
>>>> will still get flakey tests running the same way they do now.  However,
>>>> on our test runners we will be disabling the category entirely using the
>>>> skipCategories mechanism, since those tests are generating too much noise.
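>>>>
>>>> On the bot side that will look something like this (assuming the
>>>> category gets fed into dotest.py's skip-category handling; the exact
>>>> flag spelling matters less than the fact that it populates
>>>> skipCategories):
>>>>
>>>>   python dotest.py --skip-category flakey <usual bot arguments>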
>>>>
>>>> We may need to add a per-test-method category mechanism, since right now
>>>> our only mechanisms for adding categories are to (1) drop a dot-file into
>>>> a directory to have everything in it tagged with a category, or (2)
>>>> override the TestCase getCategories() method.
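>>>>
>>>> For reference, option (2) looks roughly like this today, which is why it
>>>> is too coarse -- it tags every method in the class, not a single test:
>>>>
>>>>   from lldbtest import TestBase   # lldb test framework
>>>>
>>>>   class FlakeyExampleTest(TestBase):
>>>>       def getCategories(self):
>>>>           # Every test method in this class gets these categories.
>>>>           return ['flakey']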
>>>>
>>>> -Todd
>>>>
>>>> On Mon, Oct 19, 2015 at 1:03 PM, Zachary Turner <zturner at google.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 19, 2015 at 12:50 PM Todd Fiala via lldb-dev <
>>>>> lldb-dev at lists.llvm.org> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like unexpected successes (i.e. tests marked as expected
>>>>>> failure that in fact pass) to retain the actionable meaning that something
>>>>>> is wrong.  The wrong part is that either (1) the test now passes
>>>>>> consistently and the author of the fix just missed updating the test
>>>>>> definition (or perhaps was unaware of the test), or (2) the test is not
>>>>>> completely covering the condition it is testing, and some change to the
>>>>>> code just happened to make the test pass (because the test is not
>>>>>> comprehensive enough).  Either of those requires some sort of adjustment
>>>>>> by the developers.
>>>>>>
>>>>> I'd add #3: the test is actually flaky but is tagged incorrectly.
>>>>>
>>>>>
>>>>>>
>>>>>> We have a category of test known as "flaky" or "flakey" (both are
>>>>>> valid spellings, for those who care:
>>>>>> http://www.merriam-webster.com/dictionary/flaky, although flaky is
>>>>>> considered the primary).  Flaky tests are tests that we can't get to pass
>>>>>> 100% of the time.  This might be because it is extremely difficult to
>>>>>> write the test robustly and deemed not worth the effort, or because the
>>>>>> condition being tested is just not going to present itself successfully
>>>>>> 100% of the time.
>>>>>>
>>>>> IMO if it's not worth the effort to write the test correctly, we
>>>>> should delete the test.  Flaky is useful as a temporary status, but if
>>>>> nobody ends up fixing the flakiness, I think the test should be deleted
>>>>> (more reasons follow).
>>>>>
>>>>>
>>>>>
>>>>>> These are tests we still want to exercise, but we don't want them to
>>>>>> start generating test failures if they don't pass 100% of the time.
>>>>>> Currently the flaky test mechanism requires a test to pass at least once
>>>>>> in two runs.  That is okay for a test that exhibits a slim degree of
>>>>>> flakiness.  For others, that is not a large enough sample of runs to
>>>>>> elicit a successful result.  Those tests get marked as XFAIL, and
>>>>>> generate a non-actionable "unexpected success" result when they do
>>>>>> happen to pass.
>>>>>>
>>>>>> GOAL
>>>>>>
>>>>>> * Enhance the expectedFlakey* test decorators.  Allow specifying the
>>>>>> maximum number of times a flaky test should be run while waiting for it
>>>>>> to pass at least once.  Call that MAX_RUNS.
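>>>>>>
>>>>>> Decorator-side sketch of that (max_runs is the new part; the attribute
>>>>>> names here are just illustrative -- the runner would read them):
>>>>>>
>>>>>>   def expectedFlakey(bugnumber=None, max_runs=2):
>>>>>>       def decorator(func):
>>>>>>           # Tell the runner how many attempts the test gets before it
>>>>>>           # counts as a flaky fail.
>>>>>>           func.__flakey_max_runs__ = max_runs
>>>>>>           func.__flakey_bugnumber__ = bugnumber
>>>>>>           return func
>>>>>>       return decorator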
>>>>>>
>>>>> I think it's worth considering whether it's a good idea to include the
>>>>> date at which they were declared flakey.  After a certain amount of time
>>>>> has passed, if they're still flakey they can be relegated to hard
>>>>> failures.  I don't think flakey should be a permanent state.
>>>>>
>>>>>
>>>>>>
>>>>>> * When running a flaky test, run it up to MAX_RUNS times.  The first
>>>>>> time it passes, mark it as a successful test completion.  The test event
>>>>>> system will be given the number of times it was run before passing.
>>>>>> Whether we consume this info or not is TBD (and falls into the purview
>>>>>> of the test results formatter).
>>>>>>
>>>>>
>>>>>> * If the test does not pass within MAX_RUNS attempts, mark it as a
>>>>>> flaky fail.  For purposes of the standard output, this can look like
>>>>>> FAIL: (flaky) or something similar so fail scanners still see it.  (Note
>>>>>> it's highly likely I'll do the normal output counts with the TestResults
>>>>>> formatter-based output at the same time, so we get accurate test method
>>>>>> counts and the like.)
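>>>>>>
>>>>>> Runner-side, the intent is roughly this (sketch only; real reporting
>>>>>> would go through the test event system / results formatter rather than
>>>>>> print):
>>>>>>
>>>>>>   def run_flaky(run_once, max_runs):
>>>>>>       """run_once is a callable returning True on a pass.  The first
>>>>>>       pass wins; otherwise the test is reported as a flaky fail."""
>>>>>>       for attempt in range(1, max_runs + 1):
>>>>>>           if run_once():
>>>>>>               print("PASS (flaky, passed on attempt %d of %d)"
>>>>>>                     % (attempt, max_runs))
>>>>>>               return True
>>>>>>       print("FAIL: (flaky)")
>>>>>>       return False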
>>>>>>
>>>>> The concern I have here (and the reason I would like to delete flakey
>>>>> tests if the flakiness isn't removed after a certain amount of time) is
>>>>> that some of our tests are slow.  Repeating them many times is going to
>>>>> have an impact on how long the test suite takes to run.  That time has
>>>>> already tripled over the past 3 weeks, and I think we need to be careful
>>>>> to keep out things that have the potential to significantly slow down the
>>>>> test suite runner.
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> -Todd
>>>>
>>>
>>
>>
>> --
>> -Todd
>>
>
>
>
> --
> -Todd
>