[lldb-dev] TestRaise.py test_restart_bug flakey stats

Mon Oct 19 04:30:22 PDT 2015

The expected flakey marking works a bit differently than you described:
* Run the test
* If it passes, it is recorded as a successful test and we are done
* Run the test again
* If it passes the 2nd time, then record it as an expected failure (IMO
expected flakey would be a better result, but we don't have that category)
* If it fails 2 times in a row, then record it as a failure, because a flakey
test should pass at least once in every 2 runs (this means we need a ~95%
success rate to keep the build bot green most of the time). If it isn't
passing often enough for that, then it should be marked as an expected failure.
This is done this way to detect the case where a flakey test gets broken
completely by a new change.
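
The steps above can be sketched roughly as follows. This is only an illustration of the two-attempt rule, not the actual decorator code from the test suite; the function name `run_expected_flakey` and the result strings are made up for the example.

```python
from typing import Callable


def run_expected_flakey(test: Callable[[], bool]) -> str:
    """Apply the two-attempt rule and return the result that gets recorded.

    `test` is a hypothetical callable that returns True on pass, False on fail.
    """
    # First attempt: a pass is recorded as a plain success and we stop.
    if test():
        return "success"
    # Second attempt: a pass after one failure is recorded as expected failure.
    if test():
        return "expected failure"
    # Two failures in a row are recorded as a real failure.
    return "failure"


# Usage: a stand-in test that fails once and then passes.
outcomes = iter([False, True])
print(run_expected_flakey(lambda: next(outcomes)))  # prints "expected failure"
```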

I checked some stats for TestRaise on the build bot, and under the current
definition of expected flakey we shouldn't mark it as flakey, because it
will often fail 2 times in a row (its passing rate is ~50%), which will be
reported as a failure, making the build bot red.
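
As a back-of-the-envelope check, assuming the two attempts are independent and have the same pass rate (an assumption for illustration, not something measured on the bot), the chance that a flakey-marked test gets reported as a failure is the chance that both attempts fail. The helper name `prob_reported_failure` is hypothetical.

```python
def prob_reported_failure(pass_rate: float, attempts: int = 2) -> float:
    """Probability that every attempt fails, assuming independent attempts."""
    return (1.0 - pass_rate) ** attempts


# Compare a ~50% pass rate with a ~95% pass rate under the two-attempt rule.
for rate in (0.5, 0.95):
    failure = prob_reported_failure(rate)
    print(f"pass rate {rate:.0%}: reported as failure in {failure:.2%} of runs")
```

Under that assumption, a ~50% pass rate turns the bot red in roughly one run out of four for this test alone.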

I will send you the full stats from the last 100 builds in a separate
off-list mail, as it is too big for the mailing list. If somebody else is
interested in it, then let me know.

Tamas

On Sun, Oct 18, 2015 at 2:18 AM Todd Fiala <todd.fiala at gmail.com> wrote:

> Nope, no good either when I limit the flakey to DWO.
>
> So perhaps I don't understand how the flakey marking works.  I thought it
> meant:
> * run the test.
> * If it passes, it goes as a successful test.  Then we're done.
> * run the test again.
> * If it passes, then we're done and mark it a successful test.  If it
> fails, then mark it an expected failure.
>
> But that's definitely not the behavior I'm seeing, as a flakey marking in
> the above scheme should never produce a failing test.
>
> I'll have to revisit the flakey test marking to see what it's really doing
> since my understanding is clearly flawed!
>
> On Sat, Oct 17, 2015 at 5:57 PM, Todd Fiala <todd.fiala at gmail.com> wrote:
>
>> Hmm, the flakey behavior may be specific to dwo.  Testing it locally as
>> unconditionally flaky on Linux is failing on dwarf.  All the ones I see
>> succeed are dwo.  I wouldn't expect a diff there but that seems to be the
>> case.
>>
>> So, the request still stands but I won't be surprised if we find that dwo
>> sometimes passes while dwarf doesn't (or at least not enough to get through
>> the flakey setting).
>>
>> On Sat, Oct 17, 2015 at 4:57 PM, Todd Fiala <todd.fiala at gmail.com> wrote:
>>
>>> Hi Tamas,
>>>
>>> I think you grabbed me stats on failing tests in the past.  Can you dig
>>> up the failure rate for TestRaise.py's test_restart_bug() variants on
>>> Ubuntu 14.04 x86_64?  I'd like to mark it as flaky on Linux, since it is
>>> passing most of the time over here.  But I want to see if that's valid
>>> across all Ubuntu 14.04 x86_64.  (If it is passing some of the time, I'd
>>> prefer marking it flakey so that we don't see unexpected successes).
>>>
>>> Thanks!
>>>
>>> --
>>> -Todd
>>>
>>
>>
>>
>> --
>> -Todd
>>
>
>
>
> --
> -Todd
>