[lldb-dev] TestRaise.py test_restart_bug flakey stats

Todd Fiala via lldb-dev lldb-dev at lists.llvm.org
Mon Oct 19 08:47:46 PDT 2015


Thanks, Tamas.

On Mon, Oct 19, 2015 at 4:30 AM, Tamas Berghammer <tberghammer at google.com>
wrote:

> The expected flakey works a bit differently than you described:
> * Run the test
> * If it passes, it goes as a successful test and we are done
> * Run the test again
> * If it passes the 2nd time then record it as expected failure (IMO
> expected flakey would be a better result, but we don't have that category)
>

I agree.  I plan to add that category (I think I even have a bugzilla bug I
created for myself on that).  The intent would be to have a "pass flakey"
and "fail flakey" end state for a run.  How many times to run and
entry/exit from run TBD.  If we mark it right, and we know how many times
we should be able to run it to have a single pass, we could really do this
right.


> * If it fails 2 times in a row then record it as a failure because a
> flakey test should pass at least once in every 2 runs (it means we need ~95%
> success rate to keep the build bot green most of the time). If it isn't
> passing often enough for that then it should be marked as expected failure.
> It is done this way to detect the case when a flakey test gets broken
> completely by a new change.
>
>
I see.  Thanks.  That totally explains what I was seeing.
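
For anyone following along, the steps described above can be sketched
roughly like this.  This is only an illustration of the described
behavior, not the actual decorator code in the test suite; the names
Outcome and run_expected_flakey are made up for this example.

```python
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    EXPECTED_FAILURE = "expected failure"
    FAILURE = "failure"


def run_expected_flakey(test_func):
    """Hypothetical helper: run test_func at most twice and classify it.

    Assumption: a test "fails" by raising an exception.
    """
    def attempt():
        try:
            test_func()
            return True
        except Exception:
            return False

    # First run: a pass is recorded as a normal success and we are done.
    if attempt():
        return Outcome.PASS
    # Second run: a pass here is recorded as an expected failure, since
    # there is no "expected flakey" category today.
    if attempt():
        return Outcome.EXPECTED_FAILURE
    # Two failures in a row: record a real failure, so that a flakey test
    # that gets completely broken is still caught.
    return Outcome.FAILURE
```

Note that, unlike the scheme I had in mind, a pass on the second run is
not counted as a success, and two failures in a row do fail the test.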

Internally I have been using "unexpected success" as an actionable item,
failing our testbots.  The idea is that if something is supposed to
fail and it is now passing, that indicates either (1) somebody fixed it
with a change and didn't update the test as an oversight, or (2) somebody
"fixed" it with a change that shouldn't have fixed it, which means the test
logic is not testing something properly and the test should be updated.

That is kind of stymied by this type of test result, as unexpected success
becomes a "sometimes meaningless" signal.  And anything that is sometimes
meaningless can make the meaningful ones get overlooked.

So I would actively like to move away from unexpected success containing a
sometimes useful / sometimes not useful semantic.  We should tackle that
soon.



> I checked some stats for TestRaise on the build bot and in the current
> definition of expected flakey we shouldn't mark it as flakey because it
> will often fail 2 times in a row (its passing rate is ~50%) which will be
> reported as a failure making the build bot red.
>
>
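A quick back-of-the-envelope check of that, assuming each run passes
independently with the same probability (real flakey tests may not behave
that way).  The function name is made up for this example.

```python
def double_failure_probability(pass_rate):
    """Chance that a test fails two runs in a row.

    Assumption: runs are independent and pass with probability pass_rate.
    """
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be between 0 and 1")
    return (1.0 - pass_rate) ** 2


# With the ~50% pass rate seen for TestRaise, about one run in four would
# still be reported as a failure even with the flakey marking.
print(double_failure_probability(0.5))  # prints 0.25
```

So at ~50% the flakey marking would still turn the bot red fairly often,
which matches what I saw locally.
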
> I will send you the full stats from the last 100 builds in a separate
> off-list mail as it is too big for the mailing list. If somebody else is
> interested in it then let me know.
>
>
Thanks, Tamas!


> Tamas
>
> On Sun, Oct 18, 2015 at 2:18 AM Todd Fiala <todd.fiala at gmail.com> wrote:
>
>> Nope, no good either when I limit the flakey to DWO.
>>
>> So perhaps I don't understand how the flakey marking works.  I thought it
>> meant:
>> * run the test.
>> * If it passes, it goes as a successful test.  Then we're done.
>> * run the test again.
>> * If it passes, then we're done and mark it a successful test.  If it
>> fails, then mark it an expected failure.
>>
>> But that's definitely not the behavior I'm seeing, as a flakey marking in
>> the above scheme should never produce a failing test.
>>
>> I'll have to revisit the flakey test marking to see what it's really
>> doing since my understanding is clearly flawed!
>>
>> On Sat, Oct 17, 2015 at 5:57 PM, Todd Fiala <todd.fiala at gmail.com> wrote:
>>
>>> Hmm, the flakey behavior may be specific to dwo.  Testing it locally as
>>> unconditionally flaky on Linux is failing on dwarf.  All the ones I see
>>> succeed are dwo.  I wouldn't expect a diff there but that seems to be the
>>> case.
>>>
>>> So, the request still stands but I won't be surprised if we find that
>>> dwo sometimes passes while dwarf doesn't (or at least not enough to get
>>> through the flakey setting).
>>>
>>> On Sat, Oct 17, 2015 at 4:57 PM, Todd Fiala <todd.fiala at gmail.com>
>>> wrote:
>>>
>>>> Hi Tamas,
>>>>
>>>> I think you grabbed me stats on failing tests in the past.  Can you dig
>>>> up the failure rate for TestRaise.py's test_restart_bug() variants on
>>>> Ubuntu 14.04 x86_64?  I'd like to mark it as flaky on Linux, since it is
>>>> passing most of the time over here.  But I want to see if that's valid
>>>> across all Ubuntu 14.04 x86_64.  (If it is passing some of the time, I'd
>>>> prefer marking it flakey so that we don't see unexpected successes).
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>> -Todd
>>>>
>>>
>>>
>>>
>>> --
>>> -Todd
>>>
>>
>>
>>
>> --
>> -Todd
>>
>


-- 
-Todd