[llvm-dev] False positive notifications around commit notifications

Mon Oct 11 10:56:44 PDT 2021

Here's a fun one: https://lab.llvm.org/buildbot/#/builders/164/builds/3428
- a buildbot failure with a single blame (me) - but I hadn't committed in
the last few days, so I was confused. Turns out it's from a change committed
3 months ago - and the failure is a timeout.

Given the number of buildbot timeout false positives, I honestly wouldn't
be averse to saying timeouts shouldn't produce fail-mail & are the
responsibility of buildbot owners to triage. I realize we can actually
submit code that leads to timeouts, but on balance that seems rare compared
to the number of times it's a buildbot configuration issue instead. (though
open to debate on that for sure)

On Wed, Oct 6, 2021 at 4:08 AM Nemanja Ivanovic via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I wonder if it would be possible to make some recommendations for
> improvements based on data rather than our collective anecdotal experience.
> Much as anyone else, I feel that the vast majority of the failure emails I
> get are not related, but I would have a lot of trouble quantifying it any
> better than a "gut feeling".
>
> Would it be possible to somehow acquire historical data from buildbots to
> help identify things that can be improved? Perhaps:
> - Bot failures where none of the commits were reverted before the bot went
> back to green
> - For those failures, collect the test cases that failed - those might be
> flaky test cases if they show up frequently and/or on multiple bots
> - For bots that have many such instances (especially with different test
> cases every time), perhaps the bot itself is somehow flaky
>
> This is definitely an annoying problem that has significant consequences
> (real failures being missed due to many false failures), but it is a
> difficult problem to solve.
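
To make the three heuristics above concrete, here is a minimal Python
sketch. Everything in it is an assumption for illustration: the Build
record, its field names, and the `reverted` set of commit hashes are
hypothetical and would have to be filled in from whatever the buildbot
database and the git history actually provide. Note that a fix-forward
commit looks the same as a flaky failure to this heuristic, so the output
is only a list of candidates to look at, not a verdict.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Build:
    # Hypothetical record; field names are assumptions, not buildbot's schema.
    bot: str
    number: int
    passed: bool
    commits: list = field(default_factory=list)       # hashes on the blame list
    failed_tests: list = field(default_factory=list)  # empty when the build passed


def suspect_streaks(builds, reverted):
    """Return (bot, streak) pairs for failing streaks that went back to
    green without any blamed commit appearing in `reverted`."""
    by_bot = defaultdict(list)
    for b in builds:
        by_bot[b.bot].append(b)

    suspects = []
    for bot, history in by_bot.items():
        streak = []
        for b in sorted(history, key=lambda b: b.number):
            if not b.passed:
                streak.append(b)
                continue
            # Green build: close out any open failing streak.
            if streak:
                blamed = {c for f in streak for c in f.commits}
                if not blamed & reverted:
                    suspects.append((bot, streak))
                streak = []
    return suspects


def summarize(suspects):
    """Group suspect failures by test case and by bot."""
    bots_per_test = defaultdict(set)    # test name -> bots where it failed
    streaks_per_bot = defaultdict(int)  # bot -> number of suspect streaks
    for bot, streak in suspects:
        streaks_per_bot[bot] += 1
        for b in streak:
            for t in b.failed_tests:
                bots_per_test[t].add(bot)
    return bots_per_test, streaks_per_bot
```

Tests that show up in bots_per_test on several bots would be candidates
for flaky tests; bots with a high count in streaks_per_bot, especially
with different tests each time, would be candidates for flaky bots.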
>
> On Wed, Sep 22, 2021 at 5:50 AM Martin Storsjö via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> On Wed, 22 Sep 2021, Florian Hahn via llvm-dev wrote:
>>
>> > Thanks for raising this issue! My experience matches what you are
>> > describing. The false positive rate for me seems to be at least 10
>> > false positives due to flakiness to 1 real failure.
>> > I think it would be good to have some sort of policy spelling out the
>> > requirements for having notification enabled for a buildbot, with a
>> > process that makes it easy to disable flaky bots until the owners can
>> > make them more stable. It would be good if notifications could be
>> > disabled without requiring contacting/interventions from individual
>> > owners, but I am not sure if that’s possible with buildbot.
>>
>> Another aspect is that some tests can be flakey - they might work
>> seemingly fine in local testing but start showing up as timeouts/spurious
>> failures when run in a CI/buildbot setting. And due to their flakiness,
>> it's not evident when the breakage is introduced, but over time, such
>> flakey tests/setups do add up, to the situation we have today.
>>
>> // Martin
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>