[llvm-dev] False positive notifications around commit notifications

Fri Sep 10 11:36:56 PDT 2021

On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

> I've been noticing a trend where there are more and more false positive
> email notifications sent out on valid commits.  This is getting really
> problematic as real signal is being lost in the noise.  I've had several
> cases in the last few weeks where I did not see a "real" failure notice
> because it was buried in a bunch of false positives.
>
> Let me run through a few sources of what I consider false positives, and
> suggest a couple things we could do to clean these up.  Note that the
> recommendations here are entirely independent and we can adopt any subset.
>
> *Slow Try Bots*
>
> ex: "This revision was landed with ongoing or failed builds." on
> https://reviews.llvm.org/D109091
>
> Someone - I'm not really sure who - enabled builds for all reviews, and
> this notice on landed commits.  Given it's utterly routine to make a last
> few style fixes before landing an LGTMed change
>

I do such "few style fixes", but I don't re-upload a revision before
landing, so I don't see this "false positive" in general.
What I frequently see is that the pre-merge config is broken for some other
reason, and that's quite annoying. One aspect of the issue is that there is
no buildbot tracking the pre-merge configuration so it can be broken
without notification (there is a buildkite job tracking it, but buildkite
does not support blamelist notifications).

> , I consider this notice complete noise.  In practice, almost every review gets
> tagged this way.  To be clear, there is value in being told about changes
> which don't build.  The false positive part is only around the "ongoing"
> builds.
>
> Recommendation: Disable this message for the "ongoing" build case, and if
> we can't, disable them entirely.
>
> *Flaky Builders*
>
> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>
> We have many build bots which are not entirely stable.  It's gotten to the
> point where I *expect* failure notifications on literally every change I
> land.  I've been trying to reach out to individual build bot owners to get
> issues resolved, and to their credit, most owners have been very
> responsive.  However, we have enough builders that the situation isn't
> getting meaningfully better.
>
> Recommendation: Introduce specific "test commits" whose only purpose is to
> run the CI infrastructure.  Any builder which notifies of failure on such a
> commit (and only said commit) is disabled without discussion until human
> action is taken by the bot owner to re-enable.  The idea here is to a)
> automate the process, and b) shift the responsibility of action to the bot
> owner for any flaky bot.
>
> Note: By "disabled", I specifically mean that *notification* is disabled.
> Leaving it in the waterfall view is fine, as long as we're not sending out
> email about it.
>
> Aside: It's really tempting to attempt to separate builders which are
> "still failing" (e.g. a rare configuration which has been broken for a few
> days) from "flaky" ones.  I'd argue any bot notifying on a "still failing"
> case is buggy, and thus it's fine to treat them the same as a "flaky" bot.
>
>
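
To make the test-commit rule quoted above concrete, here is a minimal sketch of the muting logic. It is not tied to any real buildbot API: `Builder`, `BuildResult`, and `mute_flaky_builders` are hypothetical names, and a build is assumed to expose its blamelist as a list of commit ids.

```python
from dataclasses import dataclass


@dataclass
class Builder:
    name: str
    notify: bool = True  # whether failure emails are sent for this builder


@dataclass
class BuildResult:
    builder: str     # name of the builder that ran the build
    blamelist: list  # commit ids covered by this build
    failed: bool


def mute_flaky_builders(builders, results, test_commits):
    """Mute every builder that failed a build containing only a test commit.

    `builders` maps builder name -> Builder; `test_commits` is the set of
    commit ids known to be CI-only test commits (hypothetical bookkeeping).
    Returns the names of the builders muted by this call.
    """
    muted = []
    for result in results:
        only_test_commit = (
            len(result.blamelist) == 1 and result.blamelist[0] in test_commits
        )
        builder = builders[result.builder]
        if result.failed and only_test_commit and builder.notify:
            builder.notify = False  # stays in the waterfall, stops emailing
            muted.append(builder.name)
    return muted
```

Re-enabling is deliberately left out: per the proposal, that step requires human action by the bot owner.
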
> *Slow Builders and Redundant Notices*
>
> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>
> Occasionally, we have a bad commit land which breaks every (or nearly
> every) builder.  That happens.  If you happen to land a change just before
> or after it, you then get on the blame list for every slow running builder
> we have (since they tend to have large commit windows) if they happen to
> cycle before the fix is committed.  This is particularly annoying since the
> root issue is likely fixed quickly, but due to cycle times on the builders,
> you may be getting emails for 24 hours to come.
>
> Recommendation: Introduce a new requirement that "slow" builders (say cycle
> time of > 30 minutes) either a) have a maximum commit window of ~15
> commits, or b) use a staged builder model.  Personally, I'd prefer the
> staged model, but the max commit window at least helps to limit the
> damage.
>
> By "staged builder model", I mean that slow builders only build points in
> the history which have already been successfully built by one of the fast
> builders.  This eliminates redundant build failures, at the cost of
> delaying the slow builder slightly.  As long as the slow builder uses the
> "last good commit" as opposed to waiting until the current fast builder
> finishes, the delay should be very minimal for most commits.
>

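The "last good commit" selection described in the quoted text can be sketched as below. This is only an illustration under assumptions: `next_commit_for_slow_builder` is a hypothetical helper, history is taken to be a linear list of commit ids ordered oldest to newest, and the set of commits that a fast builder has built successfully is assumed to be available.

```python
def next_commit_for_slow_builder(history, fast_green, last_built):
    """Pick the commit a slow builder should build next in a staged model.

    history:    commit ids, oldest to newest (assumed linear)
    fast_green: set of commit ids a fast builder already built successfully
    last_built: commit id the slow builder built last, or None

    Returns the newest fast-green commit after `last_built`, or None if
    there is nothing new that a fast builder has validated yet.
    """
    start = history.index(last_built) + 1 if last_built is not None else 0
    for commit in reversed(history[start:]):
        if commit in fast_green:
            return commit
    return None
```

Because the slow builder never waits for an in-flight fast build, it simply picks up whatever was last known good, which is where the "very minimal" delay in the proposal comes from.
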
Does buildbot support staged builders? That would really be ideal indeed!
If we could also disable notification to the blamelist when it is larger
than 5, that'd be great!
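
As a rough illustration of the blamelist cap (not an existing buildbot option as far as this sketch assumes), the rule could look like the following. `recipients_for_failure` is a hypothetical helper, and the blamelist size is assumed to be counted in commits rather than authors.

```python
MAX_BLAMELIST = 5  # threshold suggested above; an assumption, not a buildbot default


def recipients_for_failure(blamelist, max_size=MAX_BLAMELIST):
    """Return the authors to email for a failed build, or [] if over the cap.

    blamelist: list of (commit_id, author_email) pairs covered by the build.
    Authors are de-duplicated while preserving first-seen order.
    """
    if len(blamelist) > max_size:
        return []  # too many commits to assign blame meaningfully
    authors = [author for _, author in blamelist]
    return list(dict.fromkeys(authors))
```
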

Cheers,

-- 
Mehdi

> Philip
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>