[llvm-dev] False positive notifications around commit notifications

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 10 14:02:58 PDT 2021


The folks on the Infrastructure Working Group:
https://foundation.llvm.org/docs/infrastructure-wg/ might have some context
on this (I think Christian Kuhnel is part of that & he's on this list).

On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <
llvm-dev at lists.llvm.org> wrote:

>
> On 9/10/21 11:36 AM, Mehdi AMINI wrote:
>
>
>
> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> I've been noticing a trend where there is more and more false positive
>> email notifications sent out on valid commits.  This is getting really
>> problematic as real signal is being lost in the noise.  I've had several
>> cases in the last few weeks where I did not see a "real" failure notice
>> because it was buried in a bunch of false positives.
>>
>> Let me run through a few sources of what I consider false positives, and
>> suggest a couple things we could do to clean these up.  Note that the
>> recommendations here are entirely independent and we can adopt any subset.
>>
>> *Slow Try Bots*
>>
>> ex: "This revision was landed with ongoing or failed builds." on
>> https://reviews.llvm.org/D109091
>>
>> Someone - I'm not really sure who - enabled builds for all reviews, and
>> this notice on landed commits.  Given it's utterly routine to make a last
>> few style fixes before landing an LGTMed change
>>
>
> I do such "few style fixes", but I don't re-upload a revision before
> landing, so I don't see this "false positive" in general.
>
> I don't explicit upload the final patch either, but something in the close
> automation does.
>
> What I frequently see is that the pre-merge config is broken for some
> other reason, and that's quite annoying. One aspect of the issue is that
> the is no buildbot tracking the pre-merge configuration so it can be broken
> without notification (there is a buildkite job tracking it, but buildkite
> does not support blamelist notifications).
>
> Hm, maybe I misinterpreted the cause of these entirely?  Your explanation
> sounds plausible as well.
>
> If your explanation is correct, that would lean strongly to the "just
> disable" option.
>
> Do you know who to contact about this?  (i.e. Who owns the automation
> here?  Or where is the appropriate code to adjust?)
>
Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it was
setup by https://reviews.llvm.org/p/goncharov/ (added them to the "to" line
on this email) Also looks like Kuhnel ( https://reviews.llvm.org/H576 ) has
sometthing to do with it, so I've added them too.

> , I consider this notice complete noise.  In practice, almost review gets
>> tagged this way.  To be clear, there is value in being told about changes
>> which don't build.  The false positive part is only around the "ongoing"
>> builds.
>>
>> Recommendation: Disable this message for the "ongoing" build case, and if
>> we can't, disable them entirely.
>>
>> *Flaky Builders*
>>
>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>>
>> We have many build bots which are not entirely stable.  It's gotten to
>> the point where I *expect* failure notifications on literally every change
>> I land.  I've been trying to reach out to individual build bot owners to
>> get issues resolved, and to their credit, most owners have been very
>> responsive.  However, we have enough builders that the situation isn't
>> getting meaningful better.
>>
>> Recommendation: Introduce specific "test commits" whose only purpose is
>> to run the CI infrastructure.  Any builder which notifies of failure on
>> such a commit (and only said commit) is disabled without discussion until
>> human action is taken by the bot owner to re-enable.  The idea here is to
>> a) automate the process, and b) shift the responsibility of action to the
>> bot owner for any flaky bot.
>>
>> Note: By "disabled", I specifically mean that *notification* is
>> disabled.  Leaving it in the waterfall view is fine, as long as we're not
>> sending out email about it.
>>
>> Aside: It's really tempting to attempt to separate builders which are
>> "still failing" (e.g. a rare configuration which has been broken for a few
>> days) from "flaky" ones.  I'd argue any bot notifying on a "still failing"
>> case is buggy, and thus it's fine to treat them the same as a "flaky" bot.
>>
>>
>> *Slow Builders and Redundant Notices *
>>
>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>>
>> Occasionally, we have a bad commit land which breaks every (or nearly
>> every) builder.  That happens.  If you happen to land a change just before
>> or after it, you then get on the blame list for every slow running builder
>> we have (since they tend to have large commit windows) if they happen to
>> cycle before the fix is committed.  This is particularly annoying since the
>> root issue is likely fixed quickly, but due to cycle times on the builders,
>> you may be getting emails for 24 hours to come.
>>
>> Recommendation: Introduce a new requirement for "slow" builders (say
>> cycle time of > 30 minutes) either a) have a maximum commit window of ~15
>> commits, or b) use a staged builder model.  Personally, I'd prefer the
>> staged model, but the max commit window at least helps to limit the
>> damage.
>>
>> By "staged builder model", I mean that slow builders only build points in
>> the history which have already been successfully build by one of the fast
>> builders.  This eliminates redundant build failures, at the cost of
>> delaying the slow builder slightly.  As long as the slow builder uses the
>> "last good commit" as opposed to waiting until the current fast builder
>> finishes, the delay should be very minimal for most commits.
>>
>
> Does buildbot support staged builders? That would really be ideal indeed!
> If we could also disable notification to the blamelist when it is larger
> than 5, that'd be great!
>
> I'll be honest here and say I don't know what buildbot natively supports.
> Even if it doesn't, there are "easy" process workarounds to achieve the
> same effect.  Just as an example (i.e. definitely not proposing this as
> technical solution to be implemented right now), we could introduce a new
> branch in git called e.g. "buildbot-tracking-slow" and have a specific fast
> builder do a fast forward merge from main into this branch.  All "slow"
> builders would simply follow this branch and not main.
>
> If we get consensus that this is the right approach, I am willing to put
> some of my own time to figuring out how to implement this.  For my own
> volunteer time, I'd probably start with the flaky bot test commit piece
> just because that's much easier to do manually first and then automate, and
> because I find them personally more annoying.
>
There was some prototype buildbot based (so far as I can tell) staged
builder setup years ago, it seems:
https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's some
commit history in the zorg repo that shows how it was configured.

> Your point about disabling notification on a blamelist larger than 5 seems
> reasonable to me, but I'd definitely consider "build chunks of no more than
> N commits" and "build arbitrary sets, but only notify if less than M
> people" as distinct possibilities to be evaluated independently.
>
"chunks of no more than N commits" risks slower builders getting behind -
presumably you'd need some catchup mechanism (run a build with all the
available changes, but don't notify) if they just kept falling further and
further behind.

- Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/70a02d83/attachment.html>


More information about the llvm-dev mailing list