[llvm-dev] False positive notifications around commit notifications

Mehdi AMINI via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 10 17:57:43 PDT 2021


On Fri, Sep 10, 2021 at 2:03 PM David Blaikie <dblaikie at gmail.com> wrote:

> The folks on the Infrastructure Working Group:
> https://foundation.llvm.org/docs/infrastructure-wg/ might have some
> context on this (I think Christian Kuhnel is part of that & he's on this
> list).
>
> On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>>
>> On 9/10/21 11:36 AM, Mehdi AMINI wrote:
>>
>>
>>
>> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> I've been noticing a trend where there is more and more false positive
>>> email notifications sent out on valid commits.  This is getting really
>>> problematic as real signal is being lost in the noise.  I've had several
>>> cases in the last few weeks where I did not see a "real" failure notice
>>> because it was buried in a bunch of false positives.
>>>
>>> Let me run through a few sources of what I consider false positives, and
>>> suggest a couple things we could do to clean these up.  Note that the
>>> recommendations here are entirely independent and we can adopt any subset.
>>>
>>> *Slow Try Bots*
>>>
>>> ex: "This revision was landed with ongoing or failed builds." on
>>> https://reviews.llvm.org/D109091
>>>
>>> Someone - I'm not really sure who - enabled builds for all reviews, and
>>> this notice on landed commits.  Given it's utterly routine to make a last
>>> few style fixes before landing an LGTMed change
>>>
>>
>> I do such "few style fixes", but I don't re-upload a revision before
>> landing, so I don't see this "false positive" in general.
>>
>> I don't explicit upload the final patch either, but something in the
>> close automation does.
>>
>> What I frequently see is that the pre-merge config is broken for some
>> other reason, and that's quite annoying. One aspect of the issue is that
>> the is no buildbot tracking the pre-merge configuration so it can be broken
>> without notification (there is a buildkite job tracking it, but buildkite
>> does not support blamelist notifications).
>>
>> Hm, maybe I misinterpreted the cause of these entirely?  Your explanation
>> sounds plausible as well.
>>
>> If your explanation is correct, that would lean strongly to the "just
>> disable" option.
>>
>> Do you know who to contact about this?  (i.e. Who owns the automation
>> here?  Or where is the appropriate code to adjust?)
>>
> Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it
> was setup by https://reviews.llvm.org/p/goncharov/ (added them to the
> "to" line on this email) Also looks like Kuhnel (
> https://reviews.llvm.org/H576 ) has sometthing to do with it, so I've
> added them too.
>
>> , I consider this notice complete noise.  In practice, almost review gets
>>> tagged this way.  To be clear, there is value in being told about changes
>>> which don't build.  The false positive part is only around the "ongoing"
>>> builds.
>>>
>>> Recommendation: Disable this message for the "ongoing" build case, and
>>> if we can't, disable them entirely.
>>>
>>> *Flaky Builders*
>>>
>>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>>>
>>> We have many build bots which are not entirely stable.  It's gotten to
>>> the point where I *expect* failure notifications on literally every change
>>> I land.  I've been trying to reach out to individual build bot owners to
>>> get issues resolved, and to their credit, most owners have been very
>>> responsive.  However, we have enough builders that the situation isn't
>>> getting meaningful better.
>>>
>>> Recommendation: Introduce specific "test commits" whose only purpose is
>>> to run the CI infrastructure.  Any builder which notifies of failure on
>>> such a commit (and only said commit) is disabled without discussion until
>>> human action is taken by the bot owner to re-enable.  The idea here is to
>>> a) automate the process, and b) shift the responsibility of action to the
>>> bot owner for any flaky bot.
>>>
>>> Note: By "disabled", I specifically mean that *notification* is
>>> disabled.  Leaving it in the waterfall view is fine, as long as we're not
>>> sending out email about it.
>>>
>>> Aside: It's really tempting to attempt to separate builders which are
>>> "still failing" (e.g. a rare configuration which has been broken for a few
>>> days) from "flaky" ones.  I'd argue any bot notifying on a "still failing"
>>> case is buggy, and thus it's fine to treat them the same as a "flaky" bot.
>>>
>>>
>>> *Slow Builders and Redundant Notices *
>>>
>>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>>>
>>> Occasionally, we have a bad commit land which breaks every (or nearly
>>> every) builder.  That happens.  If you happen to land a change just before
>>> or after it, you then get on the blame list for every slow running builder
>>> we have (since they tend to have large commit windows) if they happen to
>>> cycle before the fix is committed.  This is particularly annoying since the
>>> root issue is likely fixed quickly, but due to cycle times on the builders,
>>> you may be getting emails for 24 hours to come.
>>>
>>> Recommendation: Introduce a new requirement for "slow" builders (say
>>> cycle time of > 30 minutes) either a) have a maximum commit window of ~15
>>> commits, or b) use a staged builder model.  Personally, I'd prefer the
>>> staged model, but the max commit window at least helps to limit the
>>> damage.
>>>
>>> By "staged builder model", I mean that slow builders only build points
>>> in the history which have already been successfully build by one of the
>>> fast builders.  This eliminates redundant build failures, at the cost of
>>> delaying the slow builder slightly.  As long as the slow builder uses the
>>> "last good commit" as opposed to waiting until the current fast builder
>>> finishes, the delay should be very minimal for most commits.
>>>
>>
>> Does buildbot support staged builders? That would really be ideal indeed!
>> If we could also disable notification to the blamelist when it is larger
>> than 5, that'd be great!
>>
>> I'll be honest here and say I don't know what buildbot natively
>> supports.  Even if it doesn't, there are "easy" process workarounds to
>> achieve the same effect.  Just as an example (i.e. definitely not proposing
>> this as technical solution to be implemented right now), we could introduce
>> a new branch in git called e.g. "buildbot-tracking-slow" and have a
>> specific fast builder do a fast forward merge from main into this branch.
>> All "slow" builders would simply follow this branch and not main.
>>
>> If we get consensus that this is the right approach, I am willing to put
>> some of my own time to figuring out how to implement this.  For my own
>> volunteer time, I'd probably start with the flaky bot test commit piece
>> just because that's much easier to do manually first and then automate, and
>> because I find them personally more annoying.
>>
> There was some prototype buildbot based (so far as I can tell) staged
> builder setup years ago, it seems:
> https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's some
> commit history in the zorg repo that shows how it was configured.
>
>> Your point about disabling notification on a blamelist larger than 5
>> seems reasonable to me, but I'd definitely consider "build chunks of no
>> more than N commits" and "build arbitrary sets, but only notify if less
>> than M people" as distinct possibilities to be evaluated independently.
>>
> "chunks of no more than N commits" risks slower builders getting behind -
> presumably you'd need some catchup mechanism (run a build with all the
> available changes, but don't notify) if they just kept falling further and
> further behind.
>

Slightly off-topic, but relevant for slow builders: switching to using
`ccache` made a huge difference in for one of our builder (
https://lab.llvm.org/buildbot/#/builders/61 ) ; it'll occasionally spend
25-30 min but otherwise is mostly taking only ~4min. Maybe we could advise
this more widely as well?

-- 
Mehdi

>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/4f7e3eac/attachment.html>


More information about the llvm-dev mailing list