[llvm-dev] False positive notifications around commit notifications

Fri Sep 10 18:01:58 PDT 2021

On Fri, Sep 10, 2021 at 5:58 PM Mehdi AMINI <joker.eph at gmail.com> wrote:

>
>
> On Fri, Sep 10, 2021 at 2:03 PM David Blaikie <dblaikie at gmail.com> wrote:
>
>> The folks on the Infrastructure Working Group:
>> https://foundation.llvm.org/docs/infrastructure-wg/ might have some
>> context on this (I think Christian Kuhnel is part of that & he's on this
>> list).
>>
>> On Fri, Sep 10, 2021 at 11:53 AM Philip Reames via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>>
>>> On 9/10/21 11:36 AM, Mehdi AMINI wrote:
>>>
>>>
>>>
>>> On Thu, Sep 9, 2021 at 3:18 PM Philip Reames via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> I've been noticing a trend where there is more and more false positive
>>>> email notifications sent out on valid commits.  This is getting really
>>>> problematic as real signal is being lost in the noise.  I've had several
>>>> cases in the last few weeks where I did not see a "real" failure notice
>>>> because it was buried in a bunch of false positives.
>>>>
>>>> Let me run through a few sources of what I consider false positives,
>>>> and suggest a couple things we could do to clean these up.  Note that the
>>>> recommendations here are entirely independent and we can adopt any subset.
>>>>
>>>> *Slow Try Bots*
>>>>
>>>> ex: "This revision was landed with ongoing or failed builds." on
>>>> https://reviews.llvm.org/D109091
>>>>
>>>> Someone - I'm not really sure who - enabled builds for all reviews, and
>>>> this notice on landed commits.  Given it's utterly routine to make a last
>>>> few style fixes before landing an LGTMed change
>>>>
>>>
>>> I do such "few style fixes", but I don't re-upload a revision before
>>> landing, so I don't see this "false positive" in general.
>>>
>>> I don't explicit upload the final patch either, but something in the
>>> close automation does.
>>>
>>> What I frequently see is that the pre-merge config is broken for some
>>> other reason, and that's quite annoying. One aspect of the issue is that
>>> the is no buildbot tracking the pre-merge configuration so it can be broken
>>> without notification (there is a buildkite job tracking it, but buildkite
>>> does not support blamelist notifications).
>>>
>>> Hm, maybe I misinterpreted the cause of these entirely?  Your
>>> explanation sounds plausible as well.
>>>
>>> If your explanation is correct, that would lean strongly to the "just
>>> disable" option.
>>>
>>> Do you know who to contact about this?  (i.e. Who owns the automation
>>> here?  Or where is the appropriate code to adjust?)
>>>
>> Looks like: https://reviews.llvm.org/harbormaster/plan/5/ indicates it
>> was setup by https://reviews.llvm.org/p/goncharov/ (added them to the
>> "to" line on this email) Also looks like Kuhnel (
>> https://reviews.llvm.org/H576 ) has sometthing to do with it, so I've
>> added them too.
>>
>>> , I consider this notice complete noise.  In practice, almost review
>>>> gets tagged this way.  To be clear, there is value in being told about
>>>> changes which don't build.  The false positive part is only around the
>>>> "ongoing" builds.
>>>>
>>>> Recommendation: Disable this message for the "ongoing" build case, and
>>>> if we can't, disable them entirely.
>>>>
>>>> *Flaky Builders*
>>>>
>>>> ex: https://lab.llvm.org/buildbot/#/builders/68/builds/18250
>>>>
>>>> We have many build bots which are not entirely stable.  It's gotten to
>>>> the point where I *expect* failure notifications on literally every change
>>>> I land.  I've been trying to reach out to individual build bot owners to
>>>> get issues resolved, and to their credit, most owners have been very
>>>> responsive.  However, we have enough builders that the situation isn't
>>>> getting meaningful better.
>>>>
>>>> Recommendation: Introduce specific "test commits" whose only purpose is
>>>> to run the CI infrastructure.  Any builder which notifies of failure on
>>>> such a commit (and only said commit) is disabled without discussion until
>>>> human action is taken by the bot owner to re-enable.  The idea here is to
>>>> a) automate the process, and b) shift the responsibility of action to the
>>>> bot owner for any flaky bot.
>>>>
>>>> Note: By "disabled", I specifically mean that *notification* is
>>>> disabled.  Leaving it in the waterfall view is fine, as long as we're not
>>>> sending out email about it.
>>>>
>>>> Aside: It's really tempting to attempt to separate builders which are
>>>> "still failing" (e.g. a rare configuration which has been broken for a few
>>>> days) from "flaky" ones.  I'd argue any bot notifying on a "still failing"
>>>> case is buggy, and thus it's fine to treat them the same as a "flaky" bot.
>>>>
>>>>
>>>> *Slow Builders and Redundant Notices *
>>>>
>>>> ex: https://lab.llvm.org/buildbot#builders/67/builds/4128
>>>>
>>>> Occasionally, we have a bad commit land which breaks every (or nearly
>>>> every) builder.  That happens.  If you happen to land a change just before
>>>> or after it, you then get on the blame list for every slow running builder
>>>> we have (since they tend to have large commit windows) if they happen to
>>>> cycle before the fix is committed.  This is particularly annoying since the
>>>> root issue is likely fixed quickly, but due to cycle times on the builders,
>>>> you may be getting emails for 24 hours to come.
>>>>
>>>> Recommendation: Introduce a new requirement for "slow" builders (say
>>>> cycle time of > 30 minutes) either a) have a maximum commit window of ~15
>>>> commits, or b) use a staged builder model.  Personally, I'd prefer the
>>>> staged model, but the max commit window at least helps to limit the
>>>> damage.
>>>>
>>>> By "staged builder model", I mean that slow builders only build points
>>>> in the history which have already been successfully build by one of the
>>>> fast builders.  This eliminates redundant build failures, at the cost of
>>>> delaying the slow builder slightly.  As long as the slow builder uses the
>>>> "last good commit" as opposed to waiting until the current fast builder
>>>> finishes, the delay should be very minimal for most commits.
>>>>
>>>
>>> Does buildbot support staged builders? That would really be ideal indeed!
>>> If we could also disable notification to the blamelist when it is larger
>>> than 5, that'd be great!
>>>
>>> I'll be honest here and say I don't know what buildbot natively
>>> supports.  Even if it doesn't, there are "easy" process workarounds to
>>> achieve the same effect.  Just as an example (i.e. definitely not proposing
>>> this as technical solution to be implemented right now), we could introduce
>>> a new branch in git called e.g. "buildbot-tracking-slow" and have a
>>> specific fast builder do a fast forward merge from main into this branch.
>>> All "slow" builders would simply follow this branch and not main.
>>>
>>> If we get consensus that this is the right approach, I am willing to put
>>> some of my own time to figuring out how to implement this.  For my own
>>> volunteer time, I'd probably start with the flaky bot test commit piece
>>> just because that's much easier to do manually first and then automate, and
>>> because I find them personally more annoying.
>>>
>> There was some prototype buildbot based (so far as I can tell) staged
>> builder setup years ago, it seems:
>> https://marc.info/?l=cfe-dev&m=136442525121902&w=2 - perhaps there's
>> some commit history in the zorg repo that shows how it was configured.
>>
>>> Your point about disabling notification on a blamelist larger than 5
>>> seems reasonable to me, but I'd definitely consider "build chunks of no
>>> more than N commits" and "build arbitrary sets, but only notify if less
>>> than M people" as distinct possibilities to be evaluated independently.
>>>
>> "chunks of no more than N commits" risks slower builders getting behind -
>> presumably you'd need some catchup mechanism (run a build with all the
>> available changes, but don't notify) if they just kept falling further and
>> further behind.
>>
>
> Slightly off-topic, but relevant for slow builders: switching to using
> `ccache` made a huge difference in for one of our builder (
> https://lab.llvm.org/buildbot/#/builders/61 ) ; it'll occasionally spend
> 25-30 min but otherwise is mostly taking only ~4min. Maybe we could advise
> this more widely as well?
>

Sounds alright. I'd love to see incremental builds be made reliable enough
that builders would depend on them - not that that would completely
invalidate benefits of ccache, but presumably a lot of the benefit comes
when doing clean builds (& I get why clean builds are desirable - changes
to CMake, etc, don't always work perfectly without a clean rebuild today)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20210910/bbb54829/attachment-0001.html>