[llvm-dev] Responsibilities of a buildbot owner

Wed Jan 12 21:18:55 PST 2022

On Wed, Jan 12, 2022 at 7:33 PM Galina Kistanova <gkistanova at gmail.com>
wrote:

> Hello everyone,
>
> In continuation of the Responsibilities of a buildbot owner thread.
>
> First of all, thank you very much for being buildbot owners! This is much
> appreciated.
> Thank you for bringing good points to the discussion.
>
> It is expected that buildbot owners own bots which are reliable,
> informative and helpful to the community.
>
> Effectively that means if a problem is detected by a builder and it is
> hard to pinpoint the cause of the issue and a commit to blame, a buildbot
> owner is naturally on the escalation path. Someone has to get to the root of
> the problem and fix it one way or another (by reverting the commit, or by
> proposing a patch, or by working with the author of the commit which
> introduced the issue). In the majority of the cases someone takes care of
> an issue. But sometimes it takes a buildbot owner to push. Every buildbot
> owner does this from time to time.
>
> Hi Mehdi,
>
> > Something quite annoying with staging is that it does not have (as far
> > as I know) a way to continue to notify the buildbot owner.
>
> You mentioned this recently in one of the reviews. With
> https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826
> in place, you can add the tag "silent" to your production builder, and it
> will not send notifications to the blame list. You can set the exact
> notifications you want in the master/config/status.py for that builder.
> Hope this helps you.
>
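
A minimal sketch of the routing described above, for illustration only. It is
plain Python, not llvm-zorg or Buildbot code: the `Builder` class, its `owners`
field, and `recipients_for_failure` are hypothetical names, and only the
"silent" tag itself comes from the message above. The assumption is that a
silent builder mails only its explicitly configured recipients, while any
other builder also mails the blame list.

```python
from dataclasses import dataclass


@dataclass
class Builder:
    """Hypothetical stand-in for a builder entry; not the llvm-zorg schema."""
    name: str
    tags: tuple = ()
    owners: tuple = ()  # assumed: addresses configured explicitly for this builder


def recipients_for_failure(builder, blame_list):
    """Return the sorted list of addresses to mail when `builder` fails."""
    if "silent" in builder.tags:
        # Silent builders skip the blame list; only configured owners are mailed.
        return sorted(set(builder.owners))
    # Otherwise both the commit authors and the owners are notified (assumption).
    return sorted(set(blame_list) | set(builder.owners))


if __name__ == "__main__":
    bot = Builder("example-builder", tags=("silent",), owners=("owner@example.com",))
    print(recipients_for_failure(bot, ["author@example.com"]))
```

With this shape, taking a flaky builder out of developers' inboxes is a
one-tag change, and the owner keeps receiving failures while triaging.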

Fantastic! I'll use this for the next steps for my bots (when I get back to
it, I slacked on this recently...) :)

We may also use this on flaky bots in the future?

Thanks,

-- 
Mehdi

> I do not want staging to be able to send emails at all. We debug and
> test many things there, including notifications, and there is always a risk
> of spam.
>
> Thanks
>
> Galina
>
> On Sun, Jan 9, 2022 at 6:07 PM David Blaikie via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> +1 to most of what Mehdi's said here - I'd love to see improvements in
>> stability, though probably having some rigid delegation of responsibility
>> (rather than relying on developers to judge whether it's a flaky test or
>> flaky bot - that isn't always obvious, maybe it's only flaky on a
>> particular configuration that that buildbot happens to test and the
>> developer doesn't have access to - then which is it?) might help (eg: if
>> it's at all unclear, then the assumption is that it's always the test or
>> always the buildbot owner - and an expectation that the author or owner
>> then takes responsibility for working with the other party to address the
>> issue, etc).
>>
>> That all said, disabling individual tests may risk no one caring enough
>> to re-enable them, especially when the flakiness is found long after the
>> change is made that introduced the test or flakiness (usually the case with
>> flakiness - it takes a while to become apparent) - I don't really know how
>> to address that issue. The "convenience" with disabling a buildbot is that
>> there's other value to the buildbot (other than the flaky test that was
>> providing negative value), so buildbot owners have more motivation to get
>> the bot back online - though I don't want to burden buildbot owners unduly
>> either (because they'd eventually give up on them) :/
>>
>> - Dave
>>
>> On Sat, Jan 8, 2022 at 5:15 PM Mehdi AMINI via llvm-dev <
>> llvm-dev at lists.llvm.org> wrote:
>>
>>> Hi,
>>>
>>> First: thanks a lot Stella for being a bot owner and providing valuable
>>> resources to the community. The sequence of events is really unfortunate
>>> here, and thank you for bringing it up to everyone's attention, let's try
>>> to improve our processes.
>>>
>>> On Sat, Jan 8, 2022 at 1:01 PM Philip Reames via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Stella,
>>>>
>>>> Thank you for raising the question.  This is a great discussion for us
>>>> to have publicly.
>>>>
>>>> So folks know, I am the individual Stella mentioned below.  I'll start
>>>> with a bit of history so that everyone's on the same page, then dive into
>>>> the policy question.
>>>>
>>>> My general take is that buildbots are only useful if failure
>>>> notifications are generally actionable.  A couple months back, I was on the
>>>> edge of setting up mail filter rules to auto-delete a bunch of bots because
>>>> they were regularly broken, and decided I should try to be constructive
>>>> first.  In the first wave of that, I emailed a couple of bot owners about
>>>> things which seemed like false positives.
>>>>
>>>> At the time, I thought it was the bot owner's responsibility to not be
>>>> testing a flaky configuration.  I got a bit of push back on that from a
>>>> couple sources - Stella was one - and put that question on hold.  This
>>>> thread is a great opportunity to decide what our policy actually is, and
>>>> document it.
>>>>
>>>> In the meantime, I've been working with Galina to document existing
>>>> practice where we could, and to try to identify best practices on setting
>>>> up bots.  These changes have been posted publicly, and reviewed through the
>>>> normal process.  We've been deliberately trying to stick to
>>>> non-controversial stuff as we got the docs improved.  I've been actively
>>>> reaching out to bot owners to gather feedback in this process, but Stella
>>>> had not, yet, been one.
>>>>
>>>> Separately, this week I noticed a bot which was repeatedly toggling
>>>> between red and green.  I forget the exact ratio, but in the recent build
>>>> history, there were multiple transitions, seemingly unrelated to the
>>>> changes being committed.  I emailed Galina asking her to address, and she
>>>> removed the buildbot until it could be moved to the staging buildmaster,
>>>> addressed, and then restored.  I left Stella off the initial email.  Sorry
>>>> about that, no ill intent, just written in a hurry.
>>>>
>>>> Now, transitioning into a bit of policy discussion...
>>>>
>>>> From my conversations with existing bot owners, there is a general
>>>> agreement that bots should only be notifying the community if they are
>>>> stable enough.  There's honest disagreement on what the bar for stable
>>>> enough is, and disagreement about exactly whose responsibility addressing
>>>> new instability is.  (To be clear, I'd separate instability from a clear
>>>> deterministic breakage caused by a commit - we have a lot more agreement on
>>>> that.)
>>>>
>>>> My personal take is that for a bot to be publicly notifying, "someone"
>>>> needs to take the responsibility to backstop the normal revert to green
>>>> process.  This "someone" can be developers who work in a particular area,
>>>> the bot owner, or some combination thereof.  I view the responsibility of
>>>> the bot config owner as being the person responsible for making sure that
>>>> backstopping is happening.  Not necessarily by doing it themselves, but by
>>>> having the contacts with developers who can, and following up when the
>>>> normal flow is not working.
>>>>
>>>> In this particular example, we appear to have a bunch of flaky lldb
>>>> tests.  I personally know absolutely nothing about lldb.  I have no idea
>>>> whether the tests are badly designed, the system they're being run on isn't
>>>> yet supported by lldb, or if there's some recent code bug introduced which
>>>> causes the failure.  "Someone" needs to take the responsibility of figuring
>>>> that out, and in the meantime spamming developers with unactionable failure
>>>> notices seems undesirable.
>>>>
>>>
>>> I generally agree with the overall sentiment. I would add that something
>>> worth differentiating is that the source of flakiness can be coming from
>>> the bot itself (flaky hardware / fragile setup), or from the test/codebase
>>> itself (a flaky bot may just be a deterministic ASAN failure).
>>> Of course from Philip's point of view it does not matter: the effect on
>>> the developer is similar, we get undesirable and unactionable
>>> notifications. From the maintenance flow however, it matters in that the
>>> "someone" who has to take responsibility is often not the same group of
>>> folks.
>>> Also when encountering flaky tests, the best action may not be to
>>> disable the bot itself but instead to disable the test itself! (and file a
>>> bug against the test owner...).
>>>
>>> One more dimension that seems to surface here may be different practices
>>> or expectations across subprojects, for example here the LLDB folks may be
>>> used to having some flaky tests, but they trigger on changes to LLVM
>>> itself, where we may not expect any flakiness (or so).
>>>
>>>
>>>> For context, the bot was disabled until it could be moved to the
>>>> staging buildmaster.  Moving to staging is required (currently) to disable
>>>> developer notification.  In the email from Galina, it seems clear that the
>>>> bot would be fine to move back to production once the issue was triaged.
>>>> This seems entirely reasonable to me.
>>>>
>>>
>>> Something quite annoying with staging is that it does not have (as far
>>> as I know) a way to continue to notify the buildbot owner. I don't really
>>> care about staging vs prod as much as having a mode to just "not notify the
>>> blame list" / "only notify the owner".
>>>
>>> --
>>> Mehdi
>>>
>>>
>>>
>>>> Philip
>>>>
>>>> p.s. One thing I'll note as a definite problem with the current system
>>>> is that a lot of this happens in private email, and it's hard to share so
>>>> that everyone has a good picture of what's going on.  It makes
>>>> miscommunications all too easy.  Last time I spoke with Galina, we were
>>>> tentatively planning to start using github issues for bot operation matters
>>>> to address that, but as that was in the middle of the transition from
>>>> bugzilla, we deferred and haven't gotten back to that yet.
>>>>
>>>> p.p.s. The bot in question is
>>>> https://lab.llvm.org/buildbot/#/builders/83 if folks want to examine
>>>> the history themselves.
>>>>
>>>> On 1/8/22 12:06 PM, Stella Stamenova via llvm-dev wrote:
>>>>
>>>> Hey all,
>>>>
>>>>
>>>>
>>>> I have a couple of questions about what the responsibilities of a
>>>> buildbot owner are. I’ve been maintaining a couple of buildbots for lldb
>>>> and mlir for some time now and I thought I had a pretty good idea of what
>>>> is required based on the documentation here: How To Add Your Build
>>>> Configuration To LLVM Buildbot Infrastructure — LLVM 13 documentation
>>>> <https://www.llvm.org/docs/HowToAddABuilder.html>
>>>>
>>>>
>>>>
>>>> My understanding was that there are some things that are **expected**
>>>> of the owner. Namely:
>>>>
>>>>    1. Make sure that the buildbot is connected and has the right
>>>>    infrastructure (e.g. the right version of Python, or tools, etc.). Update
>>>>    as needed.
>>>>    2. Make sure that the build configuration is one that is supported
>>>>    (e.g. supported flavor or cmake variables). Update as needed.
>>>>
>>>>
>>>>
>>>> There are also a couple of things that are **optional**, but nice to
>>>> have:
>>>>
>>>>    1. If the buildbot stays red for a while (where “a while” is
>>>>    completely subjective), figure out the patch or patches that are causing an
>>>>    issue and either revert them or notify the authors, so they can take action.
>>>>    2. If someone is having trouble investigating a failure that only
>>>>    happens on the buildbot (or the buildbot is a rare configuration), help
>>>>    them out (e.g. collect logs if possible).
>>>>
>>>>
>>>>
>>>> Up to now, I’ve not had any issues with this and the community has been
>>>> very good at fixing issues with builds and tests when I point them out, or
>>>> more often than not, without me having to do anything but the occasional
>>>> test re-run and software update (like this one, for example: D114639,
>>>> "Raise the minimum Visual Studio version to VS2019"
>>>> <https://reviews.llvm.org/D114639>). lldb has some tests that are
>>>> flaky because of the nature of the product, so there is some noise, but
>>>> mostly things work well and everyone seems happy.
>>>>
>>>>
>>>>
>>>> I’ve recently run into a situation that makes me wonder whether there
>>>> are other expectations of a buildbot owner that are not explicitly listed
>>>> in the llvm documentation. Someone reached out to me some time ago to let
>>>> me know their unhappiness at the flakiness of some of the lldb tests and
>>>> demanded that I either fix them or disable them. I let them know that there
>>>> are some tests that are known to be flaky, that my expectation is that it
>>>> is not my responsibility to fix all such issues and that the community
>>>> would be very happy to have their contribution in the form of a fix or a
>>>> change to disable the tests. I didn’t get a response from this person, but
>>>> I did disable a couple of particularly flaky tests since it seemed like the
>>>> nice thing to do.
>>>>
>>>>
>>>>
>>>> The real excitement happened yesterday when I received an email that **the
>>>> build bot had been turned off**. This same person reached out to the
>>>> powers that be (without letting me know) and asked them explicitly to
>>>> silence it **without my active involvement** because of the flakiness.
>>>>
>>>>
>>>>
>>>> I have a couple of issues with this approach but perhaps I’ve
>>>> misunderstood what my responsibilities are as the buildbot owner. I know it
>>>> is frustrating to see a bot fail because of flaky tests and it is nice to
>>>> have someone to ask to resolve them all – is that really the expectation of
>>>> a buildbot owner? Where is the line between maintenance of the bot and
>>>> fixing build and test issues for the community?
>>>>
>>>>
>>>>
>>>> I’d like to understand what the general expectations are and if there
>>>> are things missing from the documentation, I propose that we add them, so
>>>> that it is clear for everyone what is required.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> -Stella
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev