[llvm-dev] [EXTERNAL] Re: Responsibilities of a buildbot owner

Thu Jan 13 13:41:55 PST 2022

There are a couple of things on this thread that sound nice in general, but have not been clarified either in the discussion or in the documentation. Since the devil is in the details, I’d like to see us agree on the details and then have them added to the documentation.

At the end of the day, there should be no surprises in the process and everything that can be should be quantified.

We want to encourage people to be responsible code and buildbot owners, not discourage them from contributing at all.

> It is expected that buildbot owners own bots which are reliable, informative and helpful to the community.

In my experience, every buildbot has occasional “flakiness” – be it because of code failures that don’t happen every time or because of connectivity issues, etc. Some bots are also often broken not because of any flakiness, but because with the large number of commits, there are bound to be failures.

So what makes a bot not reliable enough? Some percentage of builds failing? Some percentage of false positives? Does it vary per project or is there a single expectation for all of llvm?

I think it makes sense to say that false positives above a certain threshold make a buildbot not reliable enough and the threshold should be documented. It also makes sense to say that failures above a certain threshold make a bot not reliable enough – if the codebase is fragile enough that most commits cause breaks, it is possible that a reliable buildbot for it cannot exist.

>  "someone" needs to take the responsibility to backstop the normal revert to green process.

As Mehdi pointed out earlier, the root cause of the failure might mean that the buildbot owner or that a code owner is better suited to addressing it. Philip’s argument is that at the end of the day, it is always the buildbot owner if a code owner hasn’t come forward. It makes sense to have someone who is ultimately responsible and it also makes sense that everyone needs to be given time and notice to act on the failures.

There has also been some mention of different ways to “silence” a buildbot – either by turning it off entirely and waiting for a bot owner to reconnect it to staging or production, or by tagging it as “silent”. In my experience, there’s a huge difference between using the “silent” tag and turning a bot off. In the first case, the bot owners will continue to receive notifications and the builds will continue to run. Even if the bot is red already, there’s some chance that new commits that add breaks will be possible to figure out by looking at the logs either by other interested parties, or by the bot owners themselves. When a bot is turned off for any period of time, there’s nothing that can be used to determine when new failures were checked in (aside from local builds, so many local builds) and it can be incredibly painful to track down. I think bots should only be forcefully turned off very rarely and when nothing else can be done and with plenty of notice.

So then, what is the flow when a bot starts having issues? I would propose that it be something like this:

  1.  Code owners have to address issues in X amount of time.
  2.  If the code owners has failed to address the situation, it falls to the buildbot owners. Perhaps at the beginning or in the middle of this period, the bot owners get an email that says: “Hey, so and so, we’re close to tagging the bot “silent”, can you have a look?”
  3.  If both the code owners and the buildbot owners have failed to address the situation, the bot gets tagged “silent”. The buildbot owner gets notified that this happened and the notification spells out how much longer they have before the bot gets turned off.
  4.  If both the code owners and the buildbot owners have failed to address the situation for some time longer, the bot gets turned off.

Each of this steps should be allowed a pre-determined amount of time. A few hours? A few days? Ideally, each of the transitions (but definitely 2->3->4) come with notifications. If it was possible for a bot to be moved to staging automatically, we could even have an extra step where it gets moved to staging before it gets turned off. I don’t think that’s currently possible though.

> The main problem with flaky tests is random false blames. People get annoyed and stop paying attention to failures on a particular builder, and other builders as well, arguing that build bot in general is not reliable.

Galina made a good point to me that people get annoyed by failures and stop paying attention to all buildbots. I can see how flaky tests/bots contribute to the general ignoring of the buildbots, but I would argue that the root cause is the sheer volume of build breaks that are not the fault of a committer. The few times I’ve made commits to llvm, for example, I’ve always gotten at least one email about a break that was unrelated to my change (because my changes are perfect, thank you very much). This larger problem of build breaks is much harder to address than flaky bots or tests, but I think would improve the health of llvm & friends significantly more (and in the meantime, we could tolerate some “flakiness”).

Thanks,
-Stella

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Galina Kistanova via llvm-dev
Sent: Wednesday, January 12, 2022 11:24 PM
To: Mehdi AMINI <joker.eph at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: [EXTERNAL] Re: [llvm-dev] Responsibilities of a buildbot owner

> We may also use this on flaky bots in the future?

Yes, we may.
Or we may try to do our best to fix them. :)

Moving workers to the staging temporarily to investigate and address an issue is fine. Gives a bit more elbow room for experimenting, as we can apply experimental patches there, restart the staging as needed and often, and so on. Which is not the case with the production. It does not take much effort to move a worker between the staging and the production areas - a simple edit of the buildbot.tac file and a worker restart.

Tagging a builder "silent" means there is a designated person or a team who is actively fixing the detected issues or acting as a proxy to handle the blame list. This could be a way to dial with flaky bots, indeed, assuming there is somebody taking care of those builders, not just a way to skip the annoyance and keep the status quo.

By the way, thanks everyone for the constructive and polite discussion! It seems we are going to have a more stable and informative Windows LLDB builder.

Galina

On Wed, Jan 12, 2022 at 9:19 PM Mehdi AMINI <joker.eph at gmail.com<mailto:joker.eph at gmail.com>> wrote:

On Wed, Jan 12, 2022 at 7:33 PM Galina Kistanova <gkistanova at gmail.com<mailto:gkistanova at gmail.com>> wrote:
Hello everyone,

In continuation of the Responsibilities of a buildbot owner thread.

First of all, thank you very much for being buildbot owners! This is much appreciated.
Thank you for bringing good points to the discussion.

It is expected that buildbot owners own bots which are reliable, informative and helpful to the community.

Effectively that means if a problem is detected by a builder and it is hard to pinpoint the reason of the issue and a commit to blame, a buildbot owner is natively on the escalation path. Someone has to get to the root of the problem and fix it one way or another (by reverting the commit, or by proposing a patch, or by working with the author of the commit which introduced the issue). In the majority of the cases someone takes care of an issue. But sometimes it takes a buildbot owner to push. Every buildbot owner does this from time to time.

Hi Mehdi,

> Something quite annoying with staging is that it does not have (as far as I know) a way
> to continue to notify the buildbot owner.

You mentioned this recently in one of the reviews. With https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fllvm%2Fllvm-zorg%2Fcommit%2F3c5b8f5bbc37076036997b3dd8b0137252bcb826&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=L2N0O2%2FoSTSXv8wPViTIQPuGZGJqQ08D28mgubIhVLE%3D&reserved=0> in place, you can add the tag "silent" to your production builder, and it will not send notifications to the blame list. You can set the exact notifications you want in the master/config/status.py for that builder. Hope this helps you.

Fantastic! I'll use this for the next steps for my bots (when I get back to it, I slacked on this recently...) :)

We may also use this on flaky bots in the future?

Thanks,

--
Mehdi

I do not want to have the staging even able to send emails. We debug and test many things there, including notifications, and there is always a risk of spam.

Thanks

Galina

On Sun, Jan 9, 2022 at 6:07 PM David Blaikie via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
+1 to most of what Mehdi's said here - I'd love to see improvements in stability, though probably having some rigid delegation of responsibility (rather than relying on developers to judge whether it's a flaky test or flaky bot - that isn't always obvious, maybe it's only flaky on a particular configuration that that buildbot happens to test and the developer doesn't have access to - then which is it?) might help (eg: if it's at all unclear, then the assumption is that it's always the test or always the buildbot owner - and an expectation that the author or owner then takes responsibility for working with the other party to address the issue, etc).

That all said, disabling individual tests may risk no one caring enough to re-enable them, especially when the flakiness is found long after the change is made that introduced the test or flakiness (usually the case with flakiness - it takes a while to become apparent) - I don't really know how to address that issue. The "convenience" with disabling a buildbot is that there's other value to the buildbot (other than the flaky test that was providing negative value), so buildbot owners have more motivation to get the bot back online - though I don't want to burden buildbot owners unduly either (because they'd eventually give up on them) :/

- Dave

On Sat, Jan 8, 2022 at 5:15 PM Mehdi AMINI via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:
Hi,

First: thanks a lot Stella for being a bot owner and providing valuable resources to the community. The sequence of even is really unfortunate here, and thank you for bringing it up to everyone's attention, let's try to improve our processes.

On Sat, Jan 8, 2022 at 1:01 PM Philip Reames via llvm-dev <llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

Stella,

Thank you for raising the question.  This is a great discussion for us to have publicly.

So folks know, I am the individual Stella mentioned below.  I'll start with a bit of history so that everyone's on the same page, then dive into the policy question.

My general take is that buildbots are only useful if failure notifications are generally actionable.  A couple months back, I was on the edge of setting up mail filter rules to auto-delete a bunch of bots because they were regularly broken, and decided I should try to be constructive first.  In the first wave of that, I emailed a couple of bot owners about things which seemed like false positives.

At the time, I thought it was the bot owners responsibility to not be testing a flaky configuration.  I got a bit of push back on that from a couple sources - Stella was one - and put that question on hold.  This thread is a great opportunity to decide what our policy actually is, and document it.

In the meantime, I've been working with Galina to document existing practice where we could, and to try to identify best practices on setting up bots.  These changes have been posted publicly, and reviewed through the normal process.  We've been deliberately trying to stick to non-controversial stuff as we got the docs improved.  I've been actively reaching out to bot owners to gather feedback in this process, but Stella had not, yet, been one.

Separately, this week I noticed a bot which was repeatedly toggling between red and green.  I forget the exact ratio, but in the recent build history, there were multiple transitions, seemingly unrelated to the changes being committed.  I emailed Galina asking her to address, and she removed the buildbot until it could be moved to the staging buildmaster, addressed, and then restored.  I left Stella off the initial email.  Sorry about that, no ill intent, just written in a hurry.

Now, transitioning into a bit of policy discussion...

From my conversations with existing bot owners, there is a general agreement that bots should only be notifying the community if they are stable enough.  There's honest disagreement on what the bar for stable enough is, and disagreement about exactly whose responsibility addressing new instability is.  (To be clear, I'd separate instability from a clear deterministic breakage caused by a commit - we have a lot more agreement on that.)

My personal take is that for a bot to be publicly notifying, "someone" needs to take the responsibility to backstop the normal revert to green process.  This "someone" can be developers who work in a particular area, the bot owner, or some combination thereof.  I view the responsibility of the bot config owner as being the person responsible for making sure that backstopping is happening.  Not necessarily by doing it themselves, but by having the contacts with developers who can, and following up when the normal flow is not working.

In this particular example, we appear to have a bunch of flaky lldb tests.  I personally know absolutely nothing about lldb.  I have no idea whether the tests are badly designed, the system they're being run on isn't yet supported by lldb, or if there's some recent code bug introduced which causes the failure.  "Someone" needs to take the responsibility of figuring that out, and in the meantime spaming developers with inactionable failure notices seems undesirable.

I generally agree with the overall sentiment. I would add that something worse differentiating is that the source of flakiness can be coming from the bot itself (flaky hardware / fragile setup), or from the test/codebase itself (a flaky bot may just be a deterministic ASAN failure).
Of course from Philip's point of view it does not matter: the effect on the developer is similar, we get undesirable and unactionable notifications. From the maintenance flow however, it matters in that the "someone" who has to take responsibility is often not the same group of folks.
Also when encountering flaky tests, the best action may not be to disable the bot itself but instead to disable the test itself! (and file a bug against the test owner...).

One more dimension that seems to surface here may be different practices or expectations across subprojects, for example here the LLDB folks may be used to having some flaky tests, but they trigger on changes to LLVM itself, where we may not expect any flakiness (or so).

For context, the bot was disabled until it could be moved to the staging buildmaster.  Moving to staging is required (currently) to disable developer notification.  In the email from Galina, it seems clear that the bot would be fine to move back to production once the issue was triaged.  This seems entirely reasonable to me.

Something quite annoying with staging is that it does not have (as far as I know) a way to continue to notify the buildbot owner. I don't really care about staging vs prod as much as having a mode to just "not notify the blame list" / "only notify the owner".

--
Mehdi

Philip

p.s. One thing I'll note as a definite problem with the current system is that a lot of this happens in private email, and it's hard to share so that everyone has a good picture of what's going on.  It makes miscommunications all too easy.  Last time I spoke with Galina, we were tentative planning to start using github issues for bot operation matters to address that, but as that was in the middle of the transition from bugzilla, we deferred and haven't gotten back to that yet.

p.p.s. The bot in question is https://lab.llvm.org/buildbot/#/builders/83<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flab.llvm.org%2Fbuildbot%2F%23%2Fbuilders%2F83&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=TR4I18%2FuGHgNwK0PZprdHwg9gVikDWaUWEIXqDU5EQo%3D&reserved=0> if folks want to examine the history themselves.
On 1/8/22 12:06 PM, Stella Stamenova via llvm-dev wrote:
Hey all,

I have a couple of questions about what the responsibilities of a buildbot owner are. I’ve been maintaining a couple of buildbots for lldb and mlir for some time now and I thought I had a pretty good idea of what is required based on the documentation here: How To Add Your Build Configuration To LLVM Buildbot Infrastructure — LLVM 13 documentation<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.llvm.org%2Fdocs%2FHowToAddABuilder.html&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=vmuwSe4aJvyZaseAsRONqnwQT5AE2j8Fsey6n2X8aow%3D&reserved=0>

My understanding was that there are some things that are *expected* of the owner. Namely:

  1.  Make sure that the buildbot is connected and has the right infrastructure (e.g. the right version of Python, or tools, etc.). Update as needed.
  2.  Make sure that the build configuration is one that is supported (e.g. supported flavor or cmake variables). Update as needed.

There are also a couple of things that are *optional*, but nice to have:

  1.  If the buildbot stays red for a while (where “a while” is completely subjective), figure out the patch or patches that are causing an issue and either revert them or notify the authors, so they can take action.
  2.  If someone is having trouble investigating a failure that only happens on the buildbot (or the buildbot is a rare configuration), help them out (e.g. collect logs if possible).

Up to now, I’ve not had any issues with this and the community has been very good at fixing issues with builds and tests when I point them out, or more often than not, without me having to do anything but the occasional test re-run and software update (like this one, for example, ⚙ D114639 Raise the minimum Visual Studio version to VS2019 (llvm.org)<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Freviews.llvm.org%2FD114639&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ppf4tXWRAK7cf68FMTvaZqIQhkCelgDJKOrkbrhUST4%3D&reserved=0>). lldb has some tests that are flaky because of the nature of the product, so there is some noise, but mostly things work well and everyone seems happy.

I’ve recently run into a situation that makes me wonder whether there are other expectations of a buildbot owner that are not explicitly listed in the llvm documentation. Someone reached out to me some time ago to let me know their unhappiness at the flakiness of some of the lldb tests and demanded that I either fix them or disable them. I let them know that there are some tests that are known to be flaky, that my expectation is that it is not my responsibility to fix all such issues and that the community would be very happy to have their contribution in the form of a fix or a change to disable the tests. I didn’t get a response from this person, but I did disable a couple of particularly flaky tests since it seemed like the nice thing to do.

The real excitement happened yesterday when I received an email that *the build bot had been turned off*. This same person reached out to the powers that be (without letting me know) and asked them explicitly to silence it *without my active involvement* because of the flakiness.

I have a couple of issues with this approach but perhaps I’ve misunderstood what my responsibilities are as the buildbot owner. I know it is frustrating to see a bot fail because of flaky tests and it is nice to have someone to ask to resolve them all – is that really the expectation of a buildbot owner? Where is the line between maintenance of the bot and fixing build and test issues for the community?

I’d like to understand what the general expectations are and if there are things missing from the documentation, I propose that we add them, so that it is clear for everyone what is required.

Thanks,
-Stella

_______________________________________________

LLVM Developers mailing list

llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>

https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.llvm.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fllvm-dev&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=bBQTLiLtm0yM1tpg06K3l%2Fgc3qCKN7PYKJywOIsw61I%3D&reserved=0>
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.llvm.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fllvm-dev&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=bBQTLiLtm0yM1tpg06K3l%2Fgc3qCKN7PYKJywOIsw61I%3D&reserved=0>
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.llvm.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fllvm-dev&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=bBQTLiLtm0yM1tpg06K3l%2Fgc3qCKN7PYKJywOIsw61I%3D&reserved=0>
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev<https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.llvm.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fllvm-dev&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=bBQTLiLtm0yM1tpg06K3l%2Fgc3qCKN7PYKJywOIsw61I%3D&reserved=0>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20220113/45331446/attachment.html>