[PATCH] D114325: Add a best practice section on how to configure a fast builder

Mon Nov 22 10:05:42 PST 2021

rengolin added inline comments.

================
Comment at: llvm/docs/HowToAddABuilder.rst:222
+  well, and that having local per-worker caches gets most of the benefit
+  anyways.  We don't currently recommend shared caches.
+
----------------
reames wrote:
> rengolin wrote:
> > Absolutely! Unless you can control everything on all users' environments (toolchain, libs, OS, CMake config), you can't have a generic cache shared amongst builders.
> > 
> > It is possible to have a shared build cache, but managing such a cache is non-trivial and not the goal of our buildbot infrastructure.
> > 
> > Of course, bot owners can, if they want, set up such a cache, but it's an exercise for the reader, not the general recommendation.
> I don't understand your comment.  The text you replied to was specific to a cache shared between the workers of a single builder. Did you maybe think this was more broadly scoped?
I did, but to be honest, this can also hit a single builder. For example, when the CMake files change from one commit to the next, two bots on the same builder can end up building with different configurations.
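
To illustrate the concern, here is a toy sketch in Python. It is not how ccache or sccache actually compute their keys, and `cache_key` and its parameters are hypothetical names made up for this example. The point is only that a shared cache is safe when the key covers everything that affects the output, including the CMake configuration; if the key leaves the configuration out, two bots on different configurations would share an entry.

```python
import hashlib


def cache_key(source_text, compiler_id, flags, cmake_config):
    """Hypothetical cache key: a hash over every input that affects the output.

    source_text:  contents of the file being compiled
    compiler_id:  e.g. compiler name plus version
    flags:        list of command-line flags, in order
    cmake_config: dict of CMake cache variables (name -> value)
    """
    h = hashlib.sha256()
    for part in (source_text, compiler_id, *flags):
        h.update(part.encode())
        h.update(b"\0")  # separator so adjacent fields cannot run together
    # Sort the CMake variables so that dict ordering does not change the key.
    for name in sorted(cmake_config):
        h.update(f"{name}={cmake_config[name]}".encode())
        h.update(b"\0")
    return h.hexdigest()


# Two bots on the same builder, one commit apart, where the second commit
# flipped a CMake option (the option name is only an example).
before = cache_key("int main() {}", "clang-13", ["-O2"], {"LLVM_ENABLE_ASSERTIONS": "OFF"})
after = cache_key("int main() {}", "clang-13", ["-O2"], {"LLVM_ENABLE_ASSERTIONS": "ON"})
print(before != after)
```

With the configuration in the key, the two bots simply miss each other's entries; without it, one could pick up objects built for the other's configuration.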

================
Comment at: llvm/docs/HowToAddABuilder.rst:235
+  impacting the broader community.  The sponsoring organization simply
+  has to take on the responsibility of all bisection and triage.
+
----------------
reames wrote:
> mehdi_amini wrote:
> > rengolin wrote:
> > > mehdi_amini wrote:
> > > > For this to be effective, we're missing (I think) the ability to get notified on specific emails for builders that are on staging.
> > > > 
> > > > 
> > > Not really. The staging builder's sole purpose is to make sure we *don't* get notified, so the bot and target owners can work in the background, before going public with a noisy infrastructure.
> > > 
> > > The only thing we should do in the staging master is to notify the bot owner, but even that is probably redundant, as the bots only remain in the staging master while they can't be on the production one, so bot owners will be actively looking at them, trying to make them green as soon as possible.
> > > 
> > > In that case, they will probably see the failure before the email hits their inbox.
> > > 
> > > What I mean is that it would be an anti-pattern to turn on notifications here, because bot owners may be encouraged to let their bots stay on the staging master for longer than they should, because they have some level of notification.
> > > 
> > > This is bad because we'll end up with a tiered quality infrastructure, where some bots warn some people, while others warn more people, or no people. This will make it harder for a random developer in the community to know what they break or not and the idea that "all bots should be green" collapses.
> > > Not really. 
> > 
> > It seems to me that you missed the point of this section: it explicitly talks about the possibility that "builders can run indefinitely on the staging buildmaster", as well as "The sponsoring organization simply has to take on the responsibility of all bisection and triage".
> > 
> > > This is bad because we'll end up with a tiered quality infrastructure, where some bots warn some people, while others warn more people, or no people. 
> > 
> > This already exists. There are many bots that are just not on buildbots, for example https://buildkite.com/llvm-project/llvm-main/builds?branch=main or https://buildkite.com/mlir/mlir-core/builds?branch=main 
> > We could also mention Chromium bots, or other people that build downstream for their own reasons.
> > 
> > The only question is whether this "second tiered" infrastructure should be hosted by our infrastructure (could be staging bot, could be a silent mode on the production buildbot) or whether you would just prefer that these folks stay on their own infra (which does not change much in practice: I'm reverting patches that break my buildkite bots...).
> > 
> > 
> > Finally, I have had bots on staging for a couple of months, and haven't been confident enough to migrate them to prod (busy with other stuff), and I don't spend much time monitoring them. In the absence of email on failures I can't build the confidence to migrate them, so they sit there for now... I would very much appreciate being notified on these bots instead of having to poll every day to figure out if they are stable enough!
> > 
> The capability to have a bot which notifies only the maintainer has come up multiple times.  I think we definitely need to document how to do that.  I believe we already can, but checking my notes, I didn't write down how.  Will pay attention to this in the next round of conversation and see if I can figure out how to do this with the current infrastructure.
> 
> I vaguely remember it being a mode on the main buildmaster, not staging, but will have to confirm.  
> It seems to me that you missed the point of this section: it explicitly talks about the possibility that "builders can run indefinitely on the staging buildmaster"

I did not.

Some bot owners decide to keep their bots on the staging master with the knowledge that no one will know when they break, and that's the cost they pay.

> This already exists. There are many bots that are just not on buildbots, for example https://buildkite.com/llvm-project/llvm-main/builds?branch=main or https://buildkite.com/mlir/mlir-core/builds?branch=main. We could also mention Chromium bots, or other people that build downstream for their own reasons.

None of those bots are "official", so it's not the same thing. People can set up whatever they want wherever they can, but that doesn't mean we (the rest of the community) have to care about their infrastructure.

Some infrastructure has been accepted, for example the Green bots, but that was only after proving that they have a stable infrastructure to begin with, providing the same level of quality we have from the main bots.

> The only question is whether this "second tiered" infrastructure should be hosted by our infrastructure (could be staging bot, could be a silent mode on the production buildbot) or whether you would just prefer that these folks stay on their own infra (which does not change much in practice: I'm reverting patches that break my buildkite bots...).

I think you're mixing things up... Staging bots are silent bots, production bots are noisy bots. There isn't much else different between them. So a silent production bot makes no sense.

Reverting patches because an unofficial bot breaks, without any warning, seems wrong to me. If those bots are important to the community, I suggest you follow the same path as Apple did with the Green bots and get those bots sanctioned to act as production/staging, so that everyone can see and be notified on the stable breakages.

There should be no second tier, in the same way that there should be no "yellow" bots (i.e., known broken, in production). It's either green or red. A red bot in production means either a genuine failure or that the bot must move to the staging master.

We've been operating under that assumption for more than a decade and I think we shouldn't relax the rules just because it's difficult to maintain those promises.

> Finally, I have had bots on staging for a couple of months, and haven't been confident enough to migrate them to prod (busy with other stuff), and I don't spend much time monitoring them. In the absence of email on failures I can't build the confidence to migrate them, so they sit there for now... I would very much appreciate being notified on these bots instead of having to poll every day to figure out if they are stable enough!

This is what we were discussing about mailing the bot owner only. I think there's an option in buildbot to do that.

When I was managing the Arm bots, I created a webpage that queried the bots every 5 minutes and displayed a grid of green/red statuses and links to the failed builds. This was a way to have a clean interface with only the bots I cared about.

Most of them were in the production master, but I still wanted to know if people were acting on their emails. Some were on the staging master, and that was an ok way to check up on them, at the time.
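
For anyone wanting something similar, here is a rough sketch in Python of that kind of poller. It is not the original page: `render_grid`, `watch`, the `fetch_statuses` callback, and the builder names and URLs in the usage example are all hypothetical, and how the statuses are actually fetched from the buildmaster is deliberately left out.

```python
import time


def render_grid(statuses, columns=3):
    """Render (name, is_green, url) tuples as rows of text cells.

    Failed builds are listed with their links underneath the grid.
    """
    cells = [f"[{'GREEN' if ok else 'RED  '}] {name}" for name, ok, _ in statuses]
    rows = [" | ".join(cells[i:i + columns]) for i in range(0, len(cells), columns)]
    failed = [f"  {name}: {url}" for name, ok, url in statuses if not ok]
    return "\n".join(rows + (["Failed builds:"] + failed if failed else []))


def watch(fetch_statuses, interval_seconds=300, rounds=None, sleep=time.sleep, out=print):
    """Call fetch_statuses() every interval_seconds and print the grid.

    fetch_statuses is a caller-supplied function (an assumption of this
    sketch) returning a list of (name, is_green, url) tuples for the bots
    you care about. rounds=None polls forever.
    """
    done = 0
    while rounds is None or done < rounds:
        out(render_grid(fetch_statuses()))
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_seconds)


# One-off usage with made-up data instead of a real buildmaster query.
example = [
    ("bot-a", True, "https://example.invalid/bot-a/1"),
    ("bot-b", False, "https://example.invalid/bot-b/7"),
]
print(render_grid(example))
```

The 300 second default matches the 5 minute interval mentioned above; the same loop works for bots on either master, since it only depends on what the callback returns.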

https://reviews.llvm.org/D114325