[PATCH] D114325: Add a best practice section on how to configure a fast builder

Mon Nov 22 11:46:00 PST 2021

reames added inline comments.

================
Comment at: llvm/docs/HowToAddABuilder.rst:235
+  impacting the broader community.  The sponsoring organization simply
+  has to take on the responsibility of all bisection and triage.
+
----------------
rengolin wrote:
> mehdi_amini wrote:
> > rengolin wrote:
> > > reames wrote:
> > > > mehdi_amini wrote:
> > > > > rengolin wrote:
> > > > > > mehdi_amini wrote:
> > > > > > > For this to be effective, we're missing (I think) the ability to get notified on specific emails for builders that are on staging.
> > > > > > > 
> > > > > > > 
> > > > > > Not really. The staging builder sole purpose is to make sure we *don't* get notified, so the bot and target owners can work in the background, before going public with a noisy infrastructure.
> > > > > > 
> > > > > > The only thing we should do in the staging master is to notify the bot owner, but even that is probably redundant, as the bots only remain in the staging master while they can't be on the production one, so bot owners will be actively looking at them, trying to make them green as soon as possible.
> > > > > > 
> > > > > > In that case, they will probably see the failure before the email hits their inbox.
> > > > > > 
> > > > > > What I mean is that it would be an anti-pattern to turn on notifications here, because bot owners may be encouraged to let their bots stay on the staging master for longer than they should, because they have some level of notification.
> > > > > > 
> > > > > > This is bad because we'll end up with a tiered quality infrastructure, where some bots warn some people, while others warn more people, or no people. This will make it harder for a random developer in the community to know what they break or not and the idea that "all bots should be green" collapses.
> > > > > > Not really. 
> > > > > 
> > > > > It seems to me that you missed the point of this section: it explicitly talks about the possibility that "builders can run indefinitely on the staging buildmaster", as well as "The sponsoring organization simply has to take on the responsibility of all bisection and triage".
> > > > > 
> > > > > > This is bad because we'll end up with a tiered quality infrastructure, where some bots warn some people, while others warn more people, or no people. 
> > > > > 
> > > > > This already exists. There are many bots that are just not on buildbots, for example https://buildkite.com/llvm-project/llvm-main/builds?branch=main or https://buildkite.com/mlir/mlir-core/builds?branch=main 
> > > > > We could also mention Chromium bots, or other people that build downstream for their own reason. 
> > > > > 
> > > > > The only question is should this "second tiered" infrastructure be hosted by our infrastructure (could be staging bot, could be a silent mode on the production buildbot) or whether you just prefer that these folks stay on their own infra (which does not change much in practice: I'm reverting patches that break my buildkite bots...).
> > > > > 
> > > > > 
> > > > > Finally, I have bots on staging for a couple of months, and haven't been confident to migrate them to prod (busy with other stuff) and I don't spend much time monitoring them. In absence of email on failures I can't build the confidence to migrate them, so they sit there for now... I would very much appreciated being notified on these bots instead of having to poll every day to figure out if they are stable enough!
> > > > > 
> > > > The capability to have a bot which notifies only the maintainer has come up multiple times.  I think we definitely need to document how to do that.  I believe we already can, but checking my notes, I didn't write down how.  Will pay attention to this in next round of conversation and see if I can figure out how to do this with current infrastructure.
> > > > 
> > > > I vaguely remember it being a mode on the main buildmaster, not staging, but will have to confirm.  
> > > > It seems to me that you missed the point of this section: it explicitly talks about the possibility that "builders can run indefinitely on the staging buildmaster"
> > > 
> > > I did not.
> > > 
> > > Some bot owners decide to keep their bots on the staging master with the knowledge that no one will know when they break, and that's the cost they pay.
> > > 
> > > > This already exists. There are many bots that are just not on buildbots, for example https://buildkite.com/llvm-project/llvm-main/builds?branch=main or https://buildkite.com/mlir/mlir-core/builds?branch=main. We could also mention Chromium bots, or other people that build downstream for their own reason.
> > > 
> > > None of those bots are "official", so it's not the same thing. People can setup whatever they want wherever they can, but that doesn't mean we (the rest of the community) have to care about their infrastructure.
> > > 
> > > Some infrastructure has been accepted, for example the Green bots, but that was only after proving that they have a stable infrastructure to begin with, providing the same level of quality we have from the main bots.
> > > 
> > > > The only question is should this "second tiered" infrastructure be hosted by our infrastructure (could be staging bot, could be a silent mode on the production buildbot) or whether you just prefer that these folks stay on their own infra (which does not change much in practice: I'm reverting patches that break my buildkite bots...).
> > > 
> > > I think you're mixing things... Staging bots are silent bots, production bots are noisy bots. There isn't much else different between them. So, silent production bot makes no sense.
> > > 
> > > Reverting patches because an unofficial bot breaks, without any warning, seems wrong to me. If those bots are important to the community, I suggest you follow the same path as Apple did with the Green bots and get those bots sanctioned to act as production/staging, so that everyone can see and be notified on the stable breakages.
> > > 
> > > There should be no second tier, in the same way that there should be no "yellow" bots (ie. known broken, in production). It's either green or red. Production red means genuine failure or it must move to the staging master.
> > > 
> > > We've been operating under that assumption for more than a decade and I think we shouldn't relax the rules just because it's difficult to maintain those promises.
> > > 
> > > > Finally, I have bots on staging for a couple of months, and haven't been confident to migrate them to prod (busy with other stuff) and I don't spend much time monitoring them. In absence of email on failures I can't build the confidence to migrate them, so they sit there for now... I would very much appreciated being notified on these bots instead of having to poll every day to figure out if they are stable enough!
> > > 
> > > This is what we were discussing about mailing the bot owner only. I think there's an option in buildbot to do that.
> > > 
> > > When I was managing the Arm bots, I created a webpage that queried the bots every 5 minutes and displayed a grid of green/red statuses and links to the failed builds. This was a way to have a clean interface with only the bots I cared about.
> > > 
> > > Most of them were in the production master, but I still wanted to know if people were acting on their emails. Some were on the staging master, and that was an ok way to check up on them, at the time.
> > I don't understand or agree with your opinion, I obviously have a very different view from you about the dynamic of how everything is setup..
> > But I'm not sure we have to argue about it though: many people have bots like that, and they contribute significantly to the project by reverting patches that breaks something that isn't tested upstream for some reasons.
> > I don't quite get why we should force people to make their bots noisy: they take on the maintenance for the benefit of everyone!
> > 
> > > Reverting patches because an unofficial bot breaks, without any warning, seems wrong to me.
> > 
> > How so?
> > How different is it that a revert because of a miscompile we detect while testing a downstream project? For example an LLVM commit in an optimization introduces a miscompile in the Rust compiler, or a miscompile with Clang when building Chromium: we consistently revert such cases as well.
> > I don't quite get why we should force people to make their bots noisy: they take on the maintenance for the benefit of everyone!
> 
> That's not what I proposed, just that it's either noisy (every one gets notified) or it's not (no one gets notified). Production / Staging.
> 
> Notifying the bot owner on staging seems a good thing to me, too.
> 
> > How different is it that a revert because of a miscompile we detect while testing a downstream project? 
> 
> Upstream bots can be more aggressive on reverts, even before notifying authors, because they build every LLVM commit and notify authors of breakages. Those are accepted by the community to have that privilege because their owners (should) have agreed on a minimum service level.
> 
> Staging/secondary bots may build every commit but they're silent and the cost is on admins, not the community. Those should work with authors before reverting their patches, regardless of what they build. If they want to be more pro-active, they should be part of the "upstream" bundle (and pay the cost).
> 
> Downstream projects (chromium, rust, android, company internal, etc.) can delay rebasing if there's a bug that needs fixing, and work with authors to fix the issue upstream.
> 
> If the above split is muddled, it shouldn't be a reason for continue doing it, but a reason to stop doing it and move to production if you want the same behaviour. Otherwise, it'd push the effort cost into the community and that's not fair.
> 
I suspect this discussion should move to another forum - e.g. llvm-dev - as it's gotten a bit off topic here.

My personal take is that reverts with a day or two of commit *from any source* are perfectly fine *provided* a reproduced test case is provided.  I think we even document that in the developer policy.  To me, the major question here is around "when should we notify", not "when should we revert".

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114325/new/

https://reviews.llvm.org/D114325