[llvm-dev] Buildbot Noise

Fri Oct 2 06:49:05 PDT 2015

On 1 October 2015 at 23:01, Philip Reames <listmail at philipreames.com> wrote:
> "unstable" these should be removed immediately.  If the failure rate is more
> than 1 in 5 builds of a known clean revision, that's far too much noise to
> be notifying.  To be clear, I'm specifically referring to spurious
> *failures* not environmental factors which are global to all bots.

There are some bugs that introduce intermittent behaviour, and it
would be very bad if we just disabled the bots that warned us about
them. Some genuine bugs in Clang or the sanitizers can come and go if
they depend on where the objects are stored in memory, or if the block
happens to be aligned or not.

One example is Clang's inability to cope with alignment when using its
own version of placement new for derived classes. Our ARM bots have
been warning about these bugs for more than a year, and we have fixed
most of them. If we had disabled the ARM bots the first time they
became "unstable", we would still have those problems and we wouldn't
be testing on ARM any more. Two very bad outcomes.

We have to protect ourselves from assuming too much, too early.

> "often red" these are extremely valuable (msan, etc..).  Assuming we only
> notify on green->red, the only rule we should likely enforce is that each
> bot has been green "recently".  I'd suggest a threshold of 2 months.  If it
> hasn't been green in 2 months, it's not really a build bot.

This is the case I make in 1.4. If the bot is assumed to be red
because someone is slowly fixing its problems, then this bot belongs
on a separate buildmaster.
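
For reference, the "green recently" rule quoted above could be checked
with something like the following. This is only a rough sketch: the
build history format (a list of (finish_time, passed) pairs) and the
function name are hypothetical, not anything the buildmaster exposes,
and "2 months" is approximated as 60 days.

```python
from datetime import datetime, timedelta

# Hypothetical record format: one (finish_time, passed) pair per build.
# "2 months" is approximated as 60 days; that default is an assumption.
def green_recently(builds, now, window=timedelta(days=60)):
    """Return True if at least one build inside the window passed."""
    return any(
        passed and timedelta(0) <= now - finished <= window
        for finished, passed in builds
    )
```

A bot failing this check would be a candidate for the separate
buildmaster rather than for removal.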

However, slow bots tend to be red for longer periods of time, not
necessarily for a larger number of builds. OTOH, fast bots can be red
for a very large number of builds, but go green immediately when a
revert is applied. So we need to be careful with timings here.
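
To make the builds-versus-time distinction concrete, here is a minimal
sketch that measures the longest red streak both ways. The input format
(a list of (start_time, passed) pairs, oldest first) and the function
name are assumptions made for illustration only.

```python
from datetime import datetime, timedelta

def longest_red_streak(builds):
    """Measure red streaks in builds and in wall-clock time.

    builds: hypothetical list of (start_time, passed), oldest first.
    Returns (max_builds, max_duration). A streak's duration runs from
    its first red build to the next green build, or to the last build
    seen if the streak is still open. The two maxima are tracked
    independently and may come from different streaks.
    """
    best_count, best_span = 0, timedelta(0)
    count, streak_start = 0, None
    for when, passed in builds:
        if passed:
            if count:
                # A green build closes the current red streak.
                best_count = max(best_count, count)
                best_span = max(best_span, when - streak_start)
            count, streak_start = 0, None
        else:
            if count == 0:
                streak_start = when
            count += 1
    if count:
        # The history ends while the bot is still red.
        best_count = max(best_count, count)
        best_span = max(best_span, builds[-1][0] - streak_start)
    return best_count, best_span
```

A bot that builds every 12 hours and a bot that builds every 10 minutes
can rank in opposite order depending on which of the two numbers a
threshold is applied to.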

> So, maybe I'm missing something, but: why is it any harder to bring a
> silence bot green than an emailing one?

It's not. But keeping it green later is, because it takes time to
change the buildmaster. For obvious reasons, not all of us have access
to the buildmaster, meaning we depend on the few people that work on
it directly to move things around.

By adding the uncertainty of commits breaking the build to the
uncertainty of when the master will be updated, you can easily fall
into a deadlock. I have been in situations where, within a period of
two weeks, I had to bring one bot from red to green 5 times. If, in
between, someone had set that bot to not warn, it could have taken me
longer to notice, and every new failure on top of the original makes
the process non-linearly more complex, especially if whoever fixed the
bot is committing loads of patches to try to fix the mess. Reverting
two sequences of interleaved patches independently is more than twice
as hard as reverting one sequence, and so on.

I think if we had different public masters, and if the bot owner had
the responsibility to move between them, that could work well, since
moving masters is in the owners' power, while moving groups in the
master is not.

We can then reserve the decision to disable the bot in the master for
more drastic cases, when the bot owner is unresponsive or
uncooperative.

cheers,
--renato