[llvm-dev] Buildbot Noise

Thu Oct 1 10:31:56 PDT 2015

Folks,

David has been particularly militant with broken buildbots recently,
so to make sure we don't throw the baby out with the bath water, I'd
like to propose some changes on how we deal with the emails on our
*current* buildmaster, since there are no concrete plans to move it to
anything else at the moment.

The main issue is that managing the buildbots is not a simple task. It
requires build owners to disable them on the slave side, or specific
people to do so on the master side. The former can take as long as the
owner wants (which is not nice), and the latter refreshes all active
bots (triggering exceptions) and is harder to revert.

We need to be pragmatic without re-writing the BuildBot product.

Grab some popcorn...

There are two main fronts on which we need to discuss the noise: bot
and test stability.

   1. Bot stability issues

We need to distinguish between four classes of buildbots:

  1.1. Fast && stable && green

These buildbots normally finish in under one hour, most of the time in
under half an hour, and should be kept green as much as possible.
Therefore, any reasonable noise from these bots is welcome, since we
want them to go back to green as soon as possible.

They're normally the front-line, and usually catch most of the silly
bugs. But we need some kind of policy that allows us to revert patches
that break them for more than a few hours. We have an agreement
already, and for me that's good enough. People might think
differently.

With the items 2.x below taken care of, we should keep the current
state of our bots for this group.

  1.2. One of: Slow || Unstable || Often Red

These bots are special. They're normally *very* important, but have
some issues, like slow hardware, not too many available boards, or
bugs that take a long time to bisect and fix.

These bots catch the *serious* bugs, like self-hosted Clang
mis-compiling a long-running test which sometimes fails. They can
produce noise, but when the noise is correct, we really need to listen
to it. Writing software to understand that is non-trivial.

So, the idea here is to have a few special treatments for each type of
problem. For example, slow bots need more hardware to reduce the blame
list. Unstable bots need more work to reduce spurious noise to a
minimum (see 2.x below), and red bots *must* remain *silent* until
they come back to green (see 2.x below).

What we *don't* want is to disable or silence them once they're green
again. Most of the bugs they find are hard to debug, so the longer we
take to fix them the harder it is to find out what happened. We need
to know as soon as possible when they break.

  1.3. Two of: Slow || Unstable || Often Red

These bots are normally only important to their owners, and they are
on the verge of being disabled. The only way to cope with those bots
is to completely disable their emails / IRC messages, so that no one
gets flooded with noise from broken bots.

However, some bots in the 1.2 category fall into this one for short
periods of time (~1 week), so we need to be careful with what we
disable here. That's the key baby/bathwater issue.

Any hard policy here will be wrong for some bots some of the time, so
I'd love it if we could all just trust the bot owners a bit when they
say they're fixing the issue. However, if a bot falls here for more
than a month, or more often than a few times during a few months (I'm
being vague on purpose), then we collectively decide to disable it.

What I *don't* want is any two or three guys deciding to disable the
buildbot of someone else because they can't stand the noise. Remember,
people do take holidays once in a while, and they may be in the Amazon
or the Sahara having well deserved rest. Returning to work and
learning that all your bots are disabled for a week is not nice.

So far, we have coped with noise, and the result is that people tend
to ignore those bots, which means more work for the bot owner. This is
not a good situation, and we want to move away from it, but we
shouldn't flip all switches off by default. We can still be pragmatic
about this as long as we improve the overall quality over time (see
2.x below).

In summary, bots that fall here for too long will have their emails
disabled and become candidates for removal in the next spring
clean-up, but not immediately.

  1.4. Slow && Unstable && Red

These bots don't belong here. They should be moved elsewhere,
preferably to a local buildmaster that you can control and that will
never email people or upset our master if you need changes. I have
such a local master myself and it's very easy to set up and maintain.

They *do* have value to *you*, for example to show the progress of
your features cleaning up the failures, or generating some benchmark
numbers, but that's something that is very specific to your project
and should remain separated.

Any of these bots in LLVM Lab should be moved away / removed, but only
by consensus, including the bot owner if he/she is still available on
the list.
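
As a rough illustration, the four classes above boil down to counting
how many of the three problems (slow, unstable, often red) a bot has.
This is only a sketch: the function name, the flags and the action
strings are made up for this example and do not exist in BuildBot.

```python
# Hypothetical helper: maps the three problem flags onto classes 1.1-1.4.
ACTIONS = {
    "1.1": "keep emails on, revert breaking patches after a few hours",
    "1.2": "keep emails on, stay silent while red, fix the root cause",
    "1.3": "disable emails / IRC, candidate for removal if it persists",
    "1.4": "move to a local buildmaster, by consensus",
}


def classify_bot(slow, unstable, often_red):
    """Return the class ("1.1" to "1.4") for a bot with the given flags."""
    problems = sum([bool(slow), bool(unstable), bool(often_red)])
    return ["1.1", "1.2", "1.3", "1.4"][problems]


if __name__ == "__main__":
    bot_class = classify_bot(slow=True, unstable=False, often_red=False)
    print(bot_class, "->", ACTIONS[bot_class])
```

The flags themselves still need human judgement (and some trust in
the bot owner), so this only covers the mapping step.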

   2. Test stability issues

These issues, as you may have noticed from the references above, apply
to *all* bots. The less noise we have overall, the lower our threshold
for kicking bots out of the critical pool will be, and the higher the
value of the not-so-perfect buildbots to the rest of the community.

  2.1 Failed vs Exception

The most critical issue we have to fix is the "red -> exception ->
red" issue. Basically, a bot is red (because you're still
investigating the problem), then someone restarts the master, so you
get an exception. The next build will be a failure, and the
buildmaster recognises the status change and emails everyone. That's
just wrong.

We need to add an extra check to that logic where it searches down for
the next non-exceptional status and compares to that, not just the
immediately previous result.
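
A minimal sketch of that walk-back, assuming the build history is
available as a simple list of status strings. The status names and
both functions are invented for illustration; they are not BuildBot's
actual API, and where this would hook in is still the open question.

```python
# Hypothetical status values, not BuildBot's own constants.
SUCCESS, FAILURE, EXCEPTION = "success", "failure", "exception"


def previous_real_result(history):
    """Walk back through history (oldest first), skipping exceptions."""
    for result in reversed(history):
        if result != EXCEPTION:
            return result
    return None  # no genuine result recorded yet


def should_email(history, current):
    """Email only when a genuine failure follows a non-failure result."""
    if current != FAILURE:
        return False
    return previous_real_result(history) != FAILURE


if __name__ == "__main__":
    # red -> exception -> red: no new email
    print(should_email([FAILURE, EXCEPTION], FAILURE))
    # green -> exception -> red: email
    print(should_email([SUCCESS, EXCEPTION], FAILURE))
```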

This is a no-brainer and I don't think anyone would be against it. I
just don't know where this is done, so I welcome the knowledge of more
experienced folks.

  2.2 Failure types

The next obvious thing is to detect what the error is. If it's an SVN
error, we *really* don't need to get an email. But this raises the
problem that a genuine failure following an SVN failure will not be
reported. So, the reporting mechanism also has to know what the
previously *reported* failure was, not just the previous failure.

Other failures, like timeouts, can be either flaky hardware or broken
codegen. A way to be conservative and low noise would be to warn on a
timeout only if it's the *second* in a row.

For all these adjustments, we'll need some form of walk-back on the
history to find the previous genuine result, and we'll need to mark
results with some metadata. This may involve some patches to buildbot.
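
To make the rules above concrete, here is a small sketch that replays
a sequence of build results and decides which ones deserve an email.
The result kinds ("svn_error", "timeout", and so on) are hypothetical
metadata labels chosen for this example, not something BuildBot
provides today.

```python
# Hypothetical result kinds; real bots would need metadata to tell them apart.
INFRASTRUCTURE = ("svn_error", "exception")


def plan_emails(kinds):
    """Return one boolean per build: True if that build should email."""
    emails = []
    reported_red = False  # was a failure already emailed since last green?
    previous = None
    for kind in kinds:
        send = False
        if kind == "success":
            reported_red = False
        elif kind in INFRASTRUCTURE:
            pass  # never email, and never counts as "reported"
        elif kind == "timeout":
            # conservative: only the second timeout in a row is reported
            send = previous == "timeout" and not reported_red
        else:
            # genuine failure: report unless one was already reported
            send = not reported_red
        if send:
            reported_red = True
        emails.append(send)
        previous = kind
    return emails


if __name__ == "__main__":
    print(plan_emails(["success", "svn_error", "failure", "failure"]))
```

Tracking what was *reported*, rather than what the previous build
was, is what lets the genuine failure after the SVN error through.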

  2.3 Detecting fixed bots

Another interesting feature, which is present in the "GreenBot", is a
warning when a bot you broke was fixed. That, per se, is not a good
idea if the noise levels are high, since this will probably double it.

So, this feature can only be introduced *after* we've done the clean
ups above. But once it's clean, having a "green" email will put the
mind of everyone who hasn't seen the "red" email yet to rest, as they
now know they don't even need to look at it at all, just delete the
email.

For those using fetchmail, I'm sure you could create a rule to do that
automatically, but that's optional. :)
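
One possible shape for the "green" email, sketched under the
assumption that the notifier remembers who was sent the "red" email.
The class and method names are hypothetical and not taken from
BuildBot or GreenBot.

```python
class GreenNotifier:
    """Hypothetical tracker: who got a red email and is owed a green one."""

    def __init__(self):
        self.notified = set()

    def record_red_email(self, recipients):
        """Remember everyone who was emailed about the current breakage."""
        self.notified.update(recipients)

    def on_green(self):
        """Return the people to tell about the fix, then forget them."""
        recipients = sorted(self.notified)
        self.notified.clear()
        return recipients


if __name__ == "__main__":
    notifier = GreenNotifier()
    notifier.record_red_email(["alice", "bob"])
    print(notifier.on_green())
```

Only people who actually received the red email get the green one, so
the extra traffic is bounded by the red emails already sent.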

  2.4 Detecting new failures

This is a wish-list item that I have, for the case where the bots are
slow and hard to debug and are still red. Assuming everything above is
fixed, they will emit no noise until they go green again. However,
while I'm debugging the first problem, others can appear. If that
happens, *I* want to know, but not necessarily everyone else.

So, a list of problems reported could be added to the failure report,
and if the failure is different, the bot owner gets an email. This
would have to play nice with exception statuses, as well as spurious
failures like SVN or timeouts, so it's not an easy patch.
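
The core of it could be a plain set difference between the failures
already reported and the ones seen in the latest build. This sketch
assumes failures can be identified by name (for example, test names),
and ignores the exception / SVN / timeout filtering, which would have
to happen first. The function name is invented for this example.

```python
def new_failures_for_owner(already_reported, failing_now):
    """Return failures the bot owner has not been told about yet."""
    return sorted(set(failing_now) - set(already_reported))


if __name__ == "__main__":
    reported = {"test-a"}
    latest = {"test-a", "test-b"}
    fresh = new_failures_for_owner(reported, latest)
    if fresh:
        # only the bot owner would be emailed here, not the blame list
        print("owner-only email about:", fresh)
```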

The community at large would already be happy with all the changes
minus this one, but folks that have to maintain slow hardware like me
would appreciate this feature. :)

Does anyone have more concerns?

AFAICS, we should figure out where the walk-back code needs to be
inserted and that would get us 90% of the way. The other 10% will be
to list all the buildbots, check their statuses, owners, and map into
those categories, and take the appropriate action.

Maybe we should also reduce the noise in the IRC channel further (like
only first red, first green), but that's not my primary concern right
now. Feel free to look into it if it is for you.

cheers,
--renato