[llvm-dev] Buildbot Noise

Thu Oct 1 11:38:21 PDT 2015

Hi Renato,

Very useful thoughts, thanks. Need to think what could be done about these.

I will add few comments from my side.

Buildmaster as is configured now should send notifications on status change
only for
'successToFailure' and 'failureToSuccess' events, so always red bots should
be quiet.

Also we have group of builders (experimental_scheduled_builders) in
configuration file builders.py which also should be quiet. This is place
for noisy unstable bots.

If these features are not working properly please let me know and I will
also try to watch these.

Unfortunately buildbot currently does not distinguish test and build
failures.

I am going to be away on vacation the whole next week, but will keep an eye
on buildbot.

Thanks

Galina

On Thu, Oct 1, 2015 at 10:31 AM, Renato Golin <renato.golin at linaro.org>
wrote:

> Folks,
>
> David has been particularly militant with broken buildbots recently,
> so to make sure we don't throw the baby with the bath water, I'd like
> to propose some changes on how we deal with the emails on our
> *current* buildmaster, since there's no concrete plans to move it to
> anything else at the moment.
>
> The main issue is that managing the buildbots is not a simple task. It
> requires build owners to disable on the slave side, or specific people
> on the master side. The former can take as long as the owner wants
> (which is not nice), and the latter refreshes all active bots
> (triggering exceptions) and are harder to revert.
>
> We need to be pragmatic without re-writing the BuildBot product.
>
> Grab some popcorn...
>
> There are two main fronts that we need to discuss the noise: Bot and
> test stability.
>
>
>    1. Bot stability issues
>
> We need to distinguish between four classes of buildbots:
>
>   1.1. Fast && stable && green
>
> These buildbots normally finish under one hour, but most of the time
> under 1/2 hour and should be kept green as much as possible.
> Therefore, any reasonable noise in these bots are welcomed, since we
> want them to go back to green as soon as possible.
>
> They're normally the front-line, and usually catch most of the silly
> bugs. But we need some kind of policy that allows us to revert patches
> that break them for more than a few hours. We have an agreement
> already, and for me that's good enough. People might think
> differently.
>
> With the items 2.x below taken care of, we should keep the current
> state of our bots for this group.
>
>   1.2. One of: Slow || Unstable || Often Red
>
> These bots are special. They're normally *very* important, but have
> some issues, like slow hardware, not too many available boards, or
> they take long times to bisect and fix the bugs.
>
> These bots catch the *serious* bugs, like self-hosted Clang
> mis-compiling a long-running test which sometimes fails. They can
> produce noise, but when the noise is correct, we really need to listen
> to it. Writing software to understand that is non-trivial.
>
> So, the idea here is to have a few special treatments for each type of
> problem. For example, slow bots need more hardware to reduce the blame
> list. Unstable bot need more work to reduce spurious noise to a
> minimum (see 2.x below), and red bots *must* remain *silent* until
> they come back to green (see 2.x below).
>
> What we *don't* want is to disable or silence them after they're
> green. Most of the bugs they find are hard to debug, so the longer we
> take to fix it the harder it is to find out what happened. We need to
> know as soon as possible when they break.
>
>   1.3. Two of: Slow || Unstable || Often Red
>
> These bots are normally only important to their owners, and they are
> on the verge of being disabled. The only way to cope with those bots
> is to completely disable their emails / IRC messages, so that no one
> gets flooded with noise from broken bots.
>
> However, some bots on the 1.2 category fall into this one for short
> periods of time (~1 week), so we need to be careful with what we
> disable here. That's the key baby/bathwater issue.
>
> Any hard policy here will be wrong for some bots some of the time, so
> I'd love if we could all just trust the bot owners a bit when they say
> they're fixing the issue. However, bots that fall here for more than a
> month, or more often that a few times during a few months (I'm being
> vague on purpose), then we collectively decide to disable the bot.
>
> What I *don't* want is any two or three guys deciding to disable the
> buildbot of someone else because they can't stand the noise. Remember,
> people do take holidays once in a while, and they may be in the Amazon
> or the Sahara having well deserved rest. Returning to work and
> learning that all your bots are disabled for a week is not nice.
>
> So far, we have coped with noise, and the result is that people tend
> to ignore those bots, which means more work to the bot owner. This is
> not a good situation, and we want to move away from it, but we
> shouldn't flip all switches off by default. We can still be pragmatic
> about this as long as we improve the quality overall (see 2.x below)
> with time.
>
> In summary, bots that fall here for too long will have their emails
> disabled and candidates for removal in the next spring clean-up, but
> not immediately.
>
>   1.4. Slow && Unstable && Red
>
> These bots don't belong here. They should be moved elsewhere,
> preferably to a local buildmaster that you can control and that will
> never email people or upset our master if you need changes. I have
> such a local master myself and it's very easy to setup and maintain.
>
> They *do* have value to *you*, for example to show the progress of
> your features cleaning up the failures, or generating some benchmark
> numbers, but that's something that is very specific to your project
> and should remain separated.
>
> Any of these bots in LLVM Lab should be moved away / removed, but on
> consensus, including the bot owner if he/she is still available in the
> list.
>
>
>    2. Test stability issues
>
> These issues, as you may have noticed from the links above, apply to
> *all* bots. The less noise we have overall, the lower will be our
> threshold for kicking bots out of the critical pool, and the higher
> the value of the not-so-perfect buildbots to the rest of the
> community.
>
>   2.1 Failed vs Exception
>
> The most critical issue we have to fix is the "red -> exception ->
> red" issue. Basically, a bot is red (because you're still
> investigating the problem), then someone restarts the master, so you
> get an exception. The next build will be a failure, and the
> buildmaster recognises the status change and emails everyone. That's
> just wrong.
>
> We need to add an extra check to that logic where it searches down for
> the next non-exceptional status and compares to that, not just the
> immediately previous result.
>
> This is a no-brainer and I don't think anyone would be against it. I
> just don't know where this is done, I welcome the knowledge of more
> experienced folks.
>
>   2.2 Failure types
>
> The next obvious thing is to detect what the error is. If it's an SVN
> error, we *really* don't need to get an email. But this raises the
> problem that an SVN failure followed by a genuine failure will not be
> reported. So, the reporting mechanism also has to know what's the
> previously *reported* failure, not just the previous failure.
>
> Other failures, like timeout, can be either flaky hardware or broken
> codegen. A way to be conservative and low noise would be to only warn
> on timeouts IFF it's the *second* in a row.
>
> For all these adjustments, we'll need some form of walk-back on the
> history to find the previous genuine result, and we'll need to mark
> results with some metadata. This may involve some patches to buildbot.
>
>   2.3 Detecting fixed bots
>
> Another interesting feature, that is present in the "GreenBot" is a
> warning when a bot you broke was fixed. That, per se, is not a good
> idea if the noise levels are high, since this will probably double it.
>
> So, this feature can only be introduced *after* we've done the clean
> ups above. But once it's clean, having a "green" email will put the
> minds of everyone that haven't seen the "red" email yet to rest, as
> they now know they don't even need to look at it at all, just delete
> the email.
>
> For those using fetchmail, I'm sure you could create a rule to do that
> automatically, but that's optional. :)
>
>   2.4 Detecting new failures
>
> This is a wish-list that I have, for the case where the bots are slow
> and hard to debug and are still red. Assuming everything above is
> fixed, they will emit no noise until they go green again, however,
> while I'm debugging the first problem, others can appear. If that
> happens, *I* want to know, but not necessarily everyone else.
>
> So, a list of problems reported could be added to the failure report,
> and if the failure is different, the bot owner gets an email. This
> would have to play nice with exception statuses, as well as spurious
> failures like SVN or timeouts, so it's not an easy patch.
>
> The community at large would be already happy with all the changes
> minus this one, but folks that have to maintain slow hardware like me
> would appreciate this feature. :)
>
>
>
> Does any one have more concerns?
>
> AFAICS, we should figure out where the walk-back code needs to be
> inserted and that would get us 90% of the way. The other 10% will be
> to list all the buildbots, check their statuses, owners, and map into
> those categories, and take the appropriate action.
>
> Maybe we should also reduce the noise in the IRC channel further (like
> only first red, first green), but that's not my primary concern right
> now. Feel free to look into it if it is for you.
>
> cheers,
> --renato
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20151001/00fcbabc/attachment.html>