<div dir="ltr">Hi Renato,<br><br>Very useful thoughts, thanks. I need to think about what could be done about these.<br><br>I will add a few comments from my side.<br><br>The buildmaster, as configured now, should send notifications on status change only for 'successToFailure' and 'failureToSuccess' events, so always-red bots should be quiet.<br><br>We also have a group of builders (experimental_scheduled_builders) in the configuration file builders.py, which should be quiet as well. This is the place for noisy, unstable bots.<br><br>If these features are not working properly, please let me know and I will also try to watch these.<br><br>Unfortunately, buildbot currently does not distinguish between test and build failures.<br><br>I am going to be away on vacation all of next week, but will keep an eye on buildbot.<br><br>Thanks<br><br>Galina<br><br><div class="gmail_extra"><br><div class="gmail_quote">On Thu, Oct 1, 2015 at 10:31 AM, Renato Golin <span dir="ltr">&lt;<a href="mailto:renato.golin@linaro.org" target="_blank">renato.golin@linaro.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Folks,<br>

<br>

David has been particularly militant with broken buildbots recently,<br>

so to make sure we don't throw the baby out with the bath water, I'd like<br>

to propose some changes on how we deal with the emails on our<br>

*current* buildmaster, since there are no concrete plans to move it to<br>

anything else at the moment.<br>

<br>

The main issue is that managing the buildbots is not a simple task. It<br>

requires build owners to disable them on the slave side, or specific people<br>

on the master side. The former can take as long as the owner wants<br>

(which is not nice), and the latter refreshes all active bots<br>

(triggering exceptions) and is harder to revert.<br>

<br>

We need to be pragmatic without re-writing the BuildBot product.<br>

<br>

Grab some popcorn...<br>

<br>

There are two main fronts on which we need to discuss the noise: bot and<br>

test stability.<br>

<br>

<br>

   1. Bot stability issues<br>

<br>

We need to distinguish between four classes of buildbots:<br>

<br>

  1.1. Fast && stable && green<br>

<br>

These buildbots normally finish in under an hour, and most of the time<br>

in under half an hour, and should be kept green as much as possible.<br>

Therefore, any reasonable noise from these bots is welcome, since we<br>

want them to go back to green as soon as possible.<br>

<br>

They're normally the front-line, and usually catch most of the silly<br>

bugs. But we need some kind of policy that allows us to revert patches<br>

that break them for more than a few hours. We have an agreement<br>

already, and for me that's good enough. People might think<br>

differently.<br>

<br>

With the items 2.x below taken care of, we should keep the current<br>

state of our bots for this group.<br>

<br>

  1.2. One of: Slow || Unstable || Often Red<br>

<br>

These bots are special. They're normally *very* important, but have<br>

some issues, like slow hardware, not too many available boards, or<br>

they take a long time to bisect and fix the bugs.<br>

<br>

These bots catch the *serious* bugs, like self-hosted Clang<br>

mis-compiling a long-running test which sometimes fails. They can<br>

produce noise, but when the noise is correct, we really need to listen<br>

to it. Writing software to understand that is non-trivial.<br>

<br>

So, the idea here is to have a few special treatments for each type of<br>

problem. For example, slow bots need more hardware to reduce the blame<br>

list. Unstable bots need more work to reduce spurious noise to a<br>

minimum (see 2.x below), and red bots *must* remain *silent* until<br>

they come back to green (see 2.x below).<br>

<br>

What we *don't* want is to disable or silence them after they're<br>

green. Most of the bugs they find are hard to debug, so the longer we<br>

take to fix it the harder it is to find out what happened. We need to<br>

know as soon as possible when they break.<br>

<br>

  1.3. Two of: Slow || Unstable || Often Red<br>

<br>

These bots are normally only important to their owners, and they are<br>

on the verge of being disabled. The only way to cope with those bots<br>

is to completely disable their emails / IRC messages, so that no one<br>

gets flooded with noise from broken bots.<br>

<br>

However, some bots in the 1.2 category fall into this one for short<br>

periods of time (~1 week), so we need to be careful with what we<br>

disable here. That's the key baby/bathwater issue.<br>

<br>

Any hard policy here will be wrong for some bots some of the time, so<br>

I'd love it if we could all just trust the bot owners a bit when they say<br>

they're fixing the issue. However, if a bot falls here for more than a<br>

month, or more often than a few times during a few months (I'm being<br>

vague on purpose), then we collectively decide to disable the bot.<br>

<br>

What I *don't* want is any two or three guys deciding to disable the<br>

buildbot of someone else because they can't stand the noise. Remember,<br>

people do take holidays once in a while, and they may be in the Amazon<br>

or the Sahara having well deserved rest. Returning to work and<br>

learning that all your bots are disabled for a week is not nice.<br>

<br>

So far, we have coped with noise, and the result is that people tend<br>

to ignore those bots, which means more work for the bot owner. This is<br>

not a good situation, and we want to move away from it, but we<br>

shouldn't flip all switches off by default. We can still be pragmatic<br>

about this as long as we improve the quality overall (see 2.x below)<br>

with time.<br>

<br>

In summary, bots that fall here for too long will have their emails<br>

disabled and become candidates for removal in the next spring clean-up, but<br>

not immediately.<br>

<br>

  1.4. Slow && Unstable && Red<br>

<br>

These bots don't belong here. They should be moved elsewhere,<br>

preferably to a local buildmaster that you can control and that will<br>

never email people or upset our master if you need changes. I have<br>

such a local master myself and it's very easy to set up and maintain.<br>

<br>

They *do* have value to *you*, for example to show the progress of<br>

your features cleaning up the failures, or generating some benchmark<br>

numbers, but that's something that is very specific to your project<br>

and should remain separated.<br>

<br>

Any of these bots in LLVM Lab should be moved away / removed, but on<br>

consensus, including the bot owner if he/she is still available on the<br>

list.<br>

<br>

<br>

   2. Test stability issues<br>

<br>

These issues, as you may have noticed from the links above, apply to<br>

*all* bots. The less noise we have overall, the lower will be our<br>

threshold for kicking bots out of the critical pool, and the higher<br>

the value of the not-so-perfect buildbots to the rest of the<br>

community.<br>

<br>

  2.1 Failed vs Exception<br>

<br>

The most critical issue we have to fix is the "red -> exception -><br>

red" issue. Basically, a bot is red (because you're still<br>

investigating the problem), then someone restarts the master, so you<br>

get an exception. The next build will be a failure, and the<br>

buildmaster recognises the status change and emails everyone. That's<br>

just wrong.<br>

<br>

We need to add an extra check to that logic where it searches down for<br>

the next non-exceptional status and compares to that, not just the<br>

immediately previous result.<br>
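<br>
As a rough sketch of that walk-back (the status names and function names<br>
here are made up for illustration and are not Buildbot's API, and I'm<br>
assuming an exception build by itself stays silent):<br>

```python
# Hypothetical status names for illustration; not Buildbot's own constants.
SUCCESS, FAILURE, EXCEPTION = "success", "failure", "exception"


def previous_real_result(history):
    """Return the most recent non-exception result, or None if there is none.

    `history` is ordered oldest to newest and excludes the current build.
    """
    for result in reversed(history):
        if result != EXCEPTION:
            return result
    return None


def should_notify(history, current):
    """Notify only when the status genuinely flips between success and failure."""
    if current == EXCEPTION:
        # Assumption: exceptions (e.g. a master restart) never send email.
        return False
    previous = previous_real_result(history)
    if previous is None:
        # No usable history: only a failure is worth an email.
        return current == FAILURE
    return previous != current
```

With this, "red, exception, red" compares the last red against the first<br>
red and stays quiet, while "green, exception, red" still sends the email.<br>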

<br>

This is a no-brainer and I don't think anyone would be against it. I<br>

just don't know where this is done, so I welcome the knowledge of more<br>

experienced folks.<br>

<br>

  2.2 Failure types<br>

<br>

The next obvious thing is to detect what the error is. If it's an SVN<br>

error, we *really* don't need to get an email. But this raises the<br>

problem that an SVN failure followed by a genuine failure will not be<br>

reported. So, the reporting mechanism also has to know what the<br>

previously *reported* failure was, not just the previous failure.<br>

<br>

Other failures, like timeout, can be either flaky hardware or broken<br>

codegen. A way to be conservative and low noise would be to only warn<br>

on timeouts IFF it's the *second* in a row.<br>

<br>

For all these adjustments, we'll need some form of walk-back on the<br>

history to find the previous genuine result, and we'll need to mark<br>

results with some metadata. This may involve some patches to buildbot.<br>
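<br>
To make the idea concrete, here is a minimal sketch of a reporter that<br>
remembers the last *reported* state. The result kinds ("ok", "fail", "svn",<br>
"timeout") and the Reporter class are hypothetical, standing in for the<br>
metadata we would have to attach to results, and recovery emails are left<br>
out to keep it short:<br>

```python
class Reporter:
    """Decide whether a result deserves a failure email (illustrative only)."""

    def __init__(self):
        self.last_reported = "ok"     # last state people were told about
        self.pending_timeout = False  # a first timeout is waiting for a second

    def observe(self, kind):
        """Return True if this result should trigger a failure email."""
        if kind == "svn":
            # Infrastructure error: never report, and leave all state alone
            # so that a genuine failure right after it is still reported.
            return False
        if kind == "timeout":
            if not self.pending_timeout:
                self.pending_timeout = True  # first timeout: wait and see
                return False
            kind = "fail"  # second timeout in a row is treated as a failure
        else:
            self.pending_timeout = False
        if kind == self.last_reported:
            return False  # nothing new since the last report
        self.last_reported = kind
        return kind == "fail"
```

Because SVN errors are skipped without touching the state, they do not<br>
break a run of timeouts in this sketch, which may or may not be what we want.<br>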

<br>

  2.3 Detecting fixed bots<br>

<br>

Another interesting feature, which is present in the "GreenBot", is a<br>

warning when a bot you broke was fixed. That, per se, is not a good<br>

idea if the noise levels are high, since this will probably double it.<br>

<br>

So, this feature can only be introduced *after* we've done the clean<br>

ups above. But once it's clean, having a "green" email will put the<br>

minds of everyone who hasn't seen the "red" email yet to rest, as<br>

they now know they don't even need to look at it at all, just delete<br>

the email.<br>

<br>

For those using fetchmail, I'm sure you could create a rule to do that<br>

automatically, but that's optional. :)<br>

<br>

  2.4 Detecting new failures<br>

<br>

This is a wish-list that I have, for the case where the bots are slow<br>

and hard to debug and are still red. Assuming everything above is<br>

fixed, they will emit no noise until they go green again. However,<br>

while I'm debugging the first problem, others can appear. If that<br>

happens, *I* want to know, but not necessarily everyone else.<br>

<br>

So, a list of problems reported could be added to the failure report,<br>

and if the failure is different, the bot owner gets an email. This<br>

would have to play nice with exception statuses, as well as spurious<br>

failures like SVN or timeouts, so it's not an easy patch.<br>
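<br>
A sketch of the comparison, assuming each report carries a list of failing<br>
test names (the function and parameter names are invented for illustration,<br>
and exceptions and spurious failures are ignored here):<br>

```python
def recipients(reported, current, owner, blamelist):
    """Return who should be emailed about the current build.

    `reported` holds the failures already announced for this red streak,
    `current` holds the failures seen in the build that just finished.
    """
    reported, current = set(reported), set(current)
    if not current:
        return []  # green build: nothing to announce in this sketch
    if not reported:
        # The bot has just gone red: everyone on the blame list hears about it.
        return sorted(set(blamelist) | {owner})
    if current - reported:
        # Already red, but something new broke: only the owner is told.
        return [owner]
    return []  # same failures as before: stay silent
```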

<br>

The community at large would be already happy with all the changes<br>

minus this one, but folks that have to maintain slow hardware like me<br>

would appreciate this feature. :)<br>

<br>

<br>

<br>

Does anyone have more concerns?<br>

<br>

AFAICS, we should figure out where the walk-back code needs to be<br>

inserted and that would get us 90% of the way. The other 10% will be<br>

to list all the buildbots, check their statuses, owners, and map them into<br>

those categories, and take the appropriate action.<br>

<br>

Maybe we should also reduce the noise in the IRC channel further (like<br>

only first red, first green), but that's not my primary concern right<br>

now. Feel free to look into it if it is for you.<br>

<br>

cheers,<br>

--renato<br>

</blockquote></div><br></div></div>