[llvm-dev] Buildbot Noise

Thu Oct 1 15:01:25 PDT 2015

I agree with almost everything you said.  A couple of comments inline.

On 10/01/2015 10:31 AM, Renato Golin via llvm-dev wrote:
> Folks,
>
> David has been particularly militant with broken buildbots recently,
> so to make sure we don't throw the baby out with the bath water, I'd
> like to propose some changes on how we deal with the emails on our
> *current* buildmaster, since there are no concrete plans to move it to
> anything else at the moment.
>
> The main issue is that managing the buildbots is not a simple task.
> Disabling a bot requires either the build owner to do it on the slave
> side, or specific people to do it on the master side. The former can
> take as long as the owner wants (which is not nice), and the latter
> refreshes all active bots (triggering exceptions) and is harder to
> revert.
>
> We need to be pragmatic without re-writing the BuildBot product.
>
> Grab some popcorn...
>
> There are two main fronts on which we need to discuss the noise: bot
> stability and test stability.
>
>
>     1. Bot stability issues
>
> We need to distinguish between four classes of buildbots:
>
>    1.1. Fast && stable && green
>
> These buildbots normally finish in under one hour, and most of the
> time in under half an hour, and should be kept green as much as
> possible. Therefore, any reasonable noise from these bots is welcome,
> since we want them to go back to green as soon as possible.
>
> They're normally the front-line, and usually catch most of the silly
> bugs. But we need some kind of policy that allows us to revert patches
> that break them for more than a few hours. We have an agreement
> already, and for me that's good enough. People might think
> differently.
>
> With the items 2.x below taken care of, we should keep the current
> state of our bots for this group.
>
>    1.2. One of: Slow || Unstable || Often Red
>
> These bots are special. They're normally *very* important, but have
> some issues, like slow hardware, not too many available boards, or
> bugs that take a long time to bisect and fix.
>
> These bots catch the *serious* bugs, like self-hosted Clang
> mis-compiling a long-running test which sometimes fails. They can
> produce noise, but when the noise is correct, we really need to listen
> to it. Writing software to understand that is non-trivial.
>
> So, the idea here is to have a few special treatments for each type of
> problem. For example, slow bots need more hardware to reduce the blame
> list. Unstable bots need more work to reduce spurious noise to a
> minimum (see 2.x below), and red bots *must* remain *silent* until
> they come back to green (see 2.x below).
>
> What we *don't* want is to disable or silence them after they're
> green. Most of the bugs they find are hard to debug, so the longer we
> take to fix it the harder it is to find out what happened. We need to
> know as soon as possible when they break.
I view the three conditions as warranting somewhat different treatment.  
Specifically:

"slow": these are tolerable, if annoying.

"unstable": these should be removed immediately.  If the failure rate is
more than 1 in 5 builds of a known clean revision, that's far too much
noise to be notifying.  To be clear, I'm specifically referring to
spurious *failures*, not environmental factors which are global to all bots.

"often red": these are extremely valuable (msan, etc.).  Assuming we
only notify on green->red, the only rule we should likely enforce is
that each bot has been green "recently".  I'd suggest a threshold of 2
months.  If it hasn't been green in 2 months, it's not really a build bot.
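To make the two thresholds concrete, here is a minimal sketch of the
checks I have in mind.  The function names and the input format are made
up for illustration, and "2 months" is approximated as 60 days; none of
this is existing buildmaster code.

```python
from datetime import date, timedelta

# Thresholds from the text above; all names here are hypothetical.
MAX_SPURIOUS_FAILURE_RATE = 1 / 5          # "more than 1 in 5 builds"
MAX_TIME_SINCE_GREEN = timedelta(days=60)  # "2 months", approximated


def is_too_unstable(clean_builds):
    """clean_builds: list of bools, one per build of a known clean
    revision, True if that build passed."""
    if not clean_builds:
        return False  # no data, no verdict
    failures = clean_builds.count(False)
    return failures / len(clean_builds) > MAX_SPURIOUS_FAILURE_RATE


def is_stale(last_green, today):
    """last_green: date of the last green build, or None if never green."""
    return last_green is None or today - last_green > MAX_TIME_SINCE_GREEN
```

Exactly 1 failure in 5 clean builds would still be tolerated here, since
the rule is "more than" 1 in 5.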
>    1.3. Two of: Slow || Unstable || Often Red
>
> These bots are normally only important to their owners, and they are
> on the verge of being disabled. The only way to cope with those bots
> is to completely disable their emails / IRC messages, so that no one
> gets flooded with noise from broken bots.
>
> However, some bots on the 1.2 category fall into this one for short
> periods of time (~1 week), so we need to be careful with what we
> disable here. That's the key baby/bathwater issue.
+1.  Any reasonable threshold is fine.  We just need to have one.
>
> Any hard policy here will be wrong for some bots some of the time, so
> I'd love if we could all just trust the bot owners a bit when they say
> they're fixing the issue. However, if a bot falls here for more than a
> month, or more often than a few times over a few months (I'm being
> vague on purpose), then we collectively decide to disable the bot.
>
> What I *don't* want is any two or three guys deciding to disable the
> buildbot of someone else because they can't stand the noise. Remember,
> people do take holidays once in a while, and they may be in the Amazon
> or the Sahara having well deserved rest. Returning to work and
> learning that all your bots are disabled for a week is not nice.
So, maybe I'm missing something, but: why is it any harder to bring a
silenced bot green than an emailing one?
>
> So far, we have coped with noise, and the result is that people tend
> to ignore those bots, which means more work to the bot owner. This is
> not a good situation, and we want to move away from it, but we
> shouldn't flip all switches off by default. We can still be pragmatic
> about this as long as we improve the quality overall (see 2.x below)
> with time.
>
> In summary, bots that fall here for too long will have their emails
> disabled and become candidates for removal in the next spring
> clean-up, but not immediately.
>
>    1.4. Slow && Unstable && Red
>
> These bots don't belong here. They should be moved elsewhere,
> preferably to a local buildmaster that you can control and that will
> never email people or upset our master if you need changes. I have
> such a local master myself and it's very easy to set up and maintain.
>
> They *do* have value to *you*, for example to show the progress of
> your features cleaning up the failures, or generating some benchmark
> numbers, but that's something that is very specific to your project
> and should remain separated.
>
> Any of these bots in LLVM Lab should be moved away / removed, but on
> consensus, including the bot owner if he/she is still available on the
> list.
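As I read it, the four classes in section 1 come down to counting how
many of the three problems (slow, unstable, often red) a bot has.  A
rough sketch of that mapping, with a made-up function name and one-line
summaries of the proposed actions:

```python
def classify_bot(slow, unstable, often_red):
    """Map the three problem flags onto categories 1.1 to 1.4.

    Hypothetical helper; the action strings only paraphrase the proposal.
    """
    problems = sum(bool(flag) for flag in (slow, unstable, often_red))
    categories = {
        0: ("1.1", "keep the current notifications"),
        1: ("1.2", "treat the specific problem, do not disable"),
        2: ("1.3", "silence emails/IRC, removal candidate if it persists"),
        3: ("1.4", "move to a local buildmaster"),
    }
    return categories[problems]
```

For example, a bot that is only slow lands in 1.2, while one that is both
slow and unstable lands in 1.3.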
>
>
>     2. Test stability issues
>
> These issues, as you may have noticed from the links above, apply to
> *all* bots. The less noise we have overall, the lower will be our
> threshold for kicking bots out of the critical pool, and the higher
> the value of the not-so-perfect buildbots to the rest of the
> community.
>
>    2.1 Failed vs Exception
>
> The most critical issue we have to fix is the "red -> exception ->
> red" issue. Basically, a bot is red (because you're still
> investigating the problem), then someone restarts the master, so you
> get an exception. The next build will be a failure, and the
> buildmaster recognises the status change and emails everyone. That's
> just wrong.
>
> We need to add an extra check to that logic where it searches down for
> the next non-exceptional status and compares to that, not just the
> immediately previous result.
>
> This is a no-brainer and I don't think anyone would be against it. I
> just don't know where this is done, so I welcome the knowledge of more
> experienced folks.
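A minimal sketch of that walk-back, independent of where it would live in
the buildmaster.  The result strings and function names are assumptions
made for illustration, not Buildbot's own API.

```python
def previous_real_result(history):
    """history: results of earlier builds, oldest first.

    Returns the most recent result that is not an exception, or None.
    """
    for result in reversed(history):
        if result != "exception":
            return result
    return None


def should_notify(history, current):
    """Notify only on a real status change into failure."""
    if current != "failure":
        return False
    return previous_real_result(history) != "failure"
```

With this, "red -> exception -> red" compares the new failure against the
earlier red build and stays silent, while "green -> exception -> red"
still sends the email.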
>
>    2.2 Failure types
>
> The next obvious thing is to detect what the error is. If it's an SVN
> error, we *really* don't need to get an email. But this raises the
> problem that an SVN failure followed by a genuine failure will not be
> reported. So, the reporting mechanism also has to know what the
> previously *reported* failure was, not just the previous failure.
>
> Other failures, like timeouts, can be either flaky hardware or broken
> codegen. A way to be conservative and low noise would be to warn on a
> timeout only if it's the *second* in a row.
>
> For all these adjustments, we'll need some form of walk-back on the
> history to find the previous genuine result, and we'll need to mark
> results with some metadata. This may involve some patches to buildbot.
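One way the rules in 2.2 could fit together, as a sketch only.  The
"kind" metadata ("svn", "timeout", "test") is exactly the kind of marking
that would have to be added; the names and the tuple format below are
invented for the example.

```python
def decide_notifications(builds):
    """builds: oldest-first list of (result, kind) tuples, where kind is
    hypothetical metadata such as "svn", "timeout" or "test".

    Returns one bool per build: should that build send an email?
    """
    reported_red = False  # is the last *reported* state a failure?
    prev_kind = None      # kind of the previous non-exception build
    decisions = []
    for result, kind in builds:
        notify = False
        if result == "exception":
            decisions.append(False)  # skipped: does not reset any state
            continue
        if result == "success":
            reported_red = False
        elif kind == "svn":
            pass  # infrastructure failure: never reported
        elif kind == "timeout" and prev_kind != "timeout":
            pass  # first timeout in a row: wait for a second one
        elif not reported_red:
            notify = True
            reported_red = True
        prev_kind = kind if result == "failure" else None
        decisions.append(notify)
    return decisions
```

Because the state tracks what was last *reported*, an SVN failure
followed by a genuine failure still produces an email for the genuine
one.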
>
>    2.3 Detecting fixed bots
>
> Another interesting feature, which is present in the "GreenBot", is a
> warning when a bot you broke was fixed. That, per se, is not a good
> idea if the noise levels are high, since this will probably double it.
>
> So, this feature can only be introduced *after* we've done the clean
> ups above. But once it's clean, having a "green" email will put the
> minds of everyone who hasn't seen the "red" email yet to rest, as
> they now know they don't even need to look at it at all, just delete
> the email.
>
> For those using fetchmail, I'm sure you could create a rule to do that
> automatically, but that's optional. :)
>
>    2.4 Detecting new failures
>
> This is a wish-list item of mine, for the case where the bots are slow
> and hard to debug and are still red. Assuming everything above is
> fixed, they will emit no noise until they go green again. However,
> while I'm debugging the first problem, others can appear. If that
> happens, *I* want to know, but not necessarily everyone else.
>
> So, a list of problems reported could be added to the failure report,
> and if the failure is different, the bot owner gets an email. This
> would have to play nice with exception statuses, as well as spurious
> failures like SVN or timeouts, so it's not an easy patch.
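The core of 2.4 looks like a set difference between the failures already
reported and the failures in the current build.  A small sketch, assuming
failures can be identified by test name and leaving out the exception /
SVN / timeout handling; the function names are hypothetical.

```python
def new_failures(previously_reported, current):
    """Failures in the current build that were not reported before."""
    return sorted(set(current) - set(previously_reported))


def owner_should_be_notified(previously_reported, current):
    """On a bot that is already red, email only the owner, and only when
    something new has started failing."""
    return bool(previously_reported) and bool(
        new_failures(previously_reported, current)
    )
```

The first failure on a green bot is not covered here, since that is the
normal green->red email to everyone.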
>
> The community at large would be already happy with all the changes
> minus this one, but folks that have to maintain slow hardware like me
> would appreciate this feature. :)
>
>
>
> Does anyone have more concerns?
>
> AFAICS, we should figure out where the walk-back code needs to be
> inserted and that would get us 90% of the way. The other 10% will be
> to list all the buildbots, check their statuses, owners, and map into
> those categories, and take the appropriate action.
>
> Maybe we should also reduce the noise in the IRC channel further (like
> only first red, first green), but that's not my primary concern right
> now. Feel free to look into it if it is for you.
>
> cheers,
> --renato