[llvm-dev] Buildbot Noise

Mon Oct 5 14:28:23 PDT 2015

On Thu, Oct 1, 2015 at 10:31 AM, Renato Golin <renato.golin at linaro.org>
wrote:

> Folks,
>
> David has been particularly militant with broken buildbots recently,
> so to make sure we don't throw the baby out with the bath water, I'd like
> to propose some changes on how we deal with the emails on our
> *current* buildmaster, since there are no concrete plans to move it to
> anything else at the moment.
>
> The main issue is that managing the buildbots is not a simple task. It
> requires build owners to disable on the slave side, or specific people
> on the master side. The former can take as long as the owner wants
> (which is not nice), and the latter refreshes all active bots
> (triggering exceptions) and is harder to revert.
>
> We need to be pragmatic without re-writing the BuildBot product.
>
> Grab some popcorn...
>
> There are two main fronts on which we need to discuss the noise: bot
> and test stability.
>
>
>    1. Bot stability issues
>
> We need to distinguish between four classes of buildbots:
>
>   1.1. Fast && stable && green
>
> These buildbots normally finish under one hour, but most of the time
> under 1/2 hour and should be kept green as much as possible.
> Therefore, any reasonable noise

Not sure what kind of noise you're referring to here. Flaky fast builders
would be a bad thing, still - so that sort of noise should still be
questioned.

> in these bots is welcome, since we
> want them to go back to green as soon as possible.
>
> They're normally the front-line, and usually catch most of the silly
> bugs. But we need some kind of policy that allows us to revert patches
> that break them for more than a few hours.

I'm not sure if we need extra policy here - but I don't mind documenting
the common community behavior here to make it more clear.

Essentially: if you've provided a contributor with a way to reproduce the
issue, and it seems to clearly be a valid issue, revert to green & let them
look at the reproduction when they have time. We do this pretty regularly,
especially outside office hours when we don't expect someone will be
around to revert it themselves. Honestly, I don't see that as a
requirement, though: if you've provided the evidence for them to
investigate, revert first & they can investigate whenever they get to it.

> We have an agreement
> already, and for me that's good enough. People might think
> differently.
>
> With the items 2.x below taken care of, we should keep the current
> state of our bots for this group.
>
>   1.2. One of: Slow || Unstable || Often Red
>
> These bots are special. They're normally *very* important, but have
> some issues, like slow hardware, not too many available boards, or
> bugs that take a long time to bisect and fix.
>

Long bisection is a function of not enough boards (producing large revision
ranges for each run), generally - no? (or is there some other reason?)
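
To make the boards-versus-blame-list relationship concrete, here is a rough
back-of-the-envelope sketch. It assumes commits land at a steady rate and
that boards start builds evenly staggered; the function names and the
numbers in the note below are made up for illustration, not measurements
from any real bot.

```python
import math


def blame_list_size(commits_per_hour, build_hours, boards):
    """Approximate number of commits covered by one build.

    Assumes (hypothetically) a steady commit rate and `boards` identical
    machines starting builds evenly staggered, so a new build starts
    every build_hours / boards hours.
    """
    return max(1, math.ceil(commits_per_hour * build_hours / boards))


def bisection_builds(blame_size):
    """Worst-case number of extra builds to bisect a blame list of that size."""
    # Binary search over n candidates needs ceil(log2(n)) probes.
    return (blame_size - 1).bit_length()
```

With a hypothetical 10 commits per hour and a 4-hour build, one board
gives a blame list of about 40 commits and up to 6 bisection builds; four
boards give about 10 commits and up to 4 builds.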

> These bots catch the *serious* bugs,

Generally all bots catch serious bugs - it's just a long tail: fast easy to
find bugs, then longer tests find the harder to find bugs, and so on and so
forth. (until we get below the value/bug threshold where it's not worth
expending the CPU cycles to find the next bug)

> like self-hosted Clang
> mis-compiling a long-running test which sometimes fails. They can
> produce noise, but when the noise is correct, we really need to listen
> to it. Writing software to understand that is non-trivial.
>

Again, not sure which kind of noise you're referring to here - it'd be
helpful to clarify/disambiguate. Flaky or often-red results on slow
buildbots without enough resources (long blame lists) are pretty easily
ignored ("oh, it could be any of those 20 other people's patches, I'll just
ignore it - someone else will do the work & tell me if it's my fault").

> So, the idea here is to have a few special treatments for each type of
> problem.

But the key is that they are problems that need to be addressed - and
arguably, until they are addressed, these bots should only report to the
owner, not to contributors. As above: if people generally ignore them
because they're not accurate enough to believe that it's 'your' fault, then
they are essentially already leaving the investigation to the owner; they
just have extra email to ignore too. Let's remove that email so that the
mail we do send is more valuable and doesn't get lost in the noise.

> For example, slow bots need more hardware to reduce the blame
> list.

Definitely ^.

> Unstable bots need more work to reduce spurious noise to a
> minimum (see 2.x below), and red bots *must* remain *silent* until
> they come back to green (see 2.x below).
>

As I mentioned on IRC/other threads - having red bots, even if they don't
send email, does come at some cost. It makes dashboards hard to read. So
for those trying to get a sense of the overall state of the project (what's
on fire/what needs to be investigated) this can be problematic. Having
issues XFAILed (with a bug filed, or someone otherwise owning the issue
until the XFAIL is removed) or reverted aggressively or having bots moved
into a separate group so that there's a clear "this is the stuff we should
expect to be green all the time" group that can be eyeballed quickly, is
nice.

> What we *don't* want is to disable or silence them after they're
> green. Most of the bugs they find are hard to debug, so the longer we
> take to fix it the harder it is to find out what happened. We need to
> know as soon as possible when they break.
>

I still question whether these bots provide value to the community as a
whole when they send email. If the investigation usually falls to the
owners rather than the contributors, then the emails they send (& their
presence on a broader dashboard) may not be beneficial.

So to be actionable they need to have small blame lists and be reliable
(low false positive rate). If either of those is compromised, investigation
will fall to the owner and ideally they should not be present in email/core
dashboard groups.

>
>   1.3. Two of: Slow || Unstable || Often Red
>
> These bots are normally only important to their owners, and they are
> on the verge of being disabled.

I don't think they have to be on the verge of being disabled - so long as
they don't send email and are in a separate group, I don't see any problem
with them being on the main llvm buildbot. (no particular benefit either, I
suppose - other than saving the owner the hassle of running their own
master, which is fine)

> The only way to cope with those bots
> is to completely disable their emails / IRC messages, so that no one
> gets flooded with noise from broken bots.
>

Yep

> However, some bots on the 1.2 category fall into this one for short
> periods of time (~1 week), so we need to be careful with what we
> disable here. That's the key baby/bathwater issue.
>
> Any hard policy here will be wrong for some bots some of the time, so
> I'd love if we could all just trust the bot owners a bit when they say
> they're fixing the issue.

It's not a question of trust, from my perspective - regardless of whether
they will address the issue or not, the emails add noise and decrease the
overall trust developers have in the signal (via email, dashboards and IRC)
from the buildbots.

If an issue is being investigated we have tools to deal with that: XFAIL,
revert, and buildbot reconfig (we could/should check if the reconfig for
email configuration can be done without a restart - yes, it still relies on
a buildbot admin to be available (perhaps we should have more people
empowered to reconfig the buildmaster to make this cheaper/easier) but
without the interruption to all builds).

If there's enough hardware that blame lists are small and the bot is
reliable, then reverts can happen aggressively. If not, XFAIL is always an
option too.

> However, bots that fall here for more than a
> month, or more often than a few times during a few months (I'm being
> vague on purpose), then we collectively decide to disable the bot.
>
> What I *don't* want is any two or three guys deciding to disable the
> buildbot of someone else because they can't stand the noise. Remember,
> people do take holidays once in a while, and they may be in the Amazon
> or the Sahara having well deserved rest. Returning to work and
> learning that all your bots are disabled for a week is not nice.
>

I disagree here - if the bots remain red, they should be addressed. This is
akin to committing a problematic patch before you leave - you should
expect/hope it is reverted quickly so that you're not interrupting
everyone's work for a week.

If your bot is not flaky and has short blame lists, I think it's possibly
reasonable to expect that people should revert their patches rather than
disable the bot or XFAIL the test on that platform. But without access to
hardware it may be hard for them to investigate the failure - XFAIL is
probably the right tool, then when the owner is back they can provide a
reproduction, extra logs, help remote-debug it, etc.

> So far, we have coped with noise, and the result is that people tend
> to ignore those bots, which means more work to the bot owner.

The problem is, that work doesn't only fall on the owners of the bots which
produce the noise. It falls on all bot owners because developers become
immune/numb to bot failure mail to a large degree.

> This is
> not a good situation, and we want to move away from it, but we
> shouldn't flip all switches off by default. We can still be pragmatic
> about this as long as we improve the quality overall (see 2.x below)
> with time.
>
> In summary, bots that fall here for too long will have their emails
> disabled and become candidates for removal in the next spring clean-up,
> but not immediately.
>
>   1.4. Slow && Unstable && Red
>
> These bots don't belong here. They should be moved elsewhere,
> preferably to a local buildmaster that you can control and that will
> never email people or upset our master if you need changes. I have
> such a local master myself and it's very easy to setup and maintain.
>

Yep - bots that are only useful to the owner (some of the situations above
I think constitute this situation, but anyway) shouldn't email/show up in
the main buildbot group. But I wouldn't mind if we had a separate grouping
in the dashboards for these bots (I think we have an experimental group
which is somewhat like this). No big deal either way to me. If they're not
sending mail/IRC messages, and they're not in the main group on the
dashboard, I'm OK with it.
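
For reference, the four classes in 1.1 to 1.4 boil down to counting how many
of the three problems (slow, unstable, often red) a bot has. A minimal
sketch of that mapping follows; the `Bot` type, the `POLICY` wording and
the idea that each property is a simple yes/no are assumptions made for
illustration, since the proposal deliberately leaves the thresholds vague.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bot:
    # Hypothetical record; how each flag is measured is left open.
    name: str
    slow: bool
    unstable: bool
    often_red: bool


# Proposed handling per class, paraphrased from sections 1.1 to 1.4.
POLICY = {
    "1.1": "email contributors; revert quickly to keep green",
    "1.2": "special treatment per problem; stay silent while red",
    "1.3": "emails / IRC disabled; candidate for removal if it persists",
    "1.4": "move to a local buildmaster",
}


def classify(bot):
    """Return the section number of the class the bot falls into."""
    problems = sum([bot.slow, bot.unstable, bot.often_red])
    return ("1.1", "1.2", "1.3", "1.4")[problems]
```

Under the stricter view argued in the replies above, anything other than
class 1.1 would report to its owner only and sit in a separate dashboard
group.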

> They *do* have value to *you*, for example to show the progress of
> your features cleaning up the failures, or generating some benchmark
> numbers, but that's something that is very specific to your project
> and should remain separated.
>
> Any of these bots in LLVM Lab should be moved away / removed, but on
> consensus, including the bot owner if he/she is still available in the
> list.
>
>
>    2. Test stability issues
>
> These issues, as you may have noticed from the links above, apply to
> *all* bots. The less noise we have overall, the lower will be our
> threshold for kicking bots out of the critical pool, and the higher
> the value of the not-so-perfect buildbots to the rest of the
> community.
>

I'm not quite sure I follow this comment. The less noise we have, the
/more/ problematic any remaining noise will be (because it'll be costing us
more relative to no-noise - when we have lots of noise, any one specific
source of noise isn't critical, we can remove it but it won't change much -
when there's a little noise, removing any one source substantially
decreases our false positive rate, etc)

>
>   2.1 Failed vs Exception
>
> The most critical issue we have to fix is the "red -> exception ->
> red" issue. Basically, a bot is red (because you're still
> investigating the problem), then someone restarts the master, so you
> get an exception. The next build will be a failure, and the
> buildmaster recognises the status change and emails everyone. That's
> just wrong.
>
> We need to add an extra check to that logic where it searches down for
> the next non-exceptional status and compares to that, not just the
> immediately previous result.
>
> This is a no-brainer and I don't think anyone would be against it. I
> just don't know where this is done, I welcome the knowledge of more
> experienced folks.
>

Yep, sounds like we might be able to have Galina look into that. I have no
context there about where that particular behavior might be (whether it's
in the buildbot code itself, or in the user-provided buildbot
configuration, etc).
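
Wherever it ends up living, the walk-back itself is small. Here is a
minimal sketch, assuming each build result is reduced to one of the
placeholder strings "success", "failure" or "exception"; these are not
buildbot's own constants, and the function names are made up.

```python
def previous_real_status(history):
    """Most recent non-exception status before the latest build, or None.

    `history` is a list of status strings, oldest first.
    """
    for status in reversed(history[:-1]):
        if status != "exception":
            return status
    return None


def should_email(history):
    """Email only when the latest build is a failure and the last
    non-exception result before it was not already a failure."""
    if not history or history[-1] != "failure":
        return False
    return previous_real_status(history) != "failure"
```

With this, "failure, exception, failure" stays silent, while "success,
exception, failure" still sends mail.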

>
>   2.2 Failure types
>
> The next obvious thing is to detect what the error is. If it's an SVN
> error, we *really* don't need to get an email.

Depends on the error - if it's transient, then this is flakiness as always
& should be addressed as such (by trying to remove/address the flakes).
Though, yes, this sort of failure should, ideally, probably, go to the
buildbot owner but not to users.

> But this raises the
> problem that an SVN failure followed by a genuine failure will not be
> reported. So, the reporting mechanism also has to know what's the
> previously *reported* failure, not just the previous failure.
>
> Other failures, like timeout, can be either flaky hardware or broken
> codegen. A way to be conservative and low noise would be to only warn
> on timeouts IFF it's the *second* in a row.
>

I don't think this helps - this reduces the incidence, but isn't a real
solution. We should reduce the flakiness of hardware. If hardware is this
unreliable, why would we be building a compiler for it? No user could rely
on it to produce the right answer. (& again, if the flakiness is bad enough
- I think that goes back to an owner-triaged bot, one that doesn't send
mail, etc)
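
For the sake of discussion, this is roughly what the rules proposed in 2.2
would look like: SVN errors go to the owner only, a timeout is reported
only when it is the second in a row, and the comparison is made against the
last *reported* state rather than the last build. It is a sketch under
those assumptions; the class name and the result strings are invented, and
it does not address the objection above that the flakiness itself should be
fixed.

```python
class Reporter:
    """Decides who, if anyone, hears about each build result.

    Result kinds are placeholder strings: "success", "code_failure",
    "svn_error" and "timeout".
    """

    def __init__(self):
        self.last_reported = "success"  # what contributors were last told
        self.previous_kind = None       # kind of the build just before

    def observe(self, kind):
        """Return "contributors", "owner" or None for one build result."""
        prev, self.previous_kind = self.previous_kind, kind
        if kind == "svn_error":
            return "owner"              # infrastructure problem, not a code bug
        if kind == "timeout" and prev != "timeout":
            return None                 # first timeout: wait for a second one
        if kind == "success":
            self.last_reported = "success"
            return None
        if self.last_reported == "failure":
            return None                 # already reported, stay silent
        self.last_reported = "failure"
        return "contributors"
```

Because an SVN error never updates `last_reported`, a genuine failure
right after it is still reported.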

> For all these adjustments, we'll need some form of walk-back on the
> history to find the previous genuine result, and we'll need to mark
> results with some metadata. This may involve some patches to buildbot.
>

Yeah, having temporally related buildbot results seems dubious/something
I'd be really cautious about.

>   2.3 Detecting fixed bots
>
> Another interesting feature, that is present in the "GreenBot" is a
> warning when a bot you broke was fixed. That, per se, is not a good
> idea if the noise levels are high, since this will probably double it.
>
> So, this feature can only be introduced *after* we've done the clean
> ups above. But once it's clean, having a "green" email will put the
> minds of everyone who hasn't seen the "red" email yet to rest, as
> they now know they don't even need to look at it at all, just delete
> the email.
>
> For those using fetchmail, I'm sure you could create a rule to do that
> automatically, but that's optional. :)
>

Yeah, I don't know what the right solution is here at all - but it
certainly would be handy if there were an easier way to tell if an issue
has been resolved since your commit.

I imagine one of the better options would be some live embedded HTML that
would just show a green square/some indicator that the bot has been green
at least once since this commit.

(that doesn't help if you introduced a flaky test, though... - that's
harder to deal with/convey to users, repeated test execution may be
necessary in that case - that's when temporal information may be useful)
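
The check behind such an indicator could be as simple as the sketch below.
It assumes revisions are increasing integers (as with SVN) and that each
build records the last revision it included; the function name and the
data layout are hypothetical.

```python
def green_since(commit_rev, builds):
    """True if some build that includes `commit_rev` finished green.

    `builds` is an iterable of (revision, status) pairs, where `revision`
    is the newest revision contained in that build.
    """
    return any(rev >= commit_rev and status == "success"
               for rev, status in builds)
```

As noted above, one green build says nothing about a flaky test, so this
only answers the "has it been green at least once" question.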

>
>   2.4 Detecting new failures
>
> This is a wish-list that I have, for the case where the bots are slow
> and hard to debug and are still red. Assuming everything above is
> fixed, they will emit no noise until they go green again, however,
> while I'm debugging the first problem, others can appear. If that
> happens, *I* want to know, but not necessarily everyone else.
>

This seems like the place where XFAIL would help you and everyone else. If
the original test failure was XFAILed immediately, the bot would go green,
then red again if a new failure was introduced. Not only would you know,
but so would the author of the change.
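
A minimal sketch of that effect, treating a build as a set of failing test
names. The function names and test names are invented, and it simplifies
real XFAIL handling by ignoring unexpected passes.

```python
def bot_status(failing, xfailed):
    """Bot is red only if something fails that is not marked XFAIL."""
    unexpected = set(failing) - set(xfailed)
    return "red" if unexpected else "green"


def new_failures(previous_failing, current_failing):
    """Failures present now that were not in the previous report."""
    return sorted(set(current_failing) - set(previous_failing))
```

For example, with a hypothetical `test-a` failing and XFAILed, the bot is
green; if `test-b` then starts failing, the status flips to red and
`new_failures` returns only `test-b`, which is the part the owner wants to
hear about.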

>
> So, a list of problems reported could be added to the failure report,
> and if the failure is different, the bot owner gets an email. This
> would have to play nice with exception statuses, as well as spurious
> failures like SVN or timeouts, so it's not an easy patch.
>
> The community at large would be already happy with all the changes
> minus this one, but folks that have to maintain slow hardware like me
> would appreciate this feature. :)
>
>
>
> Does anyone have more concerns?
>
> AFAICS, we should figure out where the walk-back code needs to be
> inserted and that would get us 90% of the way. The other 10% will be
> to list all the buildbots, check their statuses, owners, and map into
> those categories, and take the appropriate action.
>
> Maybe we should also reduce the noise in the IRC channel further (like
> only first red, first green), but that's not my primary concern right
> now. Feel free to look into it if it is for you.
>
> cheers,
> --renato
>