[llvm-dev] Buildbot Noise

Tue Oct 6 03:49:17 PDT 2015

On 5 October 2015 at 22:28, David Blaikie <dblaikie at gmail.com> wrote:
>> These buildbots normally finish under one hour, but most of the time
>> under 1/2 hour and should be kept green as much as possible.
>> Therefore, any reasonable noise
>
> Not sure what kind of noise you're referring to here. Flaky fast builders
> would be a bad thing, still - so that sort of noise should still be
> questioned.

Sorry, I meant "noise" as in "sound", not as opposed to "signal".

These bots are assumed stable, otherwise they would be in another
category below.

> I'm not sure if we need extra policy here - but I don't mind documenting the
> common community behavior here to make it more clear.

Some people in the community behave very differently from others. I
sent this email because I felt we disagree on some fundamental
properties of the buildbots, and until we agree on a common
strategy, there is no consensus or "common behaviour" to be
documented.

However, I agree, we don't need "policy", just "documented behaviour"
as usual. That was my intention when I said "policy".

> Long bisection is a function of not enough boards (producing large revision
> ranges for each run), generally - no? (or is there some other reason?)

It's not that simple. Some bugs appear after several iterations of
green results. It may sound odd, but I had at least three this year.

These are the hardest bugs to find and usually standard regression
scripts can't find them automatically, so I have to do most of the
investigation manually. This takes *a lot* of time.

> Generally all bots catch serious bugs.

That's not what I meant. Quick bots catch bad new tests (over-assuming
on CHECK lines, forgetting to specify the triple on RUN lines) as well
as simple code issues (32 vs 64 bits, new vs old compiler errors,
etc), just because they're the first to run on a different environment
than the developer uses. Slow bots are most of the time buffered
against those, since patches and fixes (or reverts) tend to come in
bundles, while the slow bot is building.

>> like self-hosted Clang
>> mis-compiling a long-running test which sometimes fails. They can
>> produce noise, but when the noise is correct, we really need to listen
>> to it. Writing software to understand that is non-trivial.
>
> Again, not sure which kind of noise you're referring to here - it'd be
> helpful to clarify/disambiguate.

Noise here is less "sound" and more "noisy signal". Some of the
"noise" in these bots is just noise, the rest is signal masquerading
as noise.

Of course, the higher the noise level, the harder it is to interpret
the signal, but as is usual in science, sometimes the only signal we
have is a noisy one.

It's common for mathematicians to scoff at the physicists' lack of
precision, as it is for them to do the same to chemists, then biologists,
etc. When you're on top, it seems folly that some people endure large
amounts of noise in their signal, but when you're at the bottom and
your only signal has a lot of noise, you have to work with it and make
do with what you have.

As I said above, it's not uncommon for a failure to "pass"
the tests for a few iterations before failing. So, we're not talking
*only* about hardware noise, but also about noise at the code level, where
assumptions based on the host architecture might not be valid on
other architectures. Most of us develop on x86 machines, so it's only
logical that PPC, MIPS and ARM buildbots will fail more often than x86
ones. But that's precisely the point of having those bots in the first
place.

Requesting to disable those bots because they generate noise is the
same as asking people to give their opinion about a product, show the
positive reviews, and sue the rest.

> But they are problems that need to be addressed, is the key - and arguably,
> until they are addressed, these bots should only report to the owner, not to
> contributors.

If we didn't have those bots already for many years, and if we had
another way of testing on those architectures, I'd agree with you. But
we don't.

I agree we need to improve. I agree it's the architecture specific
community's responsibility to do so. I just don't agree that we should
disable all noise (with signal, baby/bath) until we do so.

By the time we get there, all sorts of problems will have crept in,
and we'll enter a vicious cycle. Been there, done that.

> I still question whether these bots provide value to the community as a
> whole when they send email. If the investigation usually falls to the owners
> rather than the contributors, then the emails they send (& their presence on
> a broader dashboard) may not be beneficial.

Benefit is a spectrum. People have different thresholds. Your
threshold is tougher than mine because I'm used to working in an
environment where the noise is almost as loud as the signal.

I don't think we should be bound to either of our thresholds, that's
why I'm opening the discussion to have a migration plan to produce
less noise. But that plan doesn't include killing bots just because
they annoy people.

If you plot a function of value ~ noise OP benefit, you have a surface
with maxima and minima. Your proposal is to set a threshold and cut
all the bots that fall on those minima that are below that line. My
proposal is to move all those bots as high as we can and only then,
cut the bots that didn't make it past the threshold.

> So to be actionable they need to have small blame lists and be reliable (low
> false positive rate). If either of those is compromised, investigation will
> fall to the owner and ideally they should not be present in email/core
> dashboard groups.

Ideally, this is where both of us want to be. Realistically, it'll
take a while to get there.

We need changes in the buildbot area, but there are also inherent
problems that cannot be solved.

Any new architecture (like AArch64) will have only experimental
hardware for years, and later on, experimental kernel, then
experimental tools, etc. When developing a new back-end for a
compiler, those unstable and rapidly evolving environments are the
*only* thing you have to test on.

You normally only have one or two (experimental means either *very*
expensive or priceless), so having multiple boxes per bot is highly
unlikely. It can also mean that the experimental device you got last
month is not supported any more because a new one is coming, so you'll
have to live with those bugs until you get the new one, which will
come with its own bugs.

For older ARM cores (v7), this is less of a problem, but since old ARM
hardware was never designed as production machines, their flakiness is
inherent to their form factor. It is possible to get them into a
stable-enough configuration, but it takes time, resources, excess
hardware and people constantly monitoring the infrastructure. We're
getting there, but we're not there yet.

I agree that this is mostly *my* problem and *I* should fix it, and
believe me I *want* to fix it, I just need a bit more time. I suspect
that the other platform folks feel the same way, so I'd appreciate a
little more respect when we talk about acceptable levels of noise and
effort.

> I disagree here - if the bots remain red, they should be addressed. This is
> akin to committing a problematic patch before you leave - you should
> expect/hope it is reverted quickly so that you're not interrupting
> everyone's work for a week.

Absolutely not!

Committing a patch and going on holidays is a disrespectful act. Bot
maintainers going on holidays is an inescapable fact.

Silencing a bot while the maintainer is away is a possible way around,
but disabling it is most disrespectful.

However, I'd like to remind you of the confirmation bias problem,
where people will look at the bot, think it's noise, and silence the bot
when they could have easily fixed it. Later on, when the owner gets back
to work, the new bugs that weren't caught will come as a surprise and
fill the first weeks. We have to be extra careful when taking actions
without the bot owners' knowledge.

> I'm not quite sure I follow this comment. The less noise we have, the /more/
> problematic any remaining noise will be

Yes, I meant what you said. :)

Less noise, higher bar to meet.

> Depends on the error - if it's transient, then this is flakiness as always &
> should be addressed as such (by trying to remove/address the flakes).
> Though, yes, this sort of failure should, ideally, probably, go to the
> buildbot owner but not to users.

Ideally, SVN errors should go to the site admins, but let's not get
ahead of ourselves. :)

>> Other failures, like timeout, can be either flaky hardware or broken
>> codegen. A way to be conservative and low noise would be to only warn
>> on timeouts IFF it's the *second* in a row.
>
> I don't think this helps - this reduces the incidence, but isn't a real
> solution.

I agree.
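
For reference, the "second timeout in a row" rule quoted above could be
sketched roughly as below. This is only an illustration in Python, not
buildbot code: the result strings and the should_warn helper are made
up for the example.

```python
def should_warn(results):
    """Decide whether the latest build result deserves a warning email.

    `results` is a list of result strings for one bot, oldest first.
    The strings ("success", "failure", "timeout") are hypothetical
    labels for this sketch, not buildbot's own status values.
    """
    if not results:
        return False
    latest = results[-1]
    if latest == "failure":
        # Ordinary failures always warn.
        return True
    if latest == "timeout":
        # A lone timeout may be flaky hardware; warn only if the
        # previous build timed out as well.
        return len(results) >= 2 and results[-2] == "timeout"
    return False
```

As said, this only lowers how often the flakiness shows up. It does
nothing about the flakiness itself.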

> We should reduce the flakiness of hardware. If hardware is this
> unreliable, why would we be building a compiler for it?

Because that's the only hardware that exists.

> No user could rely on it to produce the right answer.

No user is building trunk every commit (ish). Buildbots are not meant
to be as stable as a user (including distros) would require. That's
why we have extra validation on releases.

Buildbots build potentially unstable compilers, otherwise we wouldn't
need buildbots in the first place.

>> For all these adjustments, we'll need some form of walk-back on the
>> history to find the previous genuine result, and we'll need to mark
>> results with some metadata. This may involve some patches to buildbot.
>
> Yeah, having temporally related buildbot results seems dubious/something I'd
> be really cautious about.

This is not temporal, it's just regarding exception as no-change
instead of success.

The only reason it's success right now is that, the way we're
set up to email on every failure, we don't want to spam people when the
master is reloaded.

That's the wrong meaning for the wrong reason.
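
To make "exception as no-change" concrete, here is a rough sketch in
Python of the walk-back I have in mind. It is not buildbot code: the
result strings and both helper names are hypothetical, and it assumes
we only email on a transition into failure, which is not how the
master is configured today.

```python
def effective_result(history):
    """Return the most recent genuine result in `history` (oldest first).

    Builds marked "exception" (e.g. interrupted by a master reload) are
    skipped, so they count neither as success nor as failure. Returns
    None when no genuine result exists.
    """
    for result in reversed(history):
        if result != "exception":
            return result
    return None


def should_email(history):
    """Email only when the genuine state changes into failure.

    Assumption for this sketch: mail is sent on a transition to
    "failure", not on every failing build.
    """
    if not history or history[-1] == "exception":
        # An exception is "no change": never a reason to email.
        return False
    previous = effective_result(history[:-1])
    return history[-1] == "failure" and previous != "failure"
```

With this, a failure, a master reload and another failure is still one
continuous red period, so the second failure sends no new email.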

> I imagine one of the better options would be some live embedded HTML that
> would just show a green square/some indicator that the bot has been green at
> least once since this commit.

That would be cool! But I suspect at the cost of a big change in the
buildbots. Maybe not...

>> This is a wish-list that I have, for the case where the bots are slow
>> and hard to debug and are still red. Assuming everything above is
>> fixed, they will emit no noise until they go green again, however,
>> while I'm debugging the first problem, others can appear. If that
>> happens, *I* want to know, but not necessarily everyone else.
>
> This seems like the place where XFAIL would help you and everyone else. If
> the original test failure was XFAILed immediately, the bot would go green,
> then red again if a new failure was introduced. Not only would you know, but
> so would the author of the change.

I agree in principle. I just worry that it's a lot easier to add an
XFAIL than to remove it later.

Though, it might be just a matter of documenting the common behaviour
and expecting people to follow through.

cheers,
--renato