[llvm-dev] Buildbot Noise

David Blaikie via llvm-dev llvm-dev at lists.llvm.org
Fri Oct 9 11:02:31 PDT 2015


On Fri, Oct 9, 2015 at 10:14 AM, Renato Golin <renato.golin at linaro.org>
wrote:

> I think we've hit a record in the number of inline replies, here... :)
>
> Let's start fresh...
>
>     Problem #1: What is flaky?
>
> The types of failures of a buildbot:
>
> 1. failures because of bad hardware / bad software / bad admin
> (timeout, disk full, crash, bad RAM)
>

Where "software" here is presumably the OS software, not the software under
test (otherwise all actual failures would be (1)), and not infrastructure
software because you've called that out as (2).


> 2. failures because of infrastructure problems (svn, lnt, etc)
> 3. failures due to previous or external commits unrelated to the blame
> list (intermittent, timeout)
> 4. results that you don't know how to act on, but you have to
> 5. clear error messages, easy to act on
>
> In my view, "flaky" is *only* number 1. Everything else is signal.
>

I think that misses the common usage of the term "flaky test" (or do the
tests themselves end up under (1) or (2)?), and it misses flakiness caused by
flaky product code - hash ordering leaking into the output, for example.
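
(Purely as a hypothetical illustration of that hash-ordering case - the pass
name and symbols below are made up: if a pass emits its output in hash-map
order, plain CHECK lines bake one particular order into the test, and
FileCheck's CHECK-DAG is the usual way to make such a test order-independent.)

    ; RUN: opt -S -some-pass < %s | FileCheck %s

    ; Flaky: assumes @foo is always printed before @bar.
    ; CHECK: @foo = global
    ; CHECK: @bar = global

    ; Stable: CHECK-DAG accepts the two lines in either order.
    ; CHECK-DAG: @foo = global
    ; CHECK-DAG: @bar = global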


> I agree that bots that cause 1. should be silent, and that failures in
> 2. and 3. should be only emailed to the bot admin. But category 4
> still needs to email the blame list and cannot be ignored, even if
> *you* don't know how to act on it.
>

I disagree here - if most contributors aren't acting on these (for whatever
reason), we should just stop sending them. If at some point we find ways to
make them actionable (common machine access people can use, documentation on
how to proceed, short blame lists, etc. - whatever's getting in the way of
people acting on these), we can start sending them again.

And I don't think it's that people simply don't care about certain
architectures - we see Linux developers fixing Windows and Darwin build
breaks, for example. But, yes, more complicated problems do tend to fall to
bot owners and to people familiar with that platform/hardware, and I think
that's totally OK/acceptable/the right thing. A large part of the problem is
temporal: no matter the architecture, if the results are substantially
delayed (even with a short blame list) and the steps to reproduce are not
quick/easy, it's easy for people to decide it's not worth the hassle - which
I think is something we likely have to live with. Lack of familiarity with a
long/complex/inaccessible process means those developers really aren't in the
best place to do the reproduction, or to check that it was their patch that
caused the problem.


>
> Type 2. can easily be separated, but I've yet to see how we're going
> to encode which category each failure falls into for types 3. and 4.


Yeah, I don't have any particular insight there either. Ideally I'd hope we
can ensure those issues are rare enough that it's not worth the engineering
effort to filter them out (though, admittedly, I've been seeing some
consistently flaky SVN behavior on my buildbot for the last few months -
I reached out to Tanya about it, but didn't have much to go on).


> One
> way to work around the problem in 4 is to print the bot owner's name
> on the email, so that you know who to reply to, for more details on
> what to do. How to decide if your change is unrelated or you didn't
> understand is a big problem.


What I'm suggesting is that if most developers, most of the time, aren't able
to determine this easily, it's not valuable email - if most of the time they
have to reach out to the owner for details/clarification, then we should just
invert it: have the bot owner push to the contributor rather than the
contributor pull from the bot owner.


> Once all bots are low-noise, people will
> tend more to 4, until then, to 3 or 1.
>
> In agreement?
>
>
>     Problem #2: Breakage types
>
> Bots can break for a number of reasons in category 4. Some examples:
>
> A. silly, quick fixed ones, like bad CHECK lines, missing explicit
> triple, move tests to target-specific directories, add an include
> file.
> B. real problems, like an assert in the code, seg fault, bad test results.
> C. hard problems, like bad codegen affecting self-hosting,
> intermittent failures in test-suite or self-hosted clang.
>
> Problems of type A. tend to show by the firehose on ARM, while they're
> a lot less common on x86_64 bots just because people develop on
> x86_64.


They show up often enough across OSes and build configs too (-Asserts,
Windows, Darwin, etc.).
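
(To make the "missing explicit triple" case concrete - the RUN lines below
are a sketch, not from any real test: a codegen test that doesn't pin a
triple picks up whatever target the bot defaults to, so it can pass on the
author's x86_64 Linux machine and fail on an ARM, Windows or Darwin bot. The
usual type-A fix is a single flag.)

    ; Depends on the bot's default target - type-A breakage on other hosts:
    ; RUN: llc < %s | FileCheck %s

    ; Pinned to an explicit triple - behaves the same on every bot:
    ; RUN: llc -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s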


> Problems B. and C. are equally common on all platforms due to
> the complexity of the compiler.
>
> Problems of type B. should have the same behaviour on all platforms. If
> the bots are fast enough (either fast hardware, or lots of hardware), the
> blame list should be small and bisection should be quick (<1 day).


Patches should still be reverted, or tests XFAILed - bots shouldn't be left
red for hours (especially in the middle of a work day), let alone a day.
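
(For reference, the XFAIL route is a one-line change while the investigation
continues - lit matches XFAIL entries against the target triple and the
configured features, so something along these lines works; the exact spelling
depends on the bot's triple:)

    ; Expected failure on ARM bots until the underlying bug is fixed:
    ; XFAIL: arm

    ; Or, expected failure everywhere:
    ; XFAIL: *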


> These are not the problem.
>
> Problems of type C, however, are seriously worse on slow targets.


This can often/mostly be compensated for by having more hardware - especially
for something as mechanical as a bisect. (Obviously once you're into manual
iterations, more hardware doesn't help much unless you have a few different
hypotheses you can test simultaneously.)

Certainly it takes some more engineering effort and there's overhead for
dealing with multiple machines, etc. But it's not linearly proportional to
machine speed, because some of it can be compensated for.
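
(Back-of-the-envelope, assuming a deterministic failure and roughly
comparable build+test time per machine: with k machines each testing a
different midpoint per round, every round cuts the range by a factor of k+1
instead of 2.)

    rounds ~= ceil(log_(k+1)(n))        for a blame list of n commits
    n = 100, k = 1:  ceil(log2 100) = 7 sequential build/test cycles
    n = 100, k = 3:  ceil(log4 100) = 4 rounds, run in parallel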


> Not only is it slower to build (sometimes 10x slower than on a decent
> server), but the testing is hard to get right (because it's
> intermittent), and until you get it right you're actively working on
> it (minus sleep time, etc). Since we're talking about an order of
> magnitude slower to debug, sleep time becomes a much bigger issue. If
> a hard problem takes about 5 hours on fast hardware, it can take up to
> 50 hours, and no one can work that long in one stretch. Even at 10
> hours straight every day, that's still most of a week.


Sure - some issues take a while to investigate, no doubt. But so long as the
issue is live (be it flaky or consistent), it's unhelpful to have the bot red
and/or sending mail - more so if it's flaky, given the way our buildbots send
mail, though I still don't like a red line on the status page either; that's
costly too. The issue is known and being investigated; sending other people
mail (or having it show up as red on the dashboard) isn't terribly helpful.
It produces redundant work for everyone on the project: they all investigate
these issues, or learn to ignore them and thus miss true positives later.


>
> In agreement?
>
>
> I'll continue later, once we're in agreement over the base facts.
>
> cheers,
> --renato
>

