[llvm-dev] Buildbot Noise

Fri Oct 9 10:14:45 PDT 2015

I think we've hit a record in the number of inline replies, here... :)

Let's start fresh...

    Problem #1: What is flaky?

The types of failures of a buildbot:

1. failures because of bad hardware / bad software / bad admin
(timeout, disk full, crash, bad RAM)
2. failures because of infrastructure problems (svn, lnt, etc)
3. failures due to previous or external commits unrelated to the blame
list (intermittent, timeout)
4. results that you don't know how to act on, but you have to
5. clear error messages, easy to act on

In my view, "flaky" is *only* number 1. Everything else is signal.

I agree that bots that cause 1. should be silent, and that failures in
2. and 3. should only be emailed to the bot admin. But category 4
still needs to email the blame list and cannot be ignored, even if
*you* don't know how to act on it.
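
To make the routing above concrete, here is a rough Python sketch. It is
not how buildbot actually works; the names (FailureType, recipients,
bot_admin, blame_list) are made up for illustration, and sending type 5
to the blame list is my assumption, following from "everything else is
signal".

```python
from enum import IntEnum


class FailureType(IntEnum):
    """The five failure types listed above (names are hypothetical)."""
    BAD_BOT = 1           # bad hardware / bad software / bad admin
    INFRASTRUCTURE = 2    # svn, lnt, etc.
    UNRELATED_COMMIT = 3  # previous or external commits
    UNCLEAR_RESULT = 4    # you don't know how to act on it, but have to
    CLEAR_ERROR = 5       # clear error message, easy to act on


def recipients(failure, bot_admin, blame_list):
    """Return who gets an email for a failure of the given type."""
    if failure == FailureType.BAD_BOT:
        # Flaky bot: stay silent.
        return []
    if failure in (FailureType.INFRASTRUCTURE, FailureType.UNRELATED_COMMIT):
        # Not the committers' fault: only the bot admin hears about it.
        return [bot_admin]
    # Types 4 and 5 are signal for the committers (type 5 is an assumption).
    return list(blame_list)
```

The hard part, of course, is not this function but deciding which
FailureType a given red build belongs to.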

Type 2. can easily be separated, but I have yet to see how we are going
to encode which category each failure falls into for types 3. and 4.
One way to work around the problem in 4 is to print the bot owner's
name on the email, so that you know who to reply to for more details
on what to do. How to decide whether your change is unrelated or you
just didn't understand the failure is a big problem. Once all bots are
low-noise, people will tend to assume 4; until then, they will assume
3 or 1.

In agreement?

    Problem #2: Breakage types

Bots can break for a number of reasons in category 4. Some examples:

A. silly, quickly fixed ones, like bad CHECK lines, a missing explicit
triple, tests that belong in target-specific directories, a missing
include file.
B. real problems, like an assert in the code, seg fault, bad test results.
C. hard problems, like bad codegen affecting self-hosting,
intermittent failures in test-suite or self-hosted clang.

Problems of type A. tend to show up by the firehose on ARM, while
they're a lot less common on x86_64 bots just because people develop
on x86_64. Problems B. and C. are equally common on all platforms due
to the complexity of the compiler.

Problems of type B. should behave the same on all platforms. If the
bots are fast enough (either fast hardware or many machines), the
blame list should be small and bisecting should be quick (<1 day).
These are not the problem.
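
As a back-of-the-envelope illustration of why a small blame list keeps
bisection quick: a binary search over n suspect commits needs about
log2(n) builds. The sketch below assumes the last good revision is
known and exactly one commit in the blame list is the culprit; the
function names and the hours-per-build parameter are made up.

```python
def bisect_builds(blame_list_size):
    """Worst-case number of builds to find the culprit by bisection.

    Assumes the revision just before the blame list is known good and
    exactly one commit in the list introduced the failure.
    """
    if blame_list_size < 1:
        raise ValueError("blame list must contain at least one commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    return (blame_list_size - 1).bit_length()


def bisect_hours(blame_list_size, hours_per_build):
    """Worst-case wall-clock hours, with a hypothetical time per build."""
    return bisect_builds(blame_list_size) * hours_per_build
```

With a single commit in the blame list there is nothing to bisect, and
doubling the list only adds one more build.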

Problems of type C., however, are seriously worse on slow targets. Not
only is it slower to build (sometimes 10x slower than on a decent
server), but the testing is hard to get right (because it's
intermittent), and until you get it right, you're actively working on
that (minus sleep time, etc). Since we're talking about an order of
magnitude slower to debug, sleep time becomes a much bigger issue. If
a hard problem takes about 5 hours on fast hardware, it can take up to
50 hours on slow hardware, and no one can work that long in one go.
Even if you do 10 hours straight every day, that's still a full
working week gone.
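
The arithmetic above, spelled out as a small Python sketch. The numbers
are the ones from the paragraph (5 hours, 10x, 10 hours a day); the
function name is made up.

```python
def debug_days(fast_hours, slowdown, hours_per_day):
    """Working days needed to debug on the slow target."""
    slow_hours = fast_hours * slowdown
    return slow_hours / hours_per_day


# 5 hours on fast hardware, 10x slower target, 10 hours of work a day.
days = debug_days(fast_hours=5, slowdown=10, hours_per_day=10)
print(days)  # 5.0 working days
```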

In agreement?

I'll continue later, once we're in agreement over the base facts.

cheers,
--renato