[llvm-dev] Buildbot Noise

Mon Oct 19 11:38:20 PDT 2015

On Sat, Oct 10, 2015 at 4:59 AM, Renato Golin <renato.golin at linaro.org>
wrote:

> On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> > Where "software" here is presumably the OS software
>
> Yes. This is the real noise, one that we cannot accept.
>
>
> > I think that misses the common usage of the term "flaky test" (or do the
> > tests themselves end up under (1) or (2)?) or flaky tests due to flaky
> > product code (hash ordering in the output).
>
> Flaky code, either compiler or tests, are the ones that don't fail in
> the correct blame list. Otherwise, even if it was flaky, we don't
> know, because it failed in the right blame list, so it's easy to
> revert or XFAIL.
>
> So, in my categorisation, flaky code ends up in either 3 or 4:
>
> 3, wrong blame list: if the failure is completely independent from the
> blame list, example, misuse of the C++ ABI.
> 4, related, but not directly: if the failure is related, but in ways
> that the patch didn't touch, example, changing related debug info for
> a non-debug patch.
>
> These can be that the original code didn't cope with this future
> change, but the change is semantically valid, or the test CHECK lines
> were poor (like naming explicit registers, etc), and that's why the
> tests broke. The former is harder for the blamed developer to fix, but
> "git blame" can help find the one to help. The latter is a lot easier
> to spot and fix, but is also helped by "git blame". Both actionable,
> but not immediately obvious.
>
>
> > & I disagree here - if most contributors aren't acting on these (for
> > whatever reasons, basically) we should just stop sending them. If at some
> > point we find ways to make them actionable (by having common machine access
> > people can use, documentation on how to proceed, short blame lists, etc -
> > whatever's getting in the way of people acting on these).
>
> I see, your disagreement is temporal.
>
> You're basically saying that, because people ignore them today,
> there's no point in sending them the email today, and it's up to the
> bot owners to make people start paying attention to their bots.
>
> My argument is that I cannot make you care, no matter how stable my
> bots are. And the evidence for that is that my bots are very stable,
> but you're ignoring them, either because you don't understand what a
> flaky bot is, or just out of principle.
>

In the proximal issue - the bot was red for a week. When I see a bot red
for a week, I assume no one cares about it (because I assume that if they
did they would've at least XFAILed the issue so they could get back to
green & catch future issues). That's the question I was asking and the
reason I'm inclined to ignore the email I got from that bot.

As you've pointed out, the reason I got email from the bot was because of
the master restart (red->purple->red), and addressing that would mean I
wouldn't've sent my original email to you (but to other bot masters who had
long-red bots - as you can see, I wasn't singling you out, I was looking at
any bot that had been red for multiple work days). I would still, in the
abstract, disagree with leaving bots red for long periods because it makes
the buildbot status pages hard to read - which things are unknown issues
that someone needs to investigate, and which aren't? XFAIL should represent
the mechanism by which we acknowledge a known failure, get back to green,
and investigate. XFAILing a bootstrap is a bit unknown - perhaps we should
have a way to do that?

Beyond that, I've been talking about flakey failures in general, but that
wasn't my issue with your bot at the time I sent the mail. I have no
opinion on the flakiness of your bot(s). I think we got caught down a
rathole talking about the abstract problems of flakiness, even though when
I sent my last volley of "what's with these bot results" they weren't about
flakiness at all, but /specifically/ about long-red bots that appear
neglected.

> My bots haven't had hardware or OS problems, nor have they timed out or run
> out of disk, for a good number of years. But I can't stop bad testing,
> or bad coding. And, as I've outlined too many times, these affect bots
> like mine more heavily than others. It's the nature of the failures
> plus the nature of my hardware.
>
> I can't make you care about it, so I don't mind if you ignore them,
>

Are there often original contributors, faced with a unique result from
these bots, who are addressing the problem themselves? Or do they usually
have to defer to you or another expert in this hardware, to do some level
of triage/investigation/reproduction first?

> but I *do* mind if you want to shut them off.
>

As I've said before - I'm suggesting not sending mail. I'm not suggesting
turning them off.

It would be little-to-no change to me to do this to my GDB 7.5 bot, for
example - I glance at every failure that comes through anyway. All I'd do
differently is forward anything that I thought looked like a real, unique
failure, to the mailing list/blame list, rather than having it done
automatically. This does not seem terribly onerous. Is it?

> > And I don't think it's that people simply don't care about certain
> > architectures - We see Linux developers fixing Windows and Darwin build
> > breaks, for example. But, yes, more complicated things (I think a large part
> > of the problem is the temporal issue - no matter the architecture, if the
> > results are substantially delayed (even with a short blame list) and the
> > steps to reproduce are not quick/easy, it's easy for people to decide it's
> > not worth the hassle
>
> I think that's an appalling behaviour for a community.
>

I... don't, really. As with my own GDB 7.5 buildbot, I pretty much assume
interesting failures will probably involve me helping to triage (especially
with the Apple engineers explicitly not having access to the source/test
cases run there) the issues. The bot sends me email on every red, and I
treat that as pretty much a thing I need to care about until it's green, as
much as possible by acting as a facilitator to the original contributor who
committed the breakage.

> > - which I think is something we likely have to live
> > with (again, lack of familiarity with a long/complex/inaccessible process
> > means that those developers really aren't in the best place to do the
> > reproduction/check that it was their patch that caused the problem)) do tend
> > to fall to bot owners/people familiar with that platform/hardware, and I
> > think that's totally OK/acceptable/the right thing.
>
> Hum, ok. There are two sides here.
>
> 1. You do care, but can't do anything. In this case, you work with the
> owner to resolve the problem, even if the owner does all the work.
>
> 2. You don't care, and ignore the failure. Here the bot owner has to
> find out on his own and do all the work.
>
> The first is perfectly acceptable, and I'm more than happy to do all
> the work. The second I normally just revert the patch without asking.
>

It's generally not the community policy to revert a patch without providing
actionable reproduction steps, etc. Do you do that? I don't recall seeing
that done. (in general, I think it better to get reproduction steps first,
then revert - sometimes people revert first and provide reproduction much
later (because a reduction takes time, etc) - which I don't think is ideal,
but is sometimes the right tradeoff for the community (if it's obviously
going to be/is a problem for everyone, we're just not all seeing it yet,
etc))

> > What I'm suggesting is that if most developers, most of the time, aren't
> > able to determine this easily, it's not valuable email - if most of the time
> > they have to reach out to the owner for details/clarification, then we
> > should just invert it. Have the bot owner push to the contributor rather
> > than the contributor pull from the bot owner.
>
> The LLVM project has hundreds of committers, dozens of bots have a
> single owner. How does that scale?
>

Most of the bot results are pretty easily actionable - just by reading the
diagnostics from the bots, etc. I run a bot - I glance at every fail mail
that comes from it. It does not seem to be terribly onerous to me to do
this - is it for you? The only time it costs me more than sub-second per
failure is if it's a real issue I need to investigate (OK, if it's actually
a GDB test failure that's just flakey, that costs me a few seconds, but
still not long)

The point is that doing the opposite: sending mail to large blame lists is
strictly higher cost than having a bot owner do the work. A bot owner is 1
person, a large blame list is multiple. It scales better to have 1 person
look at the failure rather than many. Also non-owners are less familiar
with the interesting failures from the bot (or the ongoing state - red or
otherwise) so it costs them more than the owner.

A long red bot is a worse example of this, if it's sending mail even on a
few reds - that's multiple developers looking at the bot to see if they
broke it, when it's already known broken and being investigated. Every one
of those emails is costly/worse scaling than just sending mail to the owner
& having the owner triage/escalate to the contributor.
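
To put rough numbers on that scaling argument, here is a back-of-the-envelope
sketch. Every figure in it (number of red builds, blame list size, seconds per
look) is a made-up assumption for illustration, not a measurement from any bot.

```python
def triage_cost_seconds(failures, lookers_per_failure, seconds_per_look):
    """Total human time spent looking at failure mails."""
    return failures * lookers_per_failure * seconds_per_look

# Hypothetical: 20 red builds while the bot is already known to be broken.
FAILURES = 20

# Owner-triage-first: one person who already knows the bot's ongoing state.
owner_cost = triage_cost_seconds(FAILURES, lookers_per_failure=1,
                                 seconds_per_look=5)

# Mail the blame list: assume 8 committers per blame list, each slower
# because they are unfamiliar with the bot and its current state.
blame_list_cost = triage_cost_seconds(FAILURES, lookers_per_failure=8,
                                      seconds_per_look=60)

print(owner_cost, blame_list_cost)  # prints "100 9600"
```

The exact ratio depends entirely on the assumed inputs; the point is only that
the cost grows with the number of people mailed and with how unfamiliar they
are with the bot.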

> I think this proposal is against the very nature of open source
> projects in general and a horrible engineering decision.

Do you believe there's no quality point in a buildbot notification where it
is not worth sending mail/notification? Where those notifications hurt the
quality (by reducing the signal/noise to the point where we either hurt the
throughput of developers by having substantially redundant (& unskilled in
the specific kinds of failures a certain platform might see) failure
investigation or hurt the quality of the project by people learning to
ignore bot mails in general and thus missing important true positives as
well?)

> I have
> noticed that recently some people have taken the attitude that "if you
> can't keep up with my commits, you're not worth noticing",

Not quite sure what you're referring to here - we seem to be pretty good
about moving fast, but also having important design discussions in the
community (llvm-dev mailing list, etc) when there's input required or
people need a bit of forewarning about a change in direction, etc.

I think it's not too unreasonable to expect people to check some of the
commit history to see what's been going on in an area they're interested in
(if they're contributors - if they're not contributors, yes, we don't tend
to care much), or around some recent failure they're seeing, etc.

> and that's
> the attitude that will get us forked.
>

I don't really see the concern of that (I don't really understand the
chance of this, or what causes projects to be forked, nor the cost if they
are).

>
>
> > They show up often enough cross-OS and build config too (-Asserts, Windows,
> > Darwin, etc).
>
> Ok, good.
>
>
> > Patches should still be reverted, or tests XFAIL - bots shouldn't be left
> > red for hours (especially in the middle of a work day) or a day.
>
> How do you XFAIL a Clang miscompilation of Clang?
>

It's a good question - seems like it'd be something we might want to have
some way of doing. Perhaps we could have some stub test cases that are used
to describe some of these sort of tests.

> How do you revert a failure that is unrelated to the blame list
> because they're from previous or external commits?
>

external?

If they're from previous commits/it's a flakey product issue - that's
tricky, for sure. We don't have good infrastructure for that. It would be
nice to build some (we could run flake detection in off-peak times - tests
that are suspected of being flakey could be run repeatedly to see if they
are, etc), but non-trivial to do so, for sure. For now, I don't know that
that's the long pole - though there are some notable exceptions (windows
filesystem IO caused some ongoing flakes on windows, which I think should
be an issue for those running the windows buildbots)
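
As a minimal sketch of what that off-peak flake detection could look like:
rerun a suspected test several times and call it flaky if the outcomes
disagree. The helper names (`run_command`, `classify`) and the stand-in test
command are hypothetical; this is not existing LLVM or buildbot
infrastructure.

```python
import subprocess
import sys

def run_command(command):
    """Run one test command; True means it exited with status 0."""
    result = subprocess.run(command, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0

def classify(run_once, runs=10):
    """Call `run_once` repeatedly; mixed outcomes mean the test is flaky."""
    if runs < 1:
        raise ValueError("need at least one run")
    outcomes = {run_once() for _ in range(runs)}
    if outcomes == {True}:
        return "pass"
    if outcomes == {False}:
        return "fail"
    return "flaky"

# Stand-in for a real test invocation: a child Python that always exits 0.
always_passes = [sys.executable, "-c", "raise SystemExit(0)"]
print(classify(lambda: run_command(always_passes), runs=3))  # prints "pass"
```

A handful of reruns can only show that a test is flaky, never prove that it
isn't, so the number of runs is a tradeoff against the off-peak machine time
available.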

>
>
> > This can often/mostly be compensated for by having more hardware -
>
> Throw money at the problem? :D
>

Sure, if that's what it takes - we're already paying for the problem with
engineering time. I'm suggesting that maybe that cost shouldn't be
distributed across the project, but rather localized to those invested
(literally, financially) in the behavior of the platforms in question.

> https://www.youtube.com/watch?v=CZmHDEa0Y20
>
>
> > especially for something as mechanical as a bisect. (obviously once you're
> > in manual iterations, more hardware doesn't help much unless you have a few
> > different hypotheses you can test simultaneously)
>
> I don't have infinite hardware, nor infinite space, nor infinite
> power, nor infinite time.
>

None of these things require infinite anything. There's a "reasonable"
level of turnaround that can help quite a bit.

> Certain things take longer than others, and people that are used to
> getting them fast have a lower tolerance for slow(er) processes. Fast
> and slow are completely arbitrary and relative to how slow or fast
> things are between themselves.
>

I don't think they're entirely arbitrary (there are certain broad cutoffs
where the productivity loss is more noticeable as you transition from one
way to another way of doing things (eg: once your build takes more than a
few seconds, you're likely to context switch away then come back to it,
etc)). But even if they are, I don't think it's entirely wrong to strive to
have a system that is fast.

> > Certainly it takes some more engineering effort and there's overhead for
> > dealing with multiple machines, etc. But it's not linearly proportional to
> > machine speed, because some of it can be compensated for.
>
> Right. So, here, I agree with you. It IS possible to improve and make
> it much better.
>
> I'm working on making it better, but it takes time. I can't make it
> work tomorrow, and that's my original point:
>
> We have to improve and be more strict, but we have to grow to get
> there, not to flip the table now. I'm suggesting an exp(x) migration
> plan, not a sig(x).
>

I'm not suggesting flipping any tables. I'm suggesting having owners of
bots that aren't great/easily actionable do the first level triage, then
forward to the relevant contributors. This does not seem to be an
impossibly onerous request - is it? Is there something I'm missing about
this request being unreasonable?

>
>
> > Sure - some issues take a while to investigate. No doubt - but so long as
> > the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's
> > flaky, given the way our buildbots send mail - though I still don't like a
> > red line on the status page, that's costly too) to have the bot red and/or
> > sending mail.
>
> Here, there are two issues:
>
> 1. Buildbots should not email on red->except->red. That's settled, and
> we must ignore those cases from now on, otherwise, we'll keep coming
> back at it. So, assume we don't do that any more.
>

Until that's fixed, again, I don't think it'd be unreasonable to switch
bots that tend to be red for extended periods of time (& are thus more
prone to this problem) to be owner-triage-first.
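
For reference, the "no mail on red->except->red" rule is simple to state as
code. This is only a sketch: the status strings and the `should_notify`
function are assumptions made for illustration, not the buildmaster's actual
configuration or any Buildbot API.

```python
def should_notify(history):
    """Decide whether the newest result warrants mail to the blame list.

    `history` is a list of result strings, oldest first, for example
    ["green", "red", "exception", "red"].
    """
    if not history or history[-1] != "red":
        return False
    # Skip "exception" builds (master restarts, infrastructure hiccups)
    # to find the last real result before this one.
    for previous in reversed(history[:-1]):
        if previous != "exception":
            return previous == "green"
    # No earlier real result: treat the first red as news.
    return True

print(should_notify(["green", "red"]))             # prints "True"
print(should_notify(["red", "exception", "red"]))  # prints "False"
```

Under this rule a master restart in the middle of a long-red stretch produces
no new mail, while a genuine green->red transition still does.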

> 2. If we agree that any flaky bot is turned off, and the master
> behaves correctly (as above), we cannot assume that the constant
> emailing during the investigation phase is due to flakyness. So, if
> you do get an email, it's probably a meaningful reason.
>

Sure - though I have a problem, to a lesser degree, with the buildbot
status page having red results for issues that are known & under
investigation. It would be better if that were not the case (if those bots
were XFAIL'd), but it doesn't relate to email notifications at all, which
is my bigger concern.

>
> We're not there yet, but we're discussing at a higher level here,
> dissecting the issue and finding the problems.
>
>
>
> > The issue is known and being investigated, sending other
> > people mail (or having it show up as red in the dashboard) isn't terribly
> > helpful. It produces redundant work for everyone (they all investigate these
> > issues - or learn to ignore them & thus miss true positives later) on the
> > project.
>
> Chris is investigating the Green Bot infrastructure, which is orders
> of magnitude better than our current. In that scenario, we'll have
> orders of magnitude less redundant work, even if you get a warning
> that you can't act on.
>
> --renato
>
> On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> >
> >
> > On Fri, Oct 9, 2015 at 10:14 AM, Renato Golin <renato.golin at linaro.org>
> > wrote:
> >>
> >> I think we've hit a record in the number of inline replies, here... :)
> >>
> >> Let's start fresh...
> >>
> >>     Problem #1: What is flaky?
> >>
> >> The types of failures of a buildbot:
> >>
> >> 1. failures because of bad hardware / bad software / bad admin
> >> (timeout, disk full, crash, bad RAM)
> >
> >
> > Where "software" here is presumably the OS software, not the software under
> > test (otherwise all actual failures would be (1)), and not infrastructure
> > software because you've called that out as (2).
> >
> >>
> >> 2. failures because of infrastructure problems (svn, lnt, etc)
> >> 3. failures due to previous or external commits unrelated to the blame
> >> list (intermittent, timeout)
> >> 4. results that you don't know how to act on, but you have to
> >> 5. clear error messages, easy to act on
> >>
> >> In my view, "flaky" is *only* number 1. Everything else is signal.
> >
> >
> > I think that misses the common usage of the term "flaky test" (or do the
> > > tests themselves end up under (1) or (2)?) or flaky tests due to flaky
> > product code (hash ordering in the output).
> >
> >>
> >> I agree that bots that cause 1. should be silent, and that failures in
> >> 2. and 3. should be only emailed to the bot admin. But category 4
> >> still needs to email the blame list and cannot be ignored, even if
> >> *you* don't know how to act on.
> >
> >
> > & I disagree here - if most contributors aren't acting on these (for
> > whatever reasons, basically) we should just stop sending them. If at some
> > > point we find ways to make them actionable (by having common machine access
> > > people can use, documentation on how to proceed, short blame lists, etc -
> > whatever's getting in the way of people acting on these).
> >
> > And I don't think it's that people simply don't care about certain
> > architectures - We see Linux developers fixing Windows and Darwin build
> > > breaks, for example. But, yes, more complicated things (I think a large part
> > > of the problem is the temporal issue - no matter the architecture, if the
> > > results are substantially delayed (even with a short blame list) and the
> > > steps to reproduce are not quick/easy, it's easy for people to decide it's
> > > not worth the hassle - which I think is something we likely have to live
> > > with (again, lack of familiarity with a long/complex/inaccessible process
> > > means that those developers really aren't in the best place to do the
> > > reproduction/check that it was their patch that caused the problem)) do tend
> > > to fall to bot owners/people familiar with that platform/hardware, and I
> > think that's totally OK/acceptable/the right thing.
> >
> >>
> >>
> >> Type 2. can easily be separated, but I'm yet to see how are we going
> >> to code in which category each failure lies for types 3. and 4.
> >
> >
> > > Yeah, I don't have any particular insight there either. Ideally I'd hope we
> > > can ensure those issues are rare enough (though I've been seeing some
> > > consistently flaky SVN behavior on my buildbot for the last few months,
> > > admittedly - reached out to Tanya about it, but didn't have much to go on)
> > > that it's probably not worth the engineering effort to filter them out.
> >
> >>
> >> One
> >> way to work around the problem in 4 is to print the bot owner's name
> >> on the email, so that you know who to reply to, for more details on
> >> what to do. How to decide if your change is unrelated or you didn't
> >> understand is a big problem.
> >
> >
> > What I'm suggesting is that if most developers, most of the time, aren't
> > > able to determine this easily, it's not valuable email - if most of the time
> > > they have to reach out to the owner for details/clarification, then we
> > should just invert it. Have the bot owner push to the contributor rather
> > than the contributor pull from the bot owner.
> >
> >>
> >> Once all bots are low-noise, people will
> >> tend more to 4, until then, to 3 or 1.
> >>
> >> In agreement?
> >>
> >>
> >>     Problem #2: Breakage types
> >>
> >> Bots can break for a number of reasons in category 4. Some examples:
> >>
> >> A. silly, quick fixed ones, like bad CHECK lines, missing explicit
> >> triple, move tests to target-specific directories, add an include
> >> file.
> >> B. real problems, like an assert in the code, seg fault, bad test results.
> >> C. hard problems, like bad codegen affecting self-hosting,
> >> intermittent failures in test-suite or self-hosted clang.
> >>
> >> Problems of type A. tend to show by the firehose on ARM, while they're
> >> a lot less common on x86_64 bots just because people develop on
> >> x86_64.
> >
> >
> > > They show up often enough cross-OS and build config too (-Asserts, Windows,
> > > Darwin, etc).
> >
> >>
> >> Problems B. and C. are equally common on all platforms due to
> >> the complexity of the compiler.
> >>
> >> Problems of type B. should have same behaviour in all platforms. If
> >> the bots are fast enough (either fast hardware, or many hardware), the
> >> blame list should be small and bisect should be quick (<1day).
> >
> >
> > Patches should still be reverted, or tests XFAIL - bots shouldn't be left
> > red for hours (especially in the middle of a work day) or a day.
> >
> >>
> >> These are not the problem.
> >>
> >> Problems of type C, however, are seriously worse on slow targets.
> >
> >
> > This can often/mostly be compensated for by having more hardware -
> > > especially for something as mechanical as a bisect. (obviously once you're
> > > in manual iterations, more hardware doesn't help much unless you have a few
> > > different hypotheses you can test simultaneously)
> >
> > Certainly it takes some more engineering effort and there's overhead for
> > > dealing with multiple machines, etc. But it's not linearly proportional to
> > > machine speed, because some of it can be compensated for.
> >
> >>
> >> Not
> >> only it's slower to build (sometimes 10x slower than on a decent
> >> server), but the testing is hard to get right (because it's
> >> intermittent), and until you get it right, you're actively working on
> >> that (minus sleep time, etc). Since we're talking about an order of
> >> magnitude slower to debug, sleep time becomes a much bigger issue. If
> >> a hard problem takes about 5 hours on fast hardware, it can take up to
> >> 50 hours, and in that case, no one can work that long. If you do 10hs
> >> straight every day, it's still a week past.
> >
> >
> > Sure - some issues take a while to investigate. No doubt - but so long as
> > > the issue is live (be it flaky or consistent) it's unhelpful (moreso if it's
> > > flaky, given the way our buildbots send mail - though I still don't like a
> > > red line on the status page, that's costly too) to have the bot red and/or
> > > sending mail. The issue is known and being investigated, sending other
> > > people mail (or having it show up as red in the dashboard) isn't terribly
> > > helpful. It produces redundant work for everyone (they all investigate these
> > > issues - or learn to ignore them & thus miss true positives later) on the
> > project.
> >
> >>
> >>
> >> In agreement?
> >>
> >>
> >> I'll continue later, once we're in agreement over the base facts.
> >>
> >> cheers,
> >> --renato
> >
> >
>