[llvm-dev] [cfe-dev] Buildbot Noise

Robinson, Paul via llvm-dev llvm-dev at lists.llvm.org
Fri Oct 16 07:17:55 PDT 2015


Not to distract from the truly worthwhile discussion going on here,
but let me bring up one notion that I think buildbot currently doesn't
support:

Our internal build/test system can distinguish "has new failure(s)" from
"failed but no new failures" and represent those things differently
on our dashboard.  In public-bot terms this would mean saving the most
recent set of test failures, comparing it to the new set of test failures,
and having a different failure-state if the new set is equal to, or a
proper subset of, the previous set.  This might ameliorate an ongoing-red
situation, as a no-new-fails state wouldn't send blame mail.  But if
there are new fails, the blame mailer can do a set-difference and report
only the new ones. That would reduce the noise a bit, hmm?
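
To make the idea concrete, here is a minimal sketch in plain Python of
the comparison the blame mailer would do; it is not buildbot code, and
the test names are made up for illustration:

    def classify(current_failures, previous_failures):
        """Distinguish 'has new failures' from 'failed, no new failures'."""
        current = set(current_failures)
        new = current - set(previous_failures)   # the set-difference
        if not current:
            return "green", set()
        if not new:
            # Equal to, or a proper subset of, the previous failures:
            # still red, but nobody on the blame list gets mail.
            return "red-no-new-failures", set()
        # Only the new failures would be reported to the blame list.
        return "red-new-failures", new

    previous = {"CodeGen/ARM/old-fail.ll", "DebugInfo/old-fail.ll"}
    current  = {"CodeGen/ARM/old-fail.ll", "Transforms/new-fail.ll"}
    state, new = classify(current, previous)
    print(state, sorted(new))
    # red-new-failures ['Transforms/new-fail.ll']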

--paulr

> -----Original Message-----
> From: cfe-dev [mailto:cfe-dev-bounces at lists.llvm.org] On Behalf Of Renato
> Golin via cfe-dev
> Sent: Saturday, October 10, 2015 5:00 AM
> To: David Blaikie
> Cc: LLVM Dev; Galina Kistanova; Clang Dev
> Subject: Re: [cfe-dev] Buildbot Noise
> 
> On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> > Where "software" here is presumably the OS software
> 
> Yes. This is the real noise, one that we cannot accept.
> 
> 
> > I think that misses the common usage of the term "flaky test" (or do the
> > tests themselves end up under (1) or (2)?) or flaky tests due to flaky
> > product code (hash ordering in the output).
> 
> Flaky code, whether in the compiler or in the tests, is the kind that
> doesn't fail within the correct blame list. Otherwise, even if it was
> flaky, we never find out, because it failed within the right blame
> list, and so it's easy to revert or XFAIL.
> 
> So, in my categorisation, flaky code ends up in either 3 or 4:
> 
> 3, wrong blame list: the failure is completely independent of the
> blame list, for example a misuse of the C++ ABI.
> 4, related, but not directly: the failure is related, but in ways the
> patch didn't touch, for example a non-debug patch changing related
> debug info.
> 
> These can happen because the original code didn't cope with a future
> change that is nevertheless semantically valid, or because the test
> CHECK lines were poor (naming explicit registers, etc.), and that's why
> the tests broke. The former is harder for the blamed developer to fix,
> but "git blame" can help find someone who can. The latter is a lot
> easier to spot and fix, and is also helped by "git blame". Both are
> actionable, but not immediately obvious.
> 
> 
> > & I disagree here - if most contributors aren't acting on these (for
> > whatever reasons, basically) we should just stop sending them. If at
> > some point we find ways to make them actionable (by having common
> > machine access people can use, documentation on how to proceed, short
> > blame lists, etc - whatever's getting in the way of people acting on
> > these), we can start sending them again.
> 
> I see, your disagreement is temporal.
> 
> You're basically saying that, because people ignore them today,
> there's no point in sending them the email today, and it's up to the
> bot owners to make people start paying attention to their bots.
> 
> My argument is that I cannot make you care, no matter how stable my
> bots are. And the evidence for that is that my bots are very stable,
> but you're ignoring them, either because you don't understand what a
> flaky bot is, or just out of principle.
> 
> My bots haven't had hardware or OS problems, timeouts, or full disks
> for a good number of years. But I can't stop bad testing,
> or bad coding. And, as I've outlined too many times, these affect bots
> like mine more heavily than others. It's the nature of the failures
> plus the nature of my hardware.
> 
> I can't make you care about it, so I don't mind if you ignore them,
> but I *do* mind if you want to shut them off.
> 
> 
> > And I don't think it's that people simply don't care about certain
> > architectures - We see Linux developers fixing Windows and Darwin build
> > breaks, for example. But, yes, more complicated things (I think a large
> > part of the problem is the temporal issue - no matter the architecture,
> > if the results are substantially delayed (even with a short blame list)
> > and the steps to reproduce are not quick/easy, it's easy for people to
> > decide it's not worth the hassle
> 
> I think that's an appalling behaviour for a community.
> 
> 
> > - which I think is something we likely have to live with (again, lack
> > of familiarity with a long/complex/inaccessible process means that
> > those developers really aren't in the best place to do the
> > reproduction/check that it was their patch that caused the problem)) do
> > tend to fall to bot owners/people familiar with that platform/hardware,
> > and I think that's totally OK/acceptable/the right thing.
> 
> Hum, ok. There are two sides here.
> 
> 1. You do care, but can't do anything. In this case, you work with the
> owner to resolve the problem, even if the owner does all the work.
> 
> 2. You don't care, and ignore the failure. Here the bot owner has to
> find out on his own and do all the work.
> 
> The first is perfectly acceptable, and I'm more than happy to do all
> the work. In the second case, I normally just revert the patch without
> asking.
> 
> 
> > What I'm suggesting is that if most developers, most of the time, aren't
> > able to determine this easily, it's not valuable email - if most of the
> > time they have to reach out to the owner for details/clarification, then
> > we should just invert it. Have the bot owner push to the contributor
> > rather than the contributor pull from the bot owner.
> 
> The LLVM project has hundreds of committers; dozens of bots have a
> single owner. How does that scale?
> 
> I think this proposal is against the very nature of open source
> projects in general and a horrible engineering decision. I have
> noticed that recently some people have taken the attitude that "if you
> can't keep up with my commits, you're not worth noticing", and that's
> the attitude that will get us forked.
> 
> 
> > They show up often enough cross-OS and build config too (-Asserts,
> > Windows, Darwin, etc).
> 
> Ok, good.
> 
> 
> > Patches should still be reverted, or tests XFAIL - bots shouldn't be
> > left red for hours (especially in the middle of a work day) or a day.
> 
> How do you XFAIL a Clang miscompilation of Clang?
> 
> How do you revert a failure that is unrelated to the blame list
> because it comes from previous or external commits?
> 
> 
> > This can often/mostly be compensated for by having more hardware -
> 
> Throw money at the problem? :D
> https://www.youtube.com/watch?v=CZmHDEa0Y20
> 
> 
> > especially for something as mechanical as a bisect. (obviously once
> > you're in manual iterations, more hardware doesn't help much unless you
> > have a few different hypotheses you can test simultaneously)
> 
> I don't have infinite hardware, nor infinite space, nor infinite
> power, nor infinite time.
> 
> Certain things take longer than others, and people who are used to
> getting results fast have a lower tolerance for slower processes.
> "Fast" and "slow" are not absolutes; they only have meaning relative
> to each other.
> 
> 
> > Certainly it takes some more engineering effort and there's overhead for
> > dealing with multiple machines, etc. But it's not linearly proportional
> > to machine speed, because some of it can be compensated for.
> 
> Right. So, here, I agree with you. It IS possible to improve and make
> it much better.
> 
> I'm working on making it better, but it takes time. I can't make it
> work tomorrow, and that's my original point:
> 
> We have to improve and be more strict, but we have to grow to get
> there, not flip the table overnight. I'm suggesting an exp(x)-shaped
> migration plan, not a sigmoid-shaped one.
> 
> 
> > Sure - some issues take a while to investigate. No doubt - but so long
> > as the issue is live (be it flaky or consistent) it's unhelpful (moreso
> > if it's flaky, given the way our buildbots send mail - though I still
> > don't like a red line on the status page, that's costly too) to have
> > the bot red and/or sending mail.
> 
> Here, there are two issues:
> 
> 1. Buildbots should not email on red -> exception -> red transitions
> (a rough sketch of this rule follows below). That's settled, and we
> must ignore those cases from now on, otherwise we'll keep coming back
> to it. So, assume we don't do that any more.
> 
> 2. If we agree that any flaky bot is turned off, and the master
> behaves correctly (as above), we cannot assume that the constant
> emailing during the investigation phase is due to flakiness. So, if
> you do get an email, there's probably a meaningful reason for it.
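
To illustrate point 1, here is a minimal, hypothetical sketch of that
notification rule in plain Python (it is not buildbot's actual notifier
logic). The idea: a red build only mails the blame list when the last
non-exception result was green, so an exception sandwiched between two
red builds doesn't re-blame anyone.

    GREEN, RED, EXCEPTION = "green", "red", "exception"

    def should_mail_blame_list(history, current):
        """history lists earlier build results, oldest first."""
        if current != RED:
            return False
        # Walk back past infrastructure exceptions; if the last real
        # result was already red, a new red adds no information.
        last_real = next(
            (r for r in reversed(history) if r != EXCEPTION), None)
        return last_real != RED

    assert should_mail_blame_list([GREEN], RED)                # green -> red
    assert not should_mail_blame_list([RED, EXCEPTION], RED)   # red -> exception -> red
    assert not should_mail_blame_list([RED], RED)              # still red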
> 
> We're not there yet, but we're discussing at a higher level here,
> dissecting the issue and finding the problems.
> 
> 
> 
> > The issue is known and being investigated, sending other
> > people mail (or having it show up as red in the dashboard) isn't
> > terribly helpful. It produces redundant work for everyone (they all
> > investigate these issues - or learn to ignore them & thus miss true
> > positives later) on the project.
> 
> Chris is investigating the Green Bot infrastructure, which is orders
> of magnitude better than our current one. In that scenario, we'll have
> orders of magnitude less redundant work, even if you get a warning
> that you can't act on.
> 
> --renato
> 
> On 9 October 2015 at 19:02, David Blaikie <dblaikie at gmail.com> wrote:
> >
> >
> > On Fri, Oct 9, 2015 at 10:14 AM, Renato Golin <renato.golin at linaro.org>
> > wrote:
> >>
> >> I think we've hit a record in the number of inline replies, here... :)
> >>
> >> Let's start fresh...
> >>
> >>     Problem #1: What is flaky?
> >>
> >> The types of failures of a buildbot:
> >>
> >> 1. failures because of bad hardware / bad software / bad admin
> >> (timeout, disk full, crash, bad RAM)
> >
> >
> > Where "software" here is presumably the OS software, not the software
> > under test (otherwise all actual failures would be (1)), and not
> > infrastructure software because you've called that out as (2).
> >
> >>
> >> 2. failures because of infrastructure problems (svn, lnt, etc)
> >> 3. failures due to previous or external commits unrelated to the blame
> >> list (intermittent, timeout)
> >> 4. results that you don't know how to act on, but you have to
> >> 5. clear error messages, easy to act on
> >>
> >> In my view, "flaky" is *only* number 1. Everything else is signal.
> >
> >
> > I think that misses the common usage of the term "flaky test" (or do the
> > tests themselves end up under (1) or (2)?) or flaky tests due to flaky
> > product code (hash ordering in the output).
> >
> >>
> >> I agree that bots that cause 1. should be silent, and that failures in
> >> 2. and 3. should be only emailed to the bot admin. But category 4
> >> still needs to email the blame list and cannot be ignored, even if
> >> *you* don't know how to act on it.
> >
> >
> > & I disagree here - if most contributors aren't acting on these (for
> > whatever reasons, basically) we should just stop sending them. If at
> > some point we find ways to make them actionable (by having common
> > machine access people can use, documentation on how to proceed, short
> > blame lists, etc - whatever's getting in the way of people acting on
> > these), we can start sending them again.
> >
> > And I don't think it's that people simply don't care about certain
> > architectures - We see Linux developers fixing Windows and Darwin build
> > breaks, for example. But, yes, more complicated things (I think a large
> > part of the problem is the temporal issue - no matter the architecture,
> > if the results are substantially delayed (even with a short blame list)
> > and the steps to reproduce are not quick/easy, it's easy for people to
> > decide it's not worth the hassle - which I think is something we likely
> > have to live with (again, lack of familiarity with a
> > long/complex/inaccessible process means that those developers really
> > aren't in the best place to do the reproduction/check that it was their
> > patch that caused the problem)) do tend to fall to bot owners/people
> > familiar with that platform/hardware, and I think that's totally
> > OK/acceptable/the right thing.
> >
> >>
> >>
> >> Type 2 can easily be separated out, but I have yet to see how we are
> >> going to determine, in code, which category each failure falls into
> >> for types 3 and 4.
> >
> >
> > Yeah, I don't have any particular insight there either. Ideally I'd
> > hope we can ensure those issues are rare enough (though I've been
> > seeing some consistently flaky SVN behavior on my buildbot for the last
> > few months, admittedly - reached out to Tanya about it, but didn't have
> > much to go on) that it's probably not worth the engineering effort to
> > filter them out.
> >
> >>
> >> One
> >> way to work around the problem in 4 is to print the bot owner's name
> >> in the email, so that you know who to ask for more details on what to
> >> do. Deciding whether your change is really unrelated, or whether you
> >> just didn't understand the failure, is a big problem.
> >
> >
> > What I'm suggesting is that if most developers, most of the time, aren't
> > able to determine this easily, it's not valuable email - if most of the
> > time they have to reach out to the owner for details/clarification, then
> > we should just invert it. Have the bot owner push to the contributor
> > rather than the contributor pull from the bot owner.
> >
> >>
> >> Once all bots are low-noise, people will tend to treat failures as 4;
> >> until then, they treat them as 3 or 1.
> >>
> >> In agreement?
> >>
> >>
> >>     Problem #2: Breakage types
> >>
> >> Bots can break for a number of reasons in category 4. Some examples:
> >>
> >> A. silly, quickly fixed ones, like bad CHECK lines, a missing explicit
> >> triple, tests that need moving to target-specific directories, or a
> >> missing include file.
> >> B. real problems, like an assert in the code, a seg fault, or bad test
> >> results.
> >> C. hard problems, like bad codegen affecting self-hosting, or
> >> intermittent failures in the test-suite or self-hosted clang.
> >>
> >> Problems of type A tend to show up by the firehose on ARM, while
> >> they're a lot less common on x86_64 bots just because people develop
> >> on x86_64.
> >
> >
> > They show up often enough cross-OS and build config too (-Asserts,
> > Windows, Darwin, etc).
> >
> >>
> >> Problems B and C are equally common on all platforms due to
> >> the complexity of the compiler.
> >>
> >> Problems of type B should behave the same on all platforms. If the
> >> bots are fast enough (either fast hardware, or lots of hardware), the
> >> blame list should be small and a bisect should be quick (<1 day).
> >
> >
> > Patches should still be reverted, or tests XFAIL - bots shouldn't be
> > left red for hours (especially in the middle of a work day) or a day.
> >
> >>
> >> These are not the problem.
> >>
> >> Problems of type C, however, are seriously worse on slow targets.
> >
> >
> > This can often/mostly be compensated for by having more hardware -
> > especially for something as mechanical as a bisect. (obviously once
> > you're in manual iterations, more hardware doesn't help much unless you
> > have a few different hypotheses you can test simultaneously)
> >
> > Certainly it takes some more engineering effort and there's overhead for
> > dealing with multiple machines, etc. But it's not linearly proportional
> > to machine speed, because some of it can be compensated for.
> >
> >>
> >> Not
> >> only is it slower to build (sometimes 10x slower than on a decent
> >> server), but the testing is hard to get right (because it's
> >> intermittent), and until you get it right, you're actively working on
> >> it (minus sleep time, etc.). Since we're talking about an order of
> >> magnitude slower to debug, sleep time becomes a much bigger issue. If
> >> a hard problem takes about 5 hours on fast hardware, it can take up to
> >> 50 hours, and no one can work that long in one stretch. Even if you do
> >> 10 hours straight every day, a week has still gone by.
> >
> >
> > Sure - some issues take a while to investigate. No doubt - but so long
> > as the issue is live (be it flaky or consistent) it's unhelpful (moreso
> > if it's flaky, given the way our buildbots send mail - though I still
> > don't like a red line on the status page, that's costly too) to have
> > the bot red and/or sending mail. The issue is known and being
> > investigated, sending other people mail (or having it show up as red in
> > the dashboard) isn't terribly helpful. It produces redundant work for
> > everyone (they all investigate these issues - or learn to ignore them &
> > thus miss true positives later) on the project.
> >
> >>
> >>
> >> In agreement?
> >>
> >>
> >> I'll continue later, once we're in agreement over the base facts.
> >>
> >> cheers,
> >> --renato
> >
> >

