[llvm-dev] Buildbot Noise

Wed Oct 7 16:16:18 PDT 2015

On 7 October 2015 at 23:54, Eric Christopher <echristo at gmail.com> wrote:
> Basic stability guarantee:
> "Only returns failure for failures due to the compiler or the occasional
> exception"

Ok, in that sense, my bots are very stable.

> I don't know how fast/slow comes into this. See Chris's mail for more
> comments on this. I think you're concentrating too hard on this particular
> axis to the detriment of the discussion. I think a better way is to look at
> it as "signal to noise" ratio.

Chris' CI is orders of magnitude better than ours. In his
infrastructure, speed is a lot less relevant when waiting for a
fix/revert to work (make it green).

I do agree with almost everything else except one thing: We had three
Pandas to solve the speed issue. But more often than I'd like, they'd
pick three consecutive commits and keep dozens of commits waiting.
That reduces the value of maintaining more bots.

> If it's mostly red due to:
> a) instability (exceptions, timeouts, what have you), or
> b) no one looking at the failures, or
> c) can't complete fast enough to deal with the transient red in top of tree

Absolutely agree. None of that applies to our bots.

We haven't had instability issues for a long time. As I said,
Pandas are gone, Junos are fixed. The rest is very stable.

We're *always* looking at failures, but sometimes it takes time to
figure out what to revert, and sometimes there's no test to XFAIL.
These take longer to fix.

> Are they red because the tree is red over their run lifetime or red because
> there are problems that aren't being fixed?

The two examples where I was asked to disable my bots were similar.

There were two separate instances in two separate weeks where a
self-hosting bot spotted a weird bug that the other bots didn't. Marking
the test as XFAIL is not an option, otherwise all the other bots would
then fail.

So, we tried to understand what was going on, but our hardware is
mostly remote and shared, so it took days to get an idea of the
cause. Then, we needed to mark it as unstable, and wait for it to go
back to green.
All that took about 5 days, including the weekend, so in reality, 3
working days. I don't find that flaky, nor unreasonable, nor
unsustainable.

However, during those 5 days, the build master was restarted, and the
bot status went from red to exception and back to red. Since, as I
explained earlier, exception is treated as "success", David got an
email, saw that it had been red for "a long time" and assumed no one was
looking at it.

By coincidence, this happened twice in a row for completely different
reasons and David was emailed twice in two weeks. That's when he
assumed the bots were flaky and no one was trying to fix them.
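
To make the notification problem concrete, here is a minimal sketch in
Python of the behaviour described above. It is not the build master's
actual code: the status names, the helper functions and the rule "send
mail when a build fails and the previous one did not count as a
failure" are all assumptions made for illustration.

```python
# Hypothetical model of the notification rule; not real buildmaster code.
SUCCESS, FAILURE, EXCEPTION = "success", "failure", "exception"


def counts_as_failure(status):
    # Assumption: only FAILURE counts as a failure, so EXCEPTION is
    # effectively treated the same as "success".
    return status == FAILURE


def notifications(history):
    """Return the indices of builds that would trigger a mail.

    A mail is sent when a build fails and the previous build did not
    count as a failure (assumed rule).
    """
    mails = []
    previous_failed = False
    for index, status in enumerate(history):
        failed = counts_as_failure(status)
        if failed and not previous_failed:
            mails.append(index)
        previous_failed = failed
    return mails


# The bot is red, the master restarts (exception), then it is red again.
history = [SUCCESS, FAILURE, FAILURE, EXCEPTION, FAILURE, FAILURE]
print(notifications(history))  # [1, 4]
```

Under these assumptions the same underlying breakage produces a second
mail after the restart, even though nothing new has failed.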

> Seems reasonable. If you're getting actual failures then that seems like
> something reasonable. If you're not trying to get them fixed by getting
> testcases or helping people get a problem that they can see then it may mean
> that since the owner doesn't care then no one does :)

I'm always trying to fix every bug we find. I've always helped
everyone. I've even provided access to our hardware on multiple
occasions when I wasn't able to debug the problem myself.

I worked very hard to reduce the noise of our hardware, and I managed
to get some pretty stable buildbots. That's why I was so shocked when
I was asked to disable my bots twice!

> Honestly I'm not sure if redundant builders are the solution here, but
> rather the phased system. Basically more noise (e.g. they're all going to
> fail) isn't going to help. That said, if they help you reduce time to find
> problems then it's great.

I described some of those problems above, so I agree with you.

Moving to something like the GreenBots seems like the best option.

cheers,
--renato