[llvm-dev] buildbot failure in LLVM on clang-cmake-thumbv7-a15-full-sh

Tue Sep 29 11:04:27 PDT 2015

On Tue, Sep 29, 2015 at 10:56 AM, Renato Golin <renato.golin at linaro.org>
wrote:

> On 29 September 2015 at 18:41, David Blaikie <dblaikie at gmail.com> wrote:
> > Is it? While it's failing, the buildbot doesn't seem to be any use to the
> > community at large - it's essentially the buildbot owner's problem at that
> > point and probably shouldn't be engaging with the community until it's
> > green again, I think?
>
> The bot is useful as it still shows if there are new bugs since the
> initial problem, and can help bisect any further problems when they
> come. If we disable that bot, when we fix the issue and bring it back,
> there could be a number of new failures that we didn't monitor and
> that will need a few more days/weeks to remove, especially if they're
> cumulative. This way, it's likely that we'll never have that bot
> online ever again. This is bad for the community.
>

The community generally doesn't pay attention to the bot once it goes red -
so this seems relevant only to the "we didn't monitor" part, and by "we" I
assume you mean you-and-other-people-who-care-about-the-bot, not the
community at large.

I certainly don't look beyond "oh, the bot was already red" and /maybe/ if
you're lucky "oh, a different thing is failing now", but I often don't get
that far owing to the high false positive rate (due to flakes and existing
errors) in the buildbots.

Maybe other people's experiences are different, but I don't have much
evidence to suggest that.

> > Is the buildbot useful to you during this time? Or are you debugging
> > elsewhere/privately?
>
> Both. As I described above, this bot is useful not just to me, but the
> community, as they can cross check if their commits introduced bugs to
> all ARM bots, not just one, and the slow bot will show that.

I don't know about other people, but I don't cross reference bots that
closely. I mostly ignore the low rumble of noise I get back from the
buildbots every time I commit. I have to measure by magnitude (& level of
trust with different bots), which is really not possible for newer
contributors - they won't know what to pay attention to or not. I don't
think it's a sustainable way to run the bots.

> I'm also
> investigating elsewhere, since if I turn this bot off, what I said
> above will happen. I'm also not alone in investigating this, Saleem is
> helping me.
>
>
> > If the buildbot is useful to you, but not the community at large - perhaps
> > we could get in the habit of moving it into a "no email" pool whenever a
> > failure occurs, until it can be cleared up. (hopefully this pool is clearly
> > distinguished from the rest of the buildbots in the waterfall/grid view -
> > because it'd be helpful to be able to look at an easily distinguished subset
> > of the waterfall/grid and see the bots that are expected to be green for any
> > developer there)
>
> Any movement means restarting the buildmaster, which means stopping
> all current builds and upsetting all other bots. If we start taking
> the stance of moving things up and down the priority list, we'll have
> more unstable buildbots and that's worse for the community. Our
> agreement, at least from what I understood, was that we should move
> unstable bots to offline if: they're broken for a while AND no one is
> trying to or can fix it. "A while" is vague because it depends on the
> hardware, and I'm definitely trying to fix it.
>
> It's not because the hardware is slow that it has no value to the
> community, unless you're arguing that we shouldn't test ARM at all,
> which is a whole different story.
>

If the failure mails are not actionable, they're not useful to the
community. If the blame list is too long (or too delayed) it's not likely
to be useful.

If a certain platform just takes a long time (though we could reduce that
with a hybrid approach - cross build the compiler on a fast platform, run
the tests on the other) then it's necessary to put more hardware (multiple
slaves) behind it to reduce the blame lists, I think.

> Not emailing bugs in this bot when it's green means it's probably
> useless,

It doesn't seem useless - it's still a signal to you and other developers
who care about the platform and will investigate failures.

> so I wouldn't want to have any bots in there. I already have
> a separate buildmaster which doesn't email where I test my prototypes,
> but those are work in progress, while my production bots are not.
>
> A neater solution would be to not email *any* buildbot that moves from
> exception to failure if the previous non-exceptional status is also
> failure. This way, we won't have the kind of email that upset you, but
> we still have the value that a red bot provides.
>

Sure, I'd be OK-ish with that, though it'd still make looking at the
waterfall/grid problematic as it is today (though I don't do that often, so
I don't personally care about that). It'd be the same as moving the
buildbot to a "no email" group until fixed, but without the need to cycle
the buildmaster (& with the benefit that it'd happen automatically - though
I'm only suggesting moving it off emailing when there's active
investigation, so the small manual task at the beginning and end of that
cycle doesn't seem too detrimental - no need to do it when someone just
checks in a buildbreak by mistake, etc)
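
To make the proposed rule concrete, here is a minimal sketch in Python of
"don't email on a failure if the previous non-exceptional status was also a
failure". This is only an illustration under stated assumptions: the status
names and the functions last_non_exception() and should_email() are
hypothetical and are not part of the buildmaster's actual API or
configuration.

```python
from typing import Optional, Sequence

# Hypothetical status labels, not the buildmaster's real constants.
SUCCESS, FAILURE, EXCEPTION = "success", "failure", "exception"


def last_non_exception(statuses: Sequence[str]) -> Optional[str]:
    """Return the most recent status that is not an exception, if any."""
    for status in reversed(statuses):
        if status != EXCEPTION:
            return status
    return None


def should_email(previous: Sequence[str], current: str) -> bool:
    """Decide whether a failure mail should go out for the current build.

    `previous` holds earlier build statuses, oldest first.
    Only failures trigger mail, and a failure is suppressed when the
    last non-exceptional build before it had already failed.
    """
    if current != FAILURE:
        return False
    return last_non_exception(previous) != FAILURE


# The bot was already red, hit an exception, then failed again: no mail.
print(should_email([FAILURE, EXCEPTION], FAILURE))
# The bot was green, hit an exception, then failed: mail goes out.
print(should_email([SUCCESS, EXCEPTION], FAILURE))
```

Note that this sketch also suppresses mail for a plain failure following a
failure, which is slightly broader than the exception-to-failure case
described above; whether that is desirable is a separate question.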

- Dave

>
> cheers,
> --renato
>