[LLVMdev] [cfe-dev] ARM failures

Tue Jan 8 09:39:43 PST 2013

On Tue, Jan 8, 2013 at 9:25 AM, David Tweed <David.Tweed at arm.com> wrote:
>> Good point. The build bot has been broken for a while and I assumed the person who
>> did that commit would spot it better than I would,
>
> |If the bot isn't configured to send fail-mail to the blame list,
> |people probably won't notice. That's how the buildmaster/bots ended up
> |in the rather multicolored state they are in now.
>
> I think what happens with the buildbots depends on how many commits have landed since the last successful build. During peak commit time (working hours in the US) there can be 10-15 commits between builds. (Conversely it's not too unusual to see 1 commit between builds early in the morning UK time.) I think automated emails are generally only enabled for bots where the average number of commits to be blamed is lower. Otherwise it's manual analysis, but a couple of times I've received emails from Galina when I've committed something that's increased the failures.
>
>> but I shouldn't have
>> assumed that the person would receive my email.
>
>> I'll try to pinpoint the commit
>> and re-send, copying the author.
>
> |Specifically replying to the -commits mailing that committed the break
> |is the most useful - it provides the context & keeps the discussion
> |close to the code that it's related to.
>
> Yes, although there are occasionally instances when there are multiple commits that break tests they don't touch, so it's non-obvious which one is responsible.

Indeed - that's part of the reason why builders need owners who care
about them. I think it's always going to be up to the owners to
investigate in a situation like this where any individual contributor,
not being on/having access to/personally being invested in the
builder, can't really be expected to go out of their way to sift
through commits & decide whether they're to blame. In cases like that
each individual will just assume it's "not their problem" so it must
fall to someone to ensure it doesn't just get dropped on the floor.

The owner should be on any fail-mail thread and, if the issue is not
addressed in a timely manner, should take steps to ensure that the
responsible party is identified (& made aware) and unblocked (clear
repro steps - usually with regression tests like LLVM's, this can be
done by anyone simply by specifying the relevant target triple &
watching the failure - no need to have access to special hardware,
etc). The owner can either fix it themselves (add an explicit triple,
generalize a CHECK line, etc) or wait a reasonable amount of time
(where reasonable depends on the issue, time of day, etc) for a fix
from the author. If no fix is forthcoming, it's not unreasonable to
revert the patch to get the builder back to green.

This needs to be how things happen or bots end up red for too long &
then the buildmaster page is useless as a clear indicator of "is Clang
broken" (because, hey, it's always 'broken' - & people won't know
which builders matter & which ones don't).

>
>> Sorry for the noise.
>
> | Not a problem. Good that people are looking at these things (& I've
> | done the same thing you've done here in the past - because I had no
> | idea what broke & I wanted to see if anyone had ideas/cared).
>
> I think the biggest issue is that if a committer is unlucky (commit just after a buildbot kicks off) it can be 2.25+2.25=4.5 hours (due to two build cycles) before the buildbot turns red. I wish I had a magic suggestion to cure that, but I can't think of any.

Certainly slow builders are problematic. The phase-based building
system David Dean is setting up may help mitigate some of this (it
should make better use of the resources we have, as well as allowing
us to benefit (in the form of smaller blame lists, though not
necessarily lower buildbot result latency) from additional resources
by allowing greater parallelism).

Even at 4.5 hours of turnaround, we don't break these things /that/
often, so a builder being broken even for a whole day isn't the end of
the world. It's the builders broken for weeks & weeks (well beyond the
history/backlog on the build master's console page) that I think we
should seek to avoid/resolve. That being said, yes, shorter turnaround
& more fine-grained blame would be great.
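
To make the arithmetic above concrete, here is a rough sketch of the
two quantities being discussed: worst-case time until a builder goes
red, and the approximate size of the blame list. The function names,
the steady commit rate, and the idea of evenly staggered builders are
assumptions for illustration only, not a description of how the
buildmaster actually schedules work.

```python
def worst_case_hours_to_red(build_hours):
    """Worst-case delay before a breaking commit turns the builder red.

    Assumes the commit lands just after a build has started, so it waits
    out that whole build and then the full build that includes it.
    """
    return 2 * build_hours


def expected_blame_list(commits_per_hour, build_hours, builders=1):
    """Approximate number of commits picked up by one build.

    Assumes a steady commit rate (hypothetical) and `builders` identical
    machines with evenly staggered start times, so a new build starts
    every build_hours / builders hours. Each individual build still
    takes build_hours, so result latency is unchanged.
    """
    return commits_per_hour * build_hours / builders


if __name__ == "__main__":
    # 2.25 hours per build is the figure quoted above; 6 commits/hour is
    # an assumed peak rate chosen to land inside the quoted 10-15 range.
    print(worst_case_hours_to_red(2.25))
    print(expected_blame_list(6, 2.25))
    print(expected_blame_list(6, 2.25, builders=3))
```

With a 2.25 hour build, the worst case is 4.5 hours regardless of how
many machines there are. At the assumed 6 commits per hour, tripling
the builders cuts the blame list from about 13.5 commits to about 4.5,
which is the "smaller blame lists, though not necessarily lower
latency" effect.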

- David