[LLVMdev] Build bot fatigue

Sun Dec 29 15:59:31 PST 2013

My personal views (by which I always mean that I'm speaking as one of
the compiler engineers employed by ARM but not officially on behalf of
ARM):

On Sun, Dec 29, 2013 at 4:45 PM, dblaikie at gmail.com <dblaikie at gmail.com> wrote:
>
> On Saturday, December 28, 2013 6:05:38 PM, Alp Toker <alp at nuanti.com> wrote:
>
> My inbox has been filled with llvm.buildmaster at lab.llvm.org build
> failure notifications lately.
>
> The two problems appear to be:
>
>   1) Getting notifications for breakage that was introduced by an
> unrelated commit, often in a module I don't work on. Usually the
> original committer is working on or has already landed the necessary fix.
>
>   2) A cascade of dozens of notifications from various build servers
> that continue to flood in over the course of 24 hours after the issue
> was fixed.
>
> These two combine to produce a low signal-to-noise ratio, and in
> practice you have to filter them out, which means you no longer get a
> ping on your phone when you need it.
>
> Presumably a full fix is a non-trivial CI engineering problem, but are
> there simple measures to get the situation back under control?
>
> Doesn't have to be perfect as long as it reduces the dozens of mails
> every day to something more manageable. Ideas:
>
>   1) Only send direct mail when the recipient is the single name in the
> blame list.

I think this would mean lower-performance builders would never
signal their failures, which, as explained below, would be unfortunate.
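
To illustrate the concern, here is a minimal Python sketch of idea 1. The
function name and the example blame lists are hypothetical, and the
assumption that a slow builder covers several commits per build is mine
rather than something measured:

```python
def direct_mail_recipients(blame_list):
    """Idea 1: mail only when exactly one distinct committer is blamed."""
    committers = set(blame_list)
    return committers if len(committers) == 1 else set()

# Hypothetical builds: a fast builder tests one commit per build, while a
# slow builder accumulates several commits between builds.
fast_build_blame = ["alice"]
slow_build_blame = ["alice", "bob", "carol"]

fast_recipients = direct_mail_recipients(fast_build_blame)
slow_recipients = direct_mail_recipients(slow_build_blame)
```

Under these assumptions the fast builder mails one committer, while the
slow builder's failure is mailed to nobody because its blame list always
holds more than one name.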

>   2) Set an In-Reply-To header in order to thread all failure
> notifications related to a specific SVN revision. Most email clients
> will let you silence the thread once you've confirmed the issue has been
> resolved.

This sounds like a reasonable solution.
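
As a rough sketch of how that threading could work (this is not how the
buildmaster is configured today as far as I know): give every failure mail
for one revision the same In-Reply-To and References value. The addresses,
the domain and the Message-ID scheme below are all made up for
illustration; only the header names and Python's standard email module
are real.

```python
from email.message import EmailMessage

def thread_root_id(revision, domain="example.org"):
    # Hypothetical scheme: one deterministic Message-ID per SVN revision.
    return f"<build-failure-r{revision}@{domain}>"

def failure_mail(revision, builder, recipient):
    """Build a failure notification threaded under its revision."""
    msg = EmailMessage()
    msg["From"] = "buildmaster@example.org"  # placeholder address
    msg["To"] = recipient
    msg["Subject"] = f"Build failure on {builder} at r{revision}"
    root = thread_root_id(revision)
    # Mail clients group messages that share these header values.
    msg["In-Reply-To"] = root
    msg["References"] = root
    msg.set_content(f"Builder {builder} failed at r{revision}.")
    return msg

first = failure_mail(198000, "builder-a", "dev@example.org")
second = failure_mail(198000, "builder-b", "dev@example.org")
other = failure_mail(198001, "builder-a", "dev@example.org")
```

Here first and second carry the same In-Reply-To value and so would land
in one thread that can be muted, while other starts a separate thread.
Whether a given client threads on a root message that was never actually
sent is something I have not checked.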

>   3) Or even simpler, don't send failure mail from any builders outside
> the "fast" set? Otherwise the important failures blocking everyone's
> work get drowned out in the noise.

I think it would certainly be helpful to separate the builders into a
set which are sufficiently maintained and reliable to send an email
when something breaks their build/tests, and a more "advisory" set
(e.g., some builders appear to have borderline stability, often
throwing up errors unrelated to the issues under test). I think
declaring that only fast builders get to send emails would have
unfortunate effects on testing native builds on low-power
architectures, which have a slower turn-around but are otherwise quite
reliable. (ARM, my employer, spent quite a bit of effort fixing the ARM
issues that had crept in, work which for various reasons has
transitioned to Linaro now.) Modified in that sense, this also seems a
reasonable solution.

> This isn't new. Just how the bots have always worked.
>
> The biggest thing would be to move bots over to the phased builder
> infrastructure pioneered by Apple (they use it internally and I believe most
> of it has been upstreamed by Daniel Dunbar and David Tweed) that sets up
> dependencies (e.g., testing debug info depends on the compiler passing the
> basic checks first) and reuse/caching of build products (e.g., use the output
> of the basic checks to test the debug info, rather than rebuilding the
> compiler on every builder).

Just to note that I suspect it's someone else you're thinking of
regarding the phased builder. (Although I did quite a bit of work on
the ARM buildbots late last year, I haven't been involved in the phased
builder work.)
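
For anyone unfamiliar with the idea, the dependency and reuse scheme
described in the quoted text can be sketched roughly as follows. This is
only my reading of that description, not the actual phased builder code;
the phase names, the run_phases function and the artifact strings are all
hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def run_phases(deps, checks):
    """Run phases in dependency order, skipping any whose deps did not pass.

    deps maps a phase name to the set of phases it depends on.
    checks maps a phase name to a callable that receives the artifacts
    produced so far and returns (ok, artifact).
    """
    results, artifacts = {}, {}
    for phase in TopologicalSorter(deps).static_order():
        if any(results.get(d) != "passed" for d in deps.get(phase, ())):
            results[phase] = "skipped"  # no notification for this phase
            continue
        ok, artifact = checks[phase](artifacts)
        results[phase] = "passed" if ok else "failed"
        if ok:
            artifacts[phase] = artifact  # later phases reuse this output
    return results

deps = {"basic": set(), "debug-info": {"basic"}}
checks = {
    # Hypothetical: the basic phase builds the compiler once.
    "basic": lambda arts: (True, "compiler-binary"),
    # The debug-info phase reuses that build instead of rebuilding.
    "debug-info": lambda arts: (arts["basic"] == "compiler-binary", None),
}
good = run_phases(deps, checks)

broken_checks = dict(checks, basic=lambda arts: (False, None))
broken = run_phases(deps, broken_checks)
```

With a broken basic phase, only that phase reports a failure and the
debug-info phase is skipped, which is what would cut down the cascade of
duplicate mails.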

-- 
cheers, dave tweed__________________________
high-performance computing and machine vision expert: david.tweed at gmail.com
"while having code so boring anyone can maintain it, use Python." --
attempted insult seen on slashdot