Generally sounds good, though I'm not sure how much structure is needed for
many of these changes. In the past I've made a point of pushing back on
red bots, or bots sending unhelpful fail mail (noisy, frequent unactionable
failures, especially long blame lists, etc), & removing bots if needed. I
encourage everyone to do more of that, whether as a small dedicated group
or as the community at large.

On Thu, Nov 7, 2019 at 12:09 PM Nico Weber via cfe-dev <
cfe-dev at lists.llvm.org> wrote:

> Hi,
> during LLVM conf we had a roundtable discussing the state of buildbot.
> Here are the notes of what we discussed.
> The summary is that there's lots of appetite for improving the state of
> LLVM's infra, with lots of good shorter and longer term ideas (see below).
> Several people were interested in starting an (open for all who are
> interested) "LLVM infra team", with possibly a dedicated mailing list, and
> with possibly the powers to make infra changes with just consensus from
> people on that infra team.

FWIW, we did have a lab/infrastructure mailing list many years ago & not
much happened with it - perhaps this one will be different, but I don't
think it will necessarily create more authority. Invested parties, I think,
would do just as well to propose things on the usual mailing lists & do the
work to make these changes - changes made without community buy-in will
reasonably be pushed back against in either form.

But if it helps to have a common group to get involved in these
discussions, that's good - so they don't die on the mailing list with no
discussion/progress (we do this in the debug info area by having an
informal grouping - most mailings go out to the list + the usual folks who
are interested in that area).

> (Sorry for the delayed email, I wrote this up right after the meeting but
> forgot to hit "Send".)
> The actual notes:
> Problems with buildbot
> - console view loads slowly
> - many bots take a long time to cycle
> - many bots are perma-red

Some bots are configured as "Do not send email", left for the bot
maintainers to maintain/triage/etc, which I encourage. If that's
insufficiently called out in the UI & is making the console hard to read,
yeah, it'd be great to make that grouping more clear. Or, if it's
impractical to do so, perhaps just say that use case is not supported on
the primary buildbot instance & those folks can run their own CI
infrastructure entirely?

> - test output on some bots is huge due to the bots printing all tests, not
> just failing ones, making it difficult to see failing tests
> - it's sometimes difficult to reproduce failures on the bots locally
> Possible improvements
> - display better machine info on all builders (OS, host compiler with
> detailed version, binutils version, cmake version, ninja version,
> kernel/userspace bitness)
> - require bots to use a cmake cache file, for easy local matching
> execution?
> - remove perma-red bots
> - remove slow bots
> -- or put on faster hw, llvm foundation has funds

Does it have enough funds for significant investment here? What would that
look like (what are the current gaps? How much would they cost to fill? In
what sort of priority ordering?).

> --- what about slow boards?
> ---- decouple build and test phases?
> ---- shard tests over multiple devices?

Yep - Apple internally (maybe externally on green dragon) had/has some form
of tiered buildbot infrastructure - eg: stage 1 build result is a separate
"builder" but its output is used by stage 2/bootstrapping builders, and
test-suite builders, etc. So there's less redundant work and redundant bot
spam. Something like that would be lovely to have - but someone's got to
invest in building/maintaining/etc it & so far no one has - that's what
this has mostly come down to in the past: lots of things people would like,
but no one signing up to build it, maintain it, etc. If you've got folks
with the time/resources, yeah - lots of things on this list & the general
community desire for build infrastructure can be done.
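
To make the tiering idea concrete, here is a minimal sketch in plain Python
(not the buildbot API). The builder names and the run_tiered helper are
hypothetical, made up for illustration: a downstream builder only runs when
everything it depends on has passed, so a broken stage 1 produces one
failure rather than three.

```python
from typing import Callable, Dict, List


def run_tiered(builders: Dict[str, Callable[[], bool]],
               depends_on: Dict[str, List[str]],
               order: List[str]) -> Dict[str, str]:
    """Run builders in the given order; skip any whose upstream did not pass."""
    status: Dict[str, str] = {}
    for name in order:
        upstream = depends_on.get(name, [])
        if any(status.get(dep) != "passed" for dep in upstream):
            # Upstream failed or was skipped: no redundant work, no redundant mail.
            status[name] = "skipped"
            continue
        status[name] = "passed" if builders[name]() else "failed"
    return status


# Hypothetical builders: stage 1 is broken, the others would pass on their own.
builders = {
    "stage1": lambda: False,
    "stage2-bootstrap": lambda: True,
    "test-suite": lambda: True,
}
depends_on = {"stage2-bootstrap": ["stage1"], "test-suite": ["stage1"]}
result = run_tiered(builders, depends_on,
                    ["stage1", "stage2-bootstrap", "test-suite"])
```

Only the builder that reports "failed" would send mail here; the skipped
ones stay quiet until stage 1 is green again.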

My ideal would be a tiered build flow as described above, with a time
window threshold goal (eg: any bot that sends mail must have a way to keep
its cycle period to less than an hour, say - that might mean if it
necessarily runs 2 hours of testing, it has twice the infrastructure so
hourly snapshots can be taken and tested without falling behind) - which
could be achieved with either more hardware, or narrower testing as suited
to the particular scenario. Tiering keeps the redundant noise down, window
threshold keeps the mails targeted/relatively actionable by the recipient.
Also flakiness tolerance should be low. If any of those 3 criteria can't be
met, it should be up to the party interested in that workload to maintain
the bots, triage the failures, and manually send mail that conforms to
those 3 criteria. Even if it takes a human 3 days to investigate a failure,
if that failure has a low false positive rate and is actionable by the
person it goes to (narrow blame list & good reproduction steps) then I
think that's golden. It does mean that 3 days later a revert might not
be immediately viable, though it often is.
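
The "twice the infrastructure" arithmetic above generalizes roughly as
follows. This is a back-of-the-envelope sketch, assuming every machine is
equally fast and each snapshot is tested start to finish on one machine;
workers_needed is a hypothetical helper name, not anything that exists in
the buildbot setup.

```python
import math


def workers_needed(test_minutes: float, snapshot_interval_minutes: float) -> int:
    """Machines required so that snapshots taken every interval never queue up."""
    # Each snapshot occupies one machine for test_minutes, and a new snapshot
    # arrives every snapshot_interval_minutes, so round the ratio up.
    return max(1, math.ceil(test_minutes / snapshot_interval_minutes))


two_hour_suite = workers_needed(120, 60)   # the example from the text
short_suite = workers_needed(45, 60)       # already fits in one window
```

The alternative to adding machines is shrinking test_minutes, i.e. the
narrower testing mentioned above.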

> - make fast bots trigger slow bots, only when fast builds are successful,
> for fewer emails
> - have support tier lists?
> -- e.g. tier 0 pledges bots that cycle in < 15 min, in return are on tier
> 0 waterfall and can revert breakages after ~ 15 min
> -- tier 1 pledges bots that cycle in < 1 day, can revert breakages after
> 1 day

I think the general rule is if you have reproduction steps & it's a
supported scenario - you can revert immediately. I'm not sure of the value
in waiting a day to revert because your bot takes a while to cycle (though I
don't think we have any bots that have a 24 hour cycle time, do we? I guess
if it's a 12 hour cycle you could end up, at worst, 24 hours from patch
submission to result).
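
For what it's worth, the 12 hour to 24 hour estimate works out like this,
assuming a bot that runs back-to-back cycles and tests whatever the tip of
the tree is when each cycle starts (the function name is hypothetical):

```python
def worst_case_result_hours(cycle_hours: float) -> float:
    """Upper bound on time from patch submission to a bot result."""
    # A patch landing just after a cycle starts waits almost a full cycle
    # to be picked up, then needs one more full cycle to be built and tested.
    wait_for_next_cycle = cycle_hours
    build_and_test = cycle_hours
    return wait_for_next_cycle + build_and_test


twelve_hour_bot = worst_case_result_hours(12)
```

The best case, a patch landing just before a cycle starts, is a single
cycle, so the result arrives somewhere between one and two cycle times.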

> - update buildbot to current version?
> -- lots of api changes
> - have pre-commit tests
> -- kuhnel has prototype for this on linux, will send separate
> announcement, positive reception
> --- several requests to have the same for win
> - move build off buildbot to github actions?
> -- jyknight has prototype, works great, except that custom hardware isn't
> (yet?) supported, so cycle times are prohibitively long
> - have a dedicated llvm infra team
> -- dedicated llvm-infra mailing list
> -- and group of deciders with llvm foundation's blessing?
> - have a buildbot view that shows only red bots?
> - have an in-tree script for setting up a build + prereqs (eg. new-enough
> host gcc, gnuwin tools on win, new enough cmake, etc)?

Can be handy - but also can have a large support surface (what platforms
would that be supported on?).

> - current bots don't cover multi-config cmake generators (ie Xcode, msvc
> before 2019)
> -- explicitly say we don't support those? would allow some cleanups
> --- msvc 2019 cmake support generates ninja builds for both debug and
> release and calls ninja for the actual build
> --- maybe do something similar for xcode?

That seems a bit orthogonal to the rest of the discussion, maybe more
suited to the cmake update conversation/thread (though tangential there
too, perhaps).

