<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p><br>
</p>
<div class="moz-cite-prefix">On 1/13/22 1:41 PM, Stella Stamenova
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 15 (filtered
medium)">
<style>@font-face
{font-family:"Cambria Math";
panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
{font-family:"Segoe UI Emoji";
panose-1:2 11 5 2 4 2 4 2 2 3;}@font-face
{font-family:Consolas;
panose-1:2 11 6 9 2 2 4 3 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}pre
{mso-style-priority:99;
mso-style-link:"HTML Preformatted Char";
margin:0in;
font-size:10.0pt;
font-family:"Courier New";}p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
{mso-style-priority:34;
margin-top:0in;
margin-right:0in;
margin-bottom:0in;
margin-left:.5in;
font-size:11.0pt;
font-family:"Calibri",sans-serif;}span.HTMLPreformattedChar
{mso-style-name:"HTML Preformatted Char";
mso-style-priority:99;
mso-style-link:"HTML Preformatted";
font-family:Consolas;}span.EmailStyle21
{mso-style-type:personal-reply;
font-family:"Calibri",sans-serif;
color:windowtext;}.MsoChpDefault
{mso-style-type:export-only;
font-size:10.0pt;
font-family:"Calibri",sans-serif;}div.WordSection1
{page:WordSection1;}ol
{margin-bottom:0in;}ul
{margin-bottom:0in;}</style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
<div class="WordSection1">
<p class="MsoNormal">There are a couple of things on this thread
that sound nice in general, but have not been clarified either
in the discussion or in the documentation. Since the devil is
in the details, I’d like to see us agree on the details and
then have them added to the documentation.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><b><i>At the end of the day, there should
be no surprises in the process and everything that can be
should be quantified.<o:p></o:p></i></b></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">We want to encourage people to be
responsible code and buildbot owners, not discourage them from
contributing at all.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">> It is expected that buildbot owners
own bots which are reliable, informative and helpful to the
community.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">In my experience, every buildbot has
occasional “flakiness” – be it because of code failures that
don’t happen every time or because of connectivity issues,
etc. Some bots are also often broken not because of any
flakiness, but because with the large number of commits, there
are bound to be failures.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">So what makes a bot not reliable enough?
Some percentage of builds failing? Some percentage of false
positives? Does it vary per project or is there a single
expectation for all of llvm?<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">I think it makes sense to say that false
positives above a certain threshold make a buildbot not
reliable enough and the threshold should be documented. It
also makes sense to say that failures above a certain
threshold make a bot not reliable enough – if the codebase is
fragile enough that most commits cause breaks, it is possible
that a reliable buildbot for it cannot exist.</p>
</div>
</blockquote>
<p>This is a hard thing to specify, but I'm going to take a shot at
some draft wording.</p>
<p>We generally expect that publicly notifying builders are stable -
meaning they do not report failures unless those failures are
related to the commit being built. Note that our requirement here
is specific to notification, not the existence of the builder on
the waterfall. <br>
</p>
<p>In general, we expect a buildbot to report, on average, no more
than one false positive failure per day. We will
sometimes allow bots with higher false positive rates due to special
circumstances - e.g. unstable hardware combined with limited
hardware availability for a platform - but these exceptions are
just that: exceptions. They need to be widely discussed before
such a bot is allowed to notify, and the build config must make it
apparent to casual users that the bot may be unstable. <br>
</p>
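<p>To make the draft threshold concrete, here is a minimal sketch of
how it could be checked. It assumes someone has already classified
each failure as related or unrelated to the commits being built; the
record type, field names and function names are hypothetical and are
not part of any existing tooling.</p>

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BuildResult:
    """Hypothetical record of one build on a notifying builder."""
    day: date               # calendar day the build finished
    failed: bool            # the builder reported a failure
    caused_by_commit: bool  # the failure was traced to a commit in the build


def false_positives_per_day(results, days_observed):
    """Average number of reported failures per day not caused by a commit."""
    false_positives = sum(
        1 for r in results if r.failed and not r.caused_by_commit
    )
    return false_positives / days_observed


def may_notify(results, days_observed, threshold=1.0):
    """True if the builder stays at or under the draft threshold."""
    return threshold >= false_positives_per_day(results, days_observed)
```

<p>The default of 1.0 mirrors the "one false positive per day" wording
above; the exception process for unstable hardware is deliberately not
modelled.</p>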
<p><br>
</p>
<blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">> "someone" needs to take the
responsibility to backstop the normal revert to green process.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">As Mehdi pointed out earlier, the root
cause of the failure might mean that the buildbot owner or
that a code owner is better suited to addressing it. Philip’s
argument is that at the end of the day, it is always the
buildbot owner if a code owner hasn’t come forward. It makes
sense to have someone who is ultimately responsible and it
also makes sense that everyone needs to be given time and
notice to act on the failures.
<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">There has also been some mention of
different ways to “silence” a buildbot – either by turning it
off entirely and waiting for a bot owner to reconnect it to
staging or production, or by tagging it as “silent”. In my
experience, there’s a huge difference between using the
“silent” tag and turning a bot off. In the first case, the bot
owners will continue to receive notifications and the builds
will continue to run. Even if the bot is red already, there’s
some chance that new commits that add breaks will be possible
to figure out by looking at the logs either by other
interested parties, or by the bot owners themselves. When a
bot is turned off for any period of time, there’s nothing that
can be used to determine when new failures were checked in
(aside from local builds, so many local builds) and it can be
incredibly painful to track down<b><i>. I think bots should
only be forcefully turned off very rarely and when nothing
else can be done and with plenty of notice.</i></b></p>
</div>
</blockquote>
<p>I completely agree. Up until this thread, I was not aware of an
option to silence a buildbot on the main builder. In fact, it
looks like that mechanism <a moz-do-not-send="true"
href="https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826">only
exists as of the 8th of this month</a>. Now that we have it, we
should definitely use it in favor of disabling a bot entirely. <br>
</p>
<p>This needs to be integrated into the docs. I'll take that action
item. <br>
</p>
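<p>As a rough illustration of the difference between silencing and
disabling, the sketch below models a builder list where a "silent"
tag suppresses blame-list mail while the builder stays configured.
Only the tag name comes from this thread; the dictionary layout,
builder names and helper functions are assumptions for illustration
and should not be read as the actual llvm-zorg configuration.</p>

```python
# Hypothetical builder entries; only the "silent" tag is taken from the thread.
builders = [
    {"name": "example-stable-builder", "tags": ["llvm"]},
    {"name": "example-flaky-builder", "tags": ["lldb", "silent"]},
]


def notifies_blame_list(builder):
    """A builder tagged "silent" keeps building but does not mail the blame list."""
    return "silent" not in builder.get("tags", [])


def blame_list_builders(all_builders):
    """Names of the builders that still notify commit authors."""
    return [b["name"] for b in all_builders if notifies_blame_list(b)]
```

<p>Note that the silenced builder is still in the list, so its build
history stays available, which is the property Stella points out is
lost when a bot is turned off.</p>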
<p><br>
</p>
<blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">So then, what is the flow when a bot starts
having issues? I would propose that it be something like this:<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<ol style="margin-top:0in" type="1" start="1">
<li class="MsoListParagraph"
style="margin-left:0in;mso-list:l4 level1 lfo3">Code owners
have to address issues in X amount of time.<o:p></o:p></li>
<li class="MsoListParagraph"
style="margin-left:0in;mso-list:l4 level1 lfo3">If the code
owners have failed to address the situation, it falls to the
buildbot owners. Perhaps at the beginning or in the middle
of this period, the bot owners get an email that says: “Hey,
so and so, we’re close to tagging the bot “silent”, can you
have a look?”<o:p></o:p></li>
<li class="MsoListParagraph"
style="margin-left:0in;mso-list:l4 level1 lfo3">If both the
code owners and the buildbot owners have failed to address
the situation, the bot gets tagged “silent”. The buildbot
owner gets notified that this happened and the notification
spells out how much longer they have before the bot gets
turned off.<o:p></o:p></li>
<li class="MsoListParagraph"
style="margin-left:0in;mso-list:l4 level1 lfo3">If both the
code owners and the buildbot owners have failed to address
the situation for some time longer, the bot gets turned off.<o:p></o:p></li>
</ol>
<p class="MsoListParagraph"><o:p> </o:p></p>
<p class="MsoNormal">Each of these steps should be allowed a
pre-determined amount of time. A few hours? A few days?
Ideally, each of the transitions (but definitely
2->3->4) come with notifications. If it was possible for
a bot to be moved to staging automatically, we could even have
an extra step where it gets moved to staging before it gets
turned off. I don’t think that’s currently possible though.</p>
</div>
</blockquote>
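<p>For reference, the four-step flow proposed above can be sketched as
a small state machine. The deadlines are placeholders, since the
proposal leaves the amount of time for each step open; the state names
and the function are hypothetical.</p>

```python
from enum import Enum


class BotState(Enum):
    CODE_OWNERS = 1  # step 1: code owners are expected to address the issue
    BOT_OWNER = 2    # step 2: responsibility falls to the buildbot owner
    SILENT = 3       # step 3: the bot is tagged "silent"
    OFF = 4          # step 4: the bot is turned off


# Placeholder deadlines in hours; the thread does not fix these values.
DEADLINE_HOURS = {
    BotState.CODE_OWNERS: 24,
    BotState.BOT_OWNER: 48,
    BotState.SILENT: 168,
}


def escalate(state, hours_in_state):
    """Return (new_state, notify_owner) for a bot that is still broken."""
    deadline = DEADLINE_HOURS.get(state)
    if deadline is None or deadline > hours_in_state:
        # Either the bot is already off, or the deadline has not passed yet.
        return state, False
    # Every transition comes with a notification to the bot owner.
    return BotState(state.value + 1), True
```

<p>A bot that gets fixed would simply leave the flow; that path is
omitted to keep the sketch short.</p>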
<p>Now that we have a silence mechanism, I think we can split our
policy into two pieces.</p>
<p>Part 1 - When do we silence a bot</p>
<p>Part 2 - When do we disable a bot</p>
<p>I think we can afford to have a long and involved process for
part 2. Once a bot is silenced, it doesn't cost much to keep
around, and we basically only need to handle the abandoned bot
problem.</p>
<p>The majority of our focus can be on when we silence a bot. Here
I would argue pretty strongly for a different default: we should
silence and un-silence bots cheaply.</p>
<p>Here's some suggested wording:</p>
<p>If you believe a bot to be unstable, please file a github issue
describing the situation. Please either add the bot owner as the
assignee or email the bot owner directly. If the instability is
frequent - say more than 1 build in 10 - please send a change for
review which silences the builder.</p>
<p>As a bot owner, you are expected to address reported
instability. If you can't do so promptly, please silence the
bot. Once you're ready to unsilence the bot, post a change for
review which does so and describes the action taken to stabilize
the bot. <br>
</p>
<p>(Obviously, this needs to be expanded a bit.)<br>
</p>
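<p>The "more than 1 build in 10" rule of thumb could be checked along
these lines. This is only a sketch: it assumes each recent build has
already been judged unstable or not by a person, and the function
names are made up for illustration.</p>

```python
def unstable_fraction(unstable_flags):
    """Fraction of recent builds judged unstable (True means unstable)."""
    if not unstable_flags:
        return 0.0
    return sum(unstable_flags) / len(unstable_flags)


def should_propose_silencing(unstable_flags, threshold=0.1):
    """True when instability is more frequent than the suggested 1 in 10."""
    return unstable_fraction(unstable_flags) > threshold
```

<p>With exactly one unstable build in ten the check stays false, which
matches the "more than" in the wording; filing an issue would still be
appropriate in that case.</p>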
<blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">> The main problem with flaky tests is
random false blames. People get annoyed and stop paying
attention to failures on a particular builder, and other
builders as well, arguing that build bot in general is not
reliable.<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Galina made a good point to me that people
get annoyed by failures and stop paying attention to all
buildbots. </p>
</div>
</blockquote>
More immediately, people set up mail rules to ignore bots. I know
of multiple people who have these, and was on the edge of doing so
myself. This means that a bot which is spammy effectively only
harasses new contributors which is, ah, less than ideal. <br>
<blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal">I can see how flaky tests/bots contribute
to the general ignoring of the buildbots, but I would argue
that the root cause is the sheer volume of build breaks that
are not the fault of a committer. The few times I’ve made
commits to llvm, for example, I’ve
<b>always</b> gotten at least one email about a break that was
unrelated to my change (because my changes are perfect, thank
you very much). This larger problem of build breaks is much
harder to address than flaky bots or tests, but I think would
improve the health of llvm & friends significantly more
(and in the meantime, we could tolerate some “flakiness”).</p>
</div>
</blockquote>
<p>I will note that Galina and I have been actively working on
attempts to stabilize our existing infrastructure. There's active
work on trying to add mechanisms (e.g. silencing, staged builders,
and maximum batch sizes) to cut down on the problem. Please don't
let "it's hard" become an argument that we should ignore the
problem.</p>
<p>Also, while yes, many of our failures are caused by bad changes, I think
these make up a minority of all failure notices. I haven't checked
anything other than my own trash folder, but that's certainly what
I see. The biggest contributors are blatantly unstable bots and
unreasonably slow batched builders. <br>
</p>
<p><br>
</p>
<blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
<div class="WordSection1">
<p class="MsoNormal"><o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
<p class="MsoNormal">-Stella<o:p></o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<p class="MsoNormal"><o:p> </o:p></p>
<div style="border:none;border-top:solid #E1E1E1
1.0pt;padding:3.0pt 0in 0in 0in">
<p class="MsoNormal"><b>From:</b> llvm-dev
<a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev-bounces@lists.llvm.org">&lt;llvm-dev-bounces@lists.llvm.org&gt;</a> <b>On Behalf Of
</b>Galina Kistanova via llvm-dev<br>
<b>Sent:</b> Wednesday, January 12, 2022 11:24 PM<br>
<b>To:</b> Mehdi AMINI <a class="moz-txt-link-rfc2396E" href="mailto:joker.eph@gmail.com">&lt;joker.eph@gmail.com&gt;</a><br>
<b>Cc:</b> llvm-dev <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev@lists.llvm.org">&lt;llvm-dev@lists.llvm.org&gt;</a><br>
<b>Subject:</b> [EXTERNAL] Re: [llvm-dev] Responsibilities
of a buildbot owner<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">> We may also use this on flaky bots
in the future?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Yes, we may.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Or we may try to do our best to fix
them. :)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Moving workers to the staging
temporarily to investigate and address an issue is fine.
Gives a bit more elbow room for experimenting, as we can
apply experimental patches there, restart the staging as
needed and often, and so on. Which is not the case with
the production. It does not take much effort to move a
worker between the staging and the production areas - a
simple edit of the buildbot.tac file and a worker restart.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Tagging a builder "silent" means there
is a designated person or a team who is actively fixing
the detected issues or acting as a proxy to handle the
blame list. This could be a way to deal with flaky bots,
indeed, assuming there is somebody taking care of those
builders, not just a way to skip the annoyance and keep
the status quo.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">By the way, thanks everyone for the
constructive and polite discussion! It seems we are going
to have a more stable and informative Windows LLDB
builder.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Galina<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">On Wed, Jan 12, 2022 at 9:19 PM Mehdi
AMINI <<a href="mailto:joker.eph@gmail.com"
target="_blank" moz-do-not-send="true">joker.eph@gmail.com</a>>
wrote:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">On Wed, Jan 12, 2022 at 7:33 PM Galina
Kistanova <<a href="mailto:gkistanova@gmail.com"
target="_blank" moz-do-not-send="true">gkistanova@gmail.com</a>>
wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal">Hello everyone,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">In continuation of the
Responsibilities of a buildbot owner thread.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">First of all, thank you very much for
being buildbot owners! This is much appreciated.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Thank you for bringing good points to
the discussion.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">It is expected that buildbot owners
own bots which are reliable, informative and helpful to
the community.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Effectively that means if a problem
is detected by a builder and it is hard to pinpoint the
reason of the issue and a commit to blame, a buildbot
owner is natively on the escalation path. Someone has to
get to the root of the problem and fix it one way or
another (by reverting the commit, or by proposing a
patch, or by working with the author of the commit which
introduced the issue). In the majority of the cases
someone takes care of an issue. But sometimes it takes a
buildbot owner to push. Every buildbot owner does this
from time to time.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Hi Mehdi,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">> Something quite annoying with
staging is that it does not have (as far as I know) a
way<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">> to continue to notify the
buildbot owner.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">You mentioned this recently in one of
the reviews. With <a
href="https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826"
target="_blank" moz-do-not-send="true">
https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826</a>
in place, you can add the tag "silent" to your
production builder, and it will not send notifications
to the blame list. You can set the exact notifications
you want in the master/config/status.py for that
builder. Hope this helps you.<o:p></o:p></p>
</div>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Fantastic! I'll use this for the next
steps for my bots (when I get back to it, I slacked on this
recently...) :)<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">We may also use this on flaky bots in the
future?<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">-- <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Mehdi <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<div>
<p class="MsoNormal">I do not want to have the staging even
able to send emails. We debug and test many things there,
including notifications, and there is always a risk of
spam.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Thanks<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Galina<o:p></o:p></p>
</div>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<p class="MsoNormal">On Sun, Jan 9, 2022 at 6:07 PM David
Blaikie via llvm-dev <<a
href="mailto:llvm-dev@lists.llvm.org" target="_blank"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
wrote:<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">+1 to most of what Mehdi's said here -
I'd love to see improvements in stability, though probably
having some rigid delegation of responsibility (rather than
relying on developers to judge whether it's a flaky test or
flaky bot - that isn't always obvious, maybe it's only flaky
on a particular configuration that that buildbot happens to
test and the developer doesn't have access to - then which
is it?) might help (eg: if it's at all unclear, then the
assumption is that it's always the test or always the
buildbot owner - and an expectation that the author or owner
then takes responsibility for working with the other party
to address the issue, etc).<br>
<br>
That all said, disabling individual tests may risk no one
caring enough to re-enable them, especially when the
flakiness is found long after the change is made that
introduced the test or flakiness (usually the case with
flakiness - it takes a while to become apparent) - I don't
really know how to address that issue. The "convenience"
with disabling a buildbot is that there's other value to the
buildbot (other than the flaky test that was providing
negative value), so buildbot owners have more motivation to
get the bot back online - though I don't want to burden
buildbot owners unduly either (because they'd eventually
give up on them) :/ <br>
<br>
- Dave<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Sat, Jan 8, 2022 at 5:15 PM Mehdi
AMINI via llvm-dev <<a
href="mailto:llvm-dev@lists.llvm.org" target="_blank"
moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<div>
<p class="MsoNormal">Hi,<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">First: thanks a lot Stella for
being a bot owner and providing valuable resources to
the community. The sequence of events is really
unfortunate here, and thank you for bringing it up to
everyone's attention, let's try to improve our
processes.<o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<div>
<div>
<p class="MsoNormal">On Sat, Jan 8, 2022 at 1:01 PM
Philip Reames via llvm-dev <<a
href="mailto:llvm-dev@lists.llvm.org"
target="_blank" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
wrote:<o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p>Stella,<o:p></o:p></p>
<p>Thank you for raising the question. This is a
great discussion for us to have publicly.<o:p></o:p></p>
<p>So folks know, I am the individual Stella
mentioned below. I'll start with a bit of history
so that everyone's on the same page, then dive
into the policy question.<o:p></o:p></p>
<p>My general take is that buildbots are only useful
if failure notifications are generally
actionable. A couple months back, I was on the
edge of setting up mail filter rules to
auto-delete a bunch of bots because they were
regularly broken, and decided I should try to be
constructive first. In the first wave of that, I
emailed a couple of bot owners about things which
seemed like false positives.
<o:p></o:p></p>
<p>At the time, I thought it was the bot owner's
responsibility to not be testing a flaky
configuration. I got a bit of push back on that
from a couple sources - Stella was one - and put
that question on hold. This thread is a great
opportunity to decide what our policy actually is,
and document it. <o:p></o:p></p>
<p>In the meantime, I've been working with Galina to
document existing practice where we could, and to
try to identify best practices on setting up
bots. These changes have been posted publicly,
and reviewed through the normal process. We've
been deliberately trying to stick to
non-controversial stuff as we got the docs
improved. I've been actively reaching out to bot
owners to gather feedback in this process, but
Stella had not, yet, been one.
<o:p></o:p></p>
<p>Separately, this week I noticed a bot which was
repeatedly toggling between red and green. I
forget the exact ratio, but in the recent build
history, there were multiple transitions,
seemingly unrelated to the changes being
committed. I emailed Galina asking her to
address, and she removed the buildbot until it
could be moved to the staging buildmaster,
addressed, and then restored. I left Stella off
the initial email. Sorry about that, no ill
intent, just written in a hurry.
<o:p></o:p></p>
<p>Now, transitioning into a bit of policy
discussion...<o:p></o:p></p>
<p>From my conversations with existing bot owners,
there is a general agreement that bots should only
be notifying the community if they are stable
enough. There's honest disagreement on what the
bar for stable enough is, and disagreement about
exactly whose responsibility addressing new
instability is. (To be clear, I'd separate
instability from a clear deterministic breakage
caused by a commit - we have a lot more agreement
on that.)<o:p></o:p></p>
<p>My personal take is that for a bot to be publicly
notifying, "someone" needs to take the
responsibility to backstop the normal revert to
green process. This "someone" can be developers
who work in a particular area, the bot owner, or
some combination thereof. I view the
responsibility of the bot config owner as being
the person responsible for making sure that
backstopping is happening. Not necessarily by
doing it themselves, but by having the contacts
with developers who can, and following up when the
normal flow is not working.<o:p></o:p></p>
<p>In this particular example, we appear to have a
bunch of flaky lldb tests. I personally know
absolutely nothing about lldb. I have no idea
whether the tests are badly designed, the system
they're being run on isn't yet supported by lldb,
or if there's some recent code bug introduced
which causes the failure. "Someone" needs to take
the responsibility of figuring that out, and in
the meantime spamming developers with inactionable
failure notices seems undesirable.
<o:p></o:p></p>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">I generally agree with the
overall sentiment. I would add that something worth
differentiating is that the source of flakiness can
be coming from the bot itself (flaky hardware /
fragile setup), or from the test/codebase itself (a
flaky bot may just be a deterministic ASAN failure).<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Of course from Philip's point of
view it does not matter: the effect on the developer
is similar, we get undesirable and unactionable
notifications. From the maintenance flow however, it
matters in that the "someone" who has to take
responsibility is often not the same group of folks.<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Also when encountering flaky
tests, the best action may not be to disable the bot
itself but instead to disable the test itself! (and
file a bug against the test owner...).<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">One more dimension that seems to
surface here may be different practices or
expectations across subprojects, for example here
the LLDB folks may be used to having some flaky
tests, but they trigger on changes to LLVM itself,
where we may not expect any flakiness (or so).<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p>For context, the bot was disabled until it could
be moved to the staging buildmaster. Moving to
staging is required (currently) to disable
developer notification. In the email from Galina,
it seems clear that the bot would be fine to move
back to production once the issue was triaged.
This seems entirely reasonable to me. <o:p></o:p></p>
</div>
</blockquote>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">Something quite annoying with
staging is that it does not have (as far as I know)
a way to continue to notify the buildbot owner. I
don't really care about staging vs prod as much as
having a mode to just "not notify the blame list" /
"only notify the owner".<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal">-- <o:p></o:p></p>
</div>
<div>
<p class="MsoNormal">Mehdi<o:p></o:p></p>
</div>
<div>
<p class="MsoNormal"><o:p> </o:p></p>
</div>
<div>
<p class="MsoNormal"> <o:p></o:p></p>
</div>
<blockquote style="border:none;border-left:solid #CCCCCC
1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
<div>
<p>Philip<o:p></o:p></p>
<p>p.s. One thing I'll note as a definite problem
with the current system is that a lot of this
happens in private email, and it's hard to share
so that everyone has a good picture of what's
going on. It makes miscommunications all too
easy. Last time I spoke with Galina, we were
tentatively planning to start using github issues
for bot operation matters to address that, but as
that was in the middle of the transition from
bugzilla, we deferred and haven't gotten back to
that yet.<o:p></o:p></p>
<p>p.p.s. The bot in question is <a
href="https://lab.llvm.org/buildbot/#/builders/83"
target="_blank" moz-do-not-send="true">
https://lab.llvm.org/buildbot/#/builders/83</a>
if folks want to examine the history themselves.
<o:p></o:p></p>
<div>
<p class="MsoNormal">On 1/8/22 12:06 PM, Stella
Stamenova via llvm-dev wrote:<o:p></o:p></p>
</div>
<blockquote
style="margin-top:5.0pt;margin-bottom:5.0pt">
<div>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Hey
all,<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I
have a couple of questions about what the
responsibilities of a buildbot owner are. I’ve
been maintaining a couple of buildbots for
lldb and mlir for some time now and I thought
I had a pretty good idea of what is required
based on the documentation here: <a
href="https://www.llvm.org/docs/HowToAddABuilder.html"
target="_blank" moz-do-not-send="true">
How To Add Your Build Configuration To LLVM
Buildbot Infrastructure — LLVM 13
documentation</a><o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">My
understanding was that there are some things
that are *<b>expected</b>* of the owner.
Namely:<o:p></o:p></p>
<ol type="1" start="1">
<li class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l1
level1 lfo6">
Make sure that the buildbot is connected and
has the right infrastructure (e.g. the right
version of Python, or tools, etc.). Update
as needed.<o:p></o:p></li>
<li class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l1
level1 lfo6">
Make sure that the build configuration is
one that is supported (e.g. supported flavor
or cmake variables). Update as needed.<o:p></o:p></li>
</ol>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">There
are also a couple of things that are *<b>optional</b>*,
but nice to have:<o:p></o:p></p>
<ol type="1" start="3">
<li class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l5
level1 lfo9">
If the buildbot stays red for a while (where
“a while” is completely subjective), figure
out the patch or patches that are causing an
issue and either revert them or notify the
authors, so they can take action.<o:p></o:p></li>
<li class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l5
level1 lfo9">
If someone is having trouble investigating a
failure that only happens on the buildbot
(or the buildbot is a rare configuration),
help them out (e.g. collect logs if
possible).<o:p></o:p></li>
</ol>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Up
to now, I’ve not had any issues with this and
the community has been very good at fixing
issues with builds and tests when I point them
out, or more often than not, without me having
to do anything but the occasional test re-run
and software update (like this one, for
example,
<a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Freviews.llvm.org%2FD114639&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=ppf4tXWRAK7cf68FMTvaZqIQhkCelgDJKOrkbrhUST4%3D&reserved=0"
target="_blank" moz-do-not-send="true">
<span style="font-family:'Segoe UI Emoji',sans-serif">⚙</span> D114639
Raise the minimum Visual Studio version to
VS2019 (llvm.org)</a>). lldb has some tests
that are flaky because of the nature of the
product, so there is some noise, but mostly
things work well and everyone seems happy.<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’ve
recently run into a situation that makes me
wonder whether there are other expectations of
a buildbot owner that are not explicitly
listed in the LLVM documentation. Some time ago,
someone reached out to tell me they were unhappy
with the flakiness of some of the lldb tests and
demanded that I either fix them or disable them.
I let them know that
there are some tests that are known to be
flaky, that my expectation is that it is not
my responsibility to fix all such issues and
that the community would be very happy to have
their contribution in the form of a fix or a
change to disable the tests. I didn’t get a
response from this person, but I did disable a
couple of particularly flaky tests since it
seemed like the nice thing to do.<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The
real excitement happened yesterday when I
received an email that *<b>the build bot had
been turned off</b>*. This same person
reached out to the powers that be (without
letting me know) and asked them explicitly to
silence it *<b>without my active involvement</b>*
because of the flakiness.<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I
have a couple of issues with this approach, but
perhaps I’ve misunderstood what my
responsibilities are as the buildbot owner. I
know it is frustrating to see a bot fail
because of flaky tests and it is nice to have
someone to ask to resolve them all – is that
really the expectation of a buildbot owner?
Where is the line between maintenance of the
bot and fixing build and test issues for the
community?<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’d
like to understand what the general
expectations are and if there are things
missing from the documentation, I propose that
we add them, so that it is clear for everyone
what is required.<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Thanks,<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">-Stella<o:p></o:p></p>
<p class="MsoNormal"
style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
</div>
<p class="MsoNormal"><o:p> </o:p></p>
<pre>_______________________________________________<o:p></o:p></pre>
<pre>LLVM Developers mailing list<o:p></o:p></pre>
<pre><a href="mailto:llvm-dev@lists.llvm.org" target="_blank" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>
<pre><a href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.llvm.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fllvm-dev&data=04%7C01%7Cstilis%40microsoft.com%7C145340bbb1db4977407708d9d665bf44%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637776554786765265%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=bBQTLiLtm0yM1tpg06K3l%2Fgc3qCKN7PYKJywOIsw61I%3D&reserved=0" target="_blank" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>