<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 1/13/22 1:41 PM, Stella Stamenova

      wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">

      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

      <meta name="Generator" content="Microsoft Word 15 (filtered

        medium)">

      <style>@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face

        {font-family:"Segoe UI Emoji";

        panose-1:2 11 5 2 4 2 4 2 2 3;}@font-face

        {font-family:Consolas;

        panose-1:2 11 6 9 2 2 4 3 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}pre

        {mso-style-priority:99;

        mso-style-link:"HTML Preformatted Char";

        margin:0in;

        font-size:10.0pt;

        font-family:"Courier New";}p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph

        {mso-style-priority:34;

        margin-top:0in;

        margin-right:0in;

        margin-bottom:0in;

        margin-left:.5in;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}span.HTMLPreformattedChar

        {mso-style-name:"HTML Preformatted Char";

        mso-style-priority:99;

        mso-style-link:"HTML Preformatted";

        font-family:Consolas;}span.EmailStyle21

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:windowtext;}.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;

        font-family:"Calibri",sans-serif;}div.WordSection1

        {page:WordSection1;}ol

        {margin-bottom:0in;}ul

        {margin-bottom:0in;}</style>

      <div class="WordSection1">

        <p class="MsoNormal">There are a couple of things on this thread

          that sound nice in general, but have not been clarified either

          in the discussion or in the documentation. Since the devil is

          in the details, I’d like to see us agree on the details and

          then have them added to the documentation.<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal"><b><i>At the end of the day, there should

              be no surprises in the process and everything that can be

              should be quantified.<o:p></o:p></i></b></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">We want to encourage people to be

          responsible code and buildbot owners, not discourage them from

          contributing at all.<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">> It is expected that buildbot owners

          own bots which are reliable, informative and helpful to the

          community.<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">In my experience, every buildbot has

          occasional “flakiness” – be it because of code failures that

          don’t happen every time or because of connectivity issues,

          etc. Some bots are also often broken not because of any

          flakiness, but because with the large number of commits, there

          are bound to be failures.<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">So what makes a bot not reliable enough?

          Some percentage of builds failing? Some percentage of false

          positives? Does it vary per project or is there a single

          expectation for all of llvm?<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">I think it makes sense to say that false

          positives above a certain threshold make a buildbot not

          reliable enough and the threshold should be documented. It

          also makes sense to say that failures above a certain

          threshold make a bot not reliable enough – if the codebase is

          fragile enough that most commits cause breaks, it is possible

          that a reliable buildbot for it cannot exist.</p>

      </div>

    </blockquote>

    <p>This is a hard thing to specify, but I'm going to take a shot at

      some draft wording.</p>

    <p>We generally expect that publicly notifying builders are stable -

      meaning they do not report failures unless those failures are

      related to the commit being built.  Note that our requirement here

      is specific to notification, not the existence of the builder on

      the waterfall.  <br>

    </p>

    <p>In general, we expect a buildbot to report, on average, no more

      than one false-positive failure per day.   We will

      sometimes allow bots with higher failure rates due to special

      circumstances - e.g. unstable hardware combined with limited

      hardware availability for a platform - but these exceptions are

      just that: exceptions.  They need to be widely discussed before

      such a bot is allowed to notify, and the build config must make it

      apparent to casual users that the bot may be unstable.  <br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal"><o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">>  "someone" needs to take the

          responsibility to backstop the normal revert to green process.<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">As Mehdi pointed out earlier, the root

          cause of the failure might mean that the buildbot owner or

          that a code owner is better suited to addressing it. Philip’s

          argument is that at the end of the day, it is always the

          buildbot owner if a code owner hasn’t come forward. It makes

          sense to have someone who is ultimately responsible and it

          also makes sense that everyone needs to be given time and

          notice to act on the failures.

          <o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">There has also been some mention of

          different ways to “silence” a buildbot – either by turning it

          off entirely and waiting for a bot owner to reconnect it to

          staging or production, or by tagging it as “silent”. In my

          experience, there’s a huge difference between using the

          “silent” tag and turning a bot off. In the first case, the bot

          owners will continue to receive notifications and the builds

          will continue to run. Even if the bot is red already, there’s

          some chance that new commits that add breaks will be possible

          to figure out by looking at the logs either by other

          interested parties, or by the bot owners themselves. When a

          bot is turned off for any period of time, there’s nothing that

          can be used to determine when new failures were checked in

          (aside from local builds, so many local builds) and it can be

          incredibly painful to track down<b><i>. I think bots should

              only be forcefully turned off very rarely and when nothing

              else can be done and with plenty of notice.</i></b></p>

      </div>

    </blockquote>

    <p>I completely agree.  Up until this thread, I was not aware of an

      option to silence a buildbot on the main builder.  In fact, it

      looks like that mechanism <a moz-do-not-send="true"

href="https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826">only

        exists as of the 8th of this month</a>.  Now that we have it, we

      should definitely use it in preference to disabling a bot entirely.  <br>

    </p>

    <p>This needs to be integrated into the docs.  I'll take that action

      item.  <br>

    </p>
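    <p>To make the difference between silencing and disabling concrete,
      here is a minimal sketch.  It assumes a builder is a plain dict
      with a <code>tags</code> list; the record shape, the builder names
      and the <code>notification_recipients</code> helper are
      hypothetical and are not taken from llvm-zorg.  Only the behaviour
      comes from this thread: a "silent" builder keeps building and keeps
      notifying its owners, but stops mailing the blame list.</p>

```python
# Hypothetical builder records; the real llvm-zorg configuration may be
# shaped differently.
BUILDERS = [
    {"name": "example-stable-builder", "tags": ["clang"]},
    {"name": "example-flaky-builder", "tags": ["lldb", "silent"]},
]


def notification_recipients(builder, blame_list, owners):
    """Return who is mailed about a failed build on this builder."""
    # Owners always hear about their own bot, silent or not.
    recipients = set(owners)
    # Only non-silent builders also mail the authors on the blame list.
    if "silent" not in builder.get("tags", []):
        recipients.update(blame_list)
    return sorted(recipients)
```

    <p>In this sketch, un-silencing a bot is just removing the tag, which
      is what makes it cheap compared with disconnecting the bot and
      losing its build history.</p>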

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal"><o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">So then, what is the flow when a bot starts

          having issues? I would propose that it be something like this:<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <ol style="margin-top:0in" type="1" start="1">

          <li class="MsoListParagraph"

            style="margin-left:0in;mso-list:l4 level1 lfo3">Code owners

            have to address issues in X amount of time.<o:p></o:p></li>

          <li class="MsoListParagraph"

            style="margin-left:0in;mso-list:l4 level1 lfo3">If the code

            owners have failed to address the situation, it falls to the

            buildbot owners. Perhaps at the beginning or in the middle

            of this period, the bot owners get an email that says: “Hey,

            so and so, we’re close to tagging the bot “silent”, can you

            have a look?”<o:p></o:p></li>

          <li class="MsoListParagraph"

            style="margin-left:0in;mso-list:l4 level1 lfo3">If both the

            code owners and the buildbot owners have failed to address

            the situation, the bot gets tagged “silent”. The buildbot

            owner gets notified that this happened and the notification

            spells out how much longer they have before the bot gets

            turned off.<o:p></o:p></li>

          <li class="MsoListParagraph"

            style="margin-left:0in;mso-list:l4 level1 lfo3">If both the

            code owners and the buildbot owners have failed to address

            the situation for some time longer, the bot gets turned off.<o:p></o:p></li>

        </ol>

        <p class="MsoListParagraph"><o:p> </o:p></p>

        <p class="MsoNormal">Each of these steps should be allowed a

          pre-determined amount of time. A few hours? A few days?

          Ideally, each of the transitions (but definitely

          2->3->4) come with notifications. If it was possible for

          a bot to be moved to staging automatically, we could even have

          an extra step where it gets moved to staging before it gets

          turned off. I don’t think that’s currently possible though.</p>

      </div>

    </blockquote>

    <p>Now that we have a silence mechanism, I think we can split our

      policy into two pieces.</p>

    <p>Part 1 - When do we silence a bot</p>

    <p>Part 2 - When do we disable a bot</p>

    <p>I think we can afford to have a long and involved process for

      part 2.  Once a bot is silenced, it doesn't cost much to keep

      around, and we basically only need to handle the abandoned bot

      problem.</p>

    <p>The majority of our focus can be on when we silence a bot.  Here

      I would argue pretty strongly for a different default: we should

      silence and un-silence bots cheaply.</p>

    <p>Here's some suggested wording:</p>

    <p>If you believe a bot to be unstable, please file a GitHub issue

      describing the situation.  Please either add the bot owner as the

      assignee or email the bot owner directly.  If the instability is

      frequent - say more than 1 build in 10 - please send a change for

      review which silences the builder.</p>

    <p>As a bot owner, you are expected to address reported

      instability.  If you can't do so promptly, please silence the

      bot.  Once you're ready to unsilence the bot, post a change for

      review which does so and describes the action taken to stabilize

      the bot.  <br>

    </p>

    <p>(Obviously, this needs to be expanded a bit.)<br>

    </p>
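    <p>As a rough sketch of how the "more than 1 build in 10" wording
      could be applied, assume each recent build has been triaged by hand
      into "failure related to its commits" or not.  The
      <code>Build</code> record, its <code>related_to_commit</code> field
      and the returned strings are hypothetical; the only number taken
      from the draft wording is the 1-in-10 ratio.</p>

```python
from dataclasses import dataclass


@dataclass
class Build:
    number: int
    failed: bool
    # Hypothetical triage result: True when the failure was caused by
    # the commits in the build rather than by the bot itself.
    related_to_commit: bool = False


def spurious_failures(builds):
    """Count failures that were not caused by the commits being built."""
    return sum(1 for b in builds if b.failed and not b.related_to_commit)


def suggested_action(builds):
    """Map the draft wording onto a recommendation for a reporter."""
    spurious = spurious_failures(builds)
    if spurious == 0:
        return "none"
    # "More than 1 build in 10", compared with integers to avoid
    # floating-point rounding at the boundary.
    if spurious * 10 > len(builds):
        return "file issue and propose silencing"
    return "file issue"
```

    <p>For example, 3 unrelated failures in the last 20 builds would
      suggest proposing a silencing change, while 2 in 20 would only
      suggest filing an issue.</p>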

    <blockquote type="cite"

cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal"><o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">> The main problem with flaky tests is

          random false blames. People get annoyed and stop paying

          attention to failures on a particular builder, and other

          builders as well, arguing that build bot in general is not

          reliable.<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">Galina made a good point to me that people

          get annoyed by failures and stop paying attention to all

          buildbots. </p>

      </div>

    </blockquote>

    More immediately, people set up mail rules to ignore bots.  I know

    of multiple people who have these, and was on the edge of doing so

    myself.  This means that a bot which is spammy effectively only

    harasses new contributors which is, ah, less than ideal.  <br>

    <blockquote type="cite"

cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal">I can see how flaky tests/bots contribute

          to the general ignoring of the buildbots, but I would argue

          that the root cause is the sheer volume of build breaks that

          are not the fault of a committer. The few times I’ve made

          commits to llvm, for example, I’ve

          <b>always</b> gotten at least one email about a break that was

          unrelated to my change (because my changes are perfect, thank

          you very much). This larger problem of build breaks is much

          harder to address than flaky bots or tests, but I think would

          improve the health of llvm & friends significantly more

          (and in the meantime, we could tolerate some “flakiness”).</p>

      </div>

    </blockquote>

    <p>I will note that Galina and I have been actively working on

      attempts to stabilize our existing infrastructure.  There's active

      work on trying to add mechanisms (e.g. silencing, staged builders,

      and maximum batch sizes) to cut down on the problem.  Please don't

      let "it's hard" become an argument that we should ignore the

      problem.</p>

    <p>Also, while yes, many of our failures are caused by bad changes, I

      think these make up a minority of all failure notices.  I have not

      checked anything other than my own trash folder, but that's certainly what

      I see.  The biggest contributors are blatantly unstable bots and

      unreasonably slow batched builders.  <br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite"

cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">

      <div class="WordSection1">

        <p class="MsoNormal"><o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal">Thanks,<o:p></o:p></p>

        <p class="MsoNormal">-Stella<o:p></o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div style="border:none;border-top:solid #E1E1E1

          1.0pt;padding:3.0pt 0in 0in 0in">

          <p class="MsoNormal"><b>From:</b> llvm-dev

            <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev-bounces@lists.llvm.org">&lt;llvm-dev-bounces@lists.llvm.org&gt;</a> <b>On Behalf Of

            </b>Galina Kistanova via llvm-dev<br>

            <b>Sent:</b> Wednesday, January 12, 2022 11:24 PM<br>

            <b>To:</b> Mehdi AMINI <a class="moz-txt-link-rfc2396E" href="mailto:joker.eph@gmail.com">&lt;joker.eph@gmail.com&gt;</a><br>

            <b>Cc:</b> llvm-dev <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev@lists.llvm.org">&lt;llvm-dev@lists.llvm.org&gt;</a><br>

            <b>Subject:</b> [EXTERNAL] Re: [llvm-dev] Responsibilities

            of a buildbot owner<o:p></o:p></p>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <div>

            <p class="MsoNormal">> We may also use this on flaky bots

              in the future?<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Yes, we may.<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Or we may try to do our best to fix

              them. :)<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Moving workers to the staging

              temporarily to investigate and address an issue is fine.

              Gives a bit more elbow room for experimenting, as we can

              apply experimental patches there, restart the staging as

              needed and often, and so on. Which is not the case with

              the production. It does not take much effort to move a

              worker between the staging and the production areas - a

              simple edit of the buildbot.tac file and a worker restart.<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Tagging a builder "silent" means there

              is a designated person or a team who is actively fixing

              the detected issues or acting as a proxy to handle the

              blame list. This could be a way to deal with flaky bots,

              indeed, assuming there is somebody taking care of those

              builders, not just a way to skip the annoyance and keep

              the status quo.<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">By the way, thanks everyone for the

              constructive and polite discussion! It seems we are going

              to have a more stable and informative Windows LLDB

              builder.<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Galina<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <p class="MsoNormal">On Wed, Jan 12, 2022 at 9:19 PM Mehdi

            AMINI <<a href="mailto:joker.eph@gmail.com"

              target="_blank" moz-do-not-send="true">joker.eph@gmail.com</a>>

            wrote:<o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal"><o:p> </o:p></p>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <p class="MsoNormal">On Wed, Jan 12, 2022 at 7:33 PM Galina

            Kistanova <<a href="mailto:gkistanova@gmail.com"

              target="_blank" moz-do-not-send="true">gkistanova@gmail.com</a>>

            wrote:<o:p></o:p></p>

        </div>

        <blockquote style="border:none;border-left:solid #CCCCCC

          1.0pt;padding:0in 0in 0in

6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

          <div>

            <div>

              <p class="MsoNormal">Hello everyone,<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">In continuation of the

                Responsibilities of a buildbot owner thread.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">First of all, thank you very much for

                being buildbot owners! This is much appreciated.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Thank you for bringing good points to

                the discussion.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">It is expected that buildbot owners

                own bots which are reliable, informative and helpful to

                the community.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Effectively that means if a problem

                is detected by a builder and it is hard to pinpoint the

                reason of the issue and a commit to blame, a buildbot

                owner is natively on the escalation path. Someone has to

                get to the root of the problem and fix it one way or

                another (by reverting the commit, or by proposing a

                patch, or by working with the author of the commit which

                introduced the issue). In the majority of the cases

                someone takes care of an issue. But sometimes it takes a

                buildbot owner to push. Every buildbot owner does this

                from time to time.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">Hi Mehdi,<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">> Something quite annoying with

                staging is that it does not have (as far as I know) a

                way<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal">> to continue to notify the

                buildbot owner.<o:p></o:p></p>

            </div>

            <div>

              <p class="MsoNormal"><o:p> </o:p></p>

            </div>

            <div>

              <p class="MsoNormal">You mentioned this recently in one of

                the reviews. With <a

href="https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826"

                  target="_blank" moz-do-not-send="true">

https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826</a>

                in place, you can add the tag "silent" to your

                production builder, and it will not send notifications

                to the blame list. You can set the exact notifications

                you want in the master/config/status.py for that

                builder. Hope this helps you.<o:p></o:p></p>

            </div>

          </div>

        </blockquote>

        <div>

          <p class="MsoNormal"><o:p> </o:p></p>

        </div>

        <div>

          <p class="MsoNormal">Fantastic! I'll use this for the next

            steps for my bots (when I get back to it, I slacked on this

            recently...) :)<o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal"><o:p> </o:p></p>

        </div>

        <div>

          <p class="MsoNormal">We may also use this on flaky bots in the

            future?<o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal"><o:p> </o:p></p>

        </div>

        <div>

          <p class="MsoNormal">Thanks,<o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal"><o:p> </o:p></p>

        </div>

        <div>

          <p class="MsoNormal">-- <o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal">Mehdi <o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal"><o:p> </o:p></p>

        </div>

        <div>

          <div>

            <p class="MsoNormal">I do not want to have the staging even

              able to send emails. We debug and test many things there,

              including notifications, and there is always a risk of

              spam.<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Thanks<o:p></o:p></p>

          </div>

          <div>

            <p class="MsoNormal"><o:p> </o:p></p>

          </div>

          <div>

            <p class="MsoNormal">Galina<o:p></o:p></p>

          </div>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <p class="MsoNormal">On Sun, Jan 9, 2022 at 6:07 PM David

            Blaikie via llvm-dev <<a

              href="mailto:llvm-dev@lists.llvm.org" target="_blank"

              moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

            wrote:<o:p></o:p></p>

        </div>

        <div>

          <p class="MsoNormal">+1 to most of what Mehdi's said here -

            I'd love to see improvements in stability, though probably

            having some rigid delegation of responsibility (rather than

            relying on developers to judge whether it's a flaky test or

            flaky bot - that isn't always obvious, maybe it's only flaky

            on a particular configuration that that buildbot happens to

            test and the developer doesn't have access to - then which

            is it?) might help (eg: if it's at all unclear, then the

            assumption is that it's always the test or always the

            buildbot owner - and an expectation that the author or owner

            then takes responsibility for working with the other party

            to address the issue, etc).<br>

            <br>

            That all said, disabling individual tests may risk no one

            caring enough to re-enable them, especially when the

            flakiness is found long after the change is made that

            introduced the test or flakiness (usually the case with

            flakiness - it takes a while to become apparent) - I don't

            really know how to address that issue. The "convenience"

            with disabling a buildbot is that there's other value to the

            buildbot (other than the flaky test that was providing

            negative value), so buildbot owners have more motivation to

            get the bot back online - though I don't want to burden

            buildbot owners unduly either (because they'd eventually

            give up on them) :/ <br>

            <br>

            - Dave<o:p></o:p></p>

        </div>

        <p class="MsoNormal"><o:p> </o:p></p>

        <div>

          <div>

            <p class="MsoNormal">On Sat, Jan 8, 2022 at 5:15 PM Mehdi

              AMINI via llvm-dev <<a

                href="mailto:llvm-dev@lists.llvm.org" target="_blank"

                moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

              wrote:<o:p></o:p></p>

          </div>

          <blockquote style="border:none;border-left:solid #CCCCCC

            1.0pt;padding:0in 0in 0in

6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

            <div>

              <div>

                <p class="MsoNormal">Hi,<o:p></o:p></p>

              </div>

              <div>

                <p class="MsoNormal"><o:p> </o:p></p>

              </div>

              <div>

                <p class="MsoNormal">First: thanks a lot Stella for

                  being a bot owner and providing valuable resources to

                  the community. The sequence of events is really

                  unfortunate here, and thank you for bringing it up to

                  everyone's attention, let's try to improve our

                  processes.<o:p></o:p></p>

              </div>

              <p class="MsoNormal"><o:p> </o:p></p>

              <div>

                <div>

                  <p class="MsoNormal">On Sat, Jan 8, 2022 at 1:01 PM

                    Philip Reames via llvm-dev <<a

                      href="mailto:llvm-dev@lists.llvm.org"

                      target="_blank" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>

                    wrote:<o:p></o:p></p>

                </div>

                <blockquote style="border:none;border-left:solid #CCCCCC

                  1.0pt;padding:0in 0in 0in

6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

                  <div>

                    <p>Stella,<o:p></o:p></p>

                    <p>Thank you for raising the question.  This is a

                      great discussion for us to have publicly.<o:p></o:p></p>

                    <p>So folks know, I am the individual Stella

                      mentioned below.  I'll start with a bit of history

                      so that everyone's on the same page, then dive

                      into the policy question.<o:p></o:p></p>

                    <p>My general take is that buildbots are only useful

                      if failure notifications are generally

                      actionable.  A couple months back, I was on the

                      edge of setting up mail filter rules to

                      auto-delete a bunch of bots because they were

                      regularly broken, and decided I should try to be

                      constructive first.  In the first wave of that, I

                      emailed a couple of bot owners about things which

                      seemed like false positives. 

                      <o:p></o:p></p>

                    <p>At the time, I thought it was the bot owner's

                      responsibility to not be testing a flaky

                      configuration.  I got a bit of push back on that

                      from a couple sources - Stella was one - and put

                      that question on hold.  This thread is a great

                      opportunity to decide what our policy actually is,

                      and document it.  <o:p></o:p></p>

                    <p>In the meantime, I've been working with Galina to

                      document existing practice where we could, and to

                      try to identify best practices on setting up

                      bots.  These changes have been posted publicly,

                      and reviewed through the normal process.  We've

                      been deliberately trying to stick to

                      non-controversial stuff as we got the docs

                      improved.  I've been actively reaching out to bot

                      owners to gather feedback in this process, but

                      Stella had not, yet, been one.

                      <o:p></o:p></p>

                    <p>Separately, this week I noticed a bot which was

                      repeatedly toggling between red and green.  I

                      forget the exact ratio, but in the recent build

                      history, there were multiple transitions,

                      seemingly unrelated to the changes being

                      committed.  I emailed Galina asking her to

                      address it, and she removed the buildbot until it

                      could be moved to the staging buildmaster,

                      addressed, and then restored.  I left Stella off

                      the initial email.  Sorry about that, no ill

                      intent, just written in a hurry. 

                      <o:p></o:p></p>

                    <p>Now, transitioning into a bit of policy

                      discussion...<o:p></o:p></p>

                    <p>From my conversations with existing bot owners,

                      there is a general agreement that bots should only

                      be notifying the community if they are stable

                      enough.  There's honest disagreement on what the

                      bar for stable enough is, and disagreement about

                      exactly whose responsibility addressing new

                      instability is.  (To be clear, I'd separate

                      instability from a clear deterministic breakage

                      caused by a commit - we have a lot more agreement

                      on that.)<o:p></o:p></p>

                    <p>My personal take is that for a bot to be publicly

                      notifying, "someone" needs to take the

                      responsibility to backstop the normal

                      revert-to-green process.  This "someone" can be developers

                      who work in a particular area, the bot owner, or

                      some combination thereof.  I view the bot

                      config owner as the person responsible for

                      making sure that

                      backstopping is happening.  Not necessarily by

                      doing it themselves, but by having the contacts

                      with developers who can, and following up when the

                      normal flow is not working.<o:p></o:p></p>

                    <p>In this particular example, we appear to have a

                      bunch of flaky lldb tests.  I personally know

                      absolutely nothing about lldb.  I have no idea

                      whether the tests are badly designed, the system

                      they're being run on isn't yet supported by lldb,

                      or if there's some recent code bug introduced

                      which causes the failure.  "Someone" needs to take

                      the responsibility of figuring that out, and in

                      the meantime spamming developers with inactionable

                      failure notices seems undesirable. 

                      <o:p></o:p></p>

                  </div>

                </blockquote>

                <div>

                  <p class="MsoNormal"><o:p> </o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">I generally agree with the

                    overall sentiment. I would add that something worth

                    differentiating is that the source of flakiness can

                    be coming from the bot itself (flaky hardware /

                    fragile setup), or from the test/codebase itself (a

                    flaky bot may just be a deterministic ASAN failure).<o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">Of course from Philip's point of

                    view it does not matter: the effect on the developer

                    is similar, we get undesirable and unactionable

                    notifications. From the maintenance flow however, it

                    matters in that the "someone" who has to take

                    responsibility is often not the same group of folks.<o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">Also when encountering flaky

                    tests, the best action may not be to disable the bot

                    itself but instead to disable the flaky test (and

                    file a bug against the test owner).<o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal"><o:p> </o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">One more dimension that seems to

                    surface here may be different practices or

                    expectations across subprojects, for example here

                    the LLDB folks may be used to having some flaky

                    tests, but they trigger on changes to LLVM itself,

                    where we may not expect any flakiness (or so).<o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal"> <o:p></o:p></p>

                </div>

                <blockquote style="border:none;border-left:solid #CCCCCC

                  1.0pt;padding:0in 0in 0in

6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

                  <div>

                    <p>For context, the bot was disabled until it could

                      be moved to the staging buildmaster.  Moving to

                      staging is required (currently) to disable

                      developer notification.  In the email from Galina,

                      it seems clear that the bot would be fine to move

                      back to production once the issue was triaged. 

                      This seems entirely reasonable to me.  <o:p></o:p></p>

                  </div>

                </blockquote>

                <div>

                  <p class="MsoNormal"><o:p> </o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">Something quite annoying with

                    staging is that it does not have (as far as I know)

                    a way to continue to notify the buildbot owner. I

                    don't really care about staging vs prod as much as

                    having a mode to just "not notify the blame list" /

                    "only notify the owner".<o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal"><o:p> </o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">-- <o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal">Mehdi<o:p></o:p></p>

                </div>

                <div>

                  <p class="MsoNormal"><o:p> </o:p></p>

                </div>

                <div>

                  <p class="MsoNormal"> <o:p></o:p></p>

                </div>

                <blockquote style="border:none;border-left:solid #CCCCCC

                  1.0pt;padding:0in 0in 0in

6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

                  <div>

                    <p>Philip<o:p></o:p></p>

                    <p>p.s. One thing I'll note as a definite problem

                      with the current system is that a lot of this

                      happens in private email, and it's hard to share

                      so that everyone has a good picture of what's

                      going on.  It makes miscommunications all too

                      easy.  Last time I spoke with Galina, we were

                      tentatively planning to start using GitHub issues

                      for bot operation matters to address that, but as

                      that was in the middle of the transition from

                      Bugzilla, we deferred and haven't gotten back to

                      that yet.<o:p></o:p></p>

                    <p>p.p.s. The bot in question is <a

href="https://lab.llvm.org/buildbot/#/builders/83"

                        target="_blank" moz-do-not-send="true">

                        https://lab.llvm.org/buildbot/#/builders/83</a>

                      if folks want to examine the history themselves. 

                      <o:p></o:p></p>

                    <div>

                      <p class="MsoNormal">On 1/8/22 12:06 PM, Stella

                        Stamenova via llvm-dev wrote:<o:p></o:p></p>

                    </div>

                    <blockquote

                      style="margin-top:5.0pt;margin-bottom:5.0pt">

                      <div>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Hey

                          all,<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I

                          have a couple of questions about what the

                          responsibilities of a buildbot owner are. I’ve

                          been maintaining a couple of buildbots for

                          lldb and mlir for some time now and I thought

                          I had a pretty good idea of what is required

                          based on the documentation here: <a

href="https://www.llvm.org/docs/HowToAddABuilder.html"

                            target="_blank" moz-do-not-send="true">

                            How To Add Your Build Configuration To LLVM

                            Buildbot Infrastructure — LLVM 13

                            documentation</a><o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">My

                          understanding was that there are some things

                          that are *<b>expected</b>* of the owner.

                          Namely:<o:p></o:p></p>

                        <ol type="1" start="1">

                          <li class="MsoNormal"

                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l1

                            level1 lfo6">

                            Make sure that the buildbot is connected and

                            has the right infrastructure (e.g. the right

                            version of Python, or tools, etc.). Update

                            as needed.<o:p></o:p></li>

                          <li class="MsoNormal"

                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l1

                            level1 lfo6">

                            Make sure that the build configuration is

                            one that is supported (e.g. supported flavor

                            or cmake variables). Update as needed.<o:p></o:p></li>

                        </ol>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">There

                          are also a couple of things that are *<b>optional</b>*,

                          but nice to have:<o:p></o:p></p>

                        <ol type="1" start="3">

                          <li class="MsoNormal"

                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l5

                            level1 lfo9">

                            If the buildbot stays red for a while (where

                            “a while” is completely subjective), figure

                            out the patch or patches that are causing an

                            issue and either revert them or notify the

                            authors, so they can take action.<o:p></o:p></li>

                          <li class="MsoNormal"

                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l5

                            level1 lfo9">

                            If someone is having trouble investigating a

                            failure that only happens on the buildbot

                            (or the buildbot is a rare configuration),

                            help them out (e.g. collect logs if

                            possible).<o:p></o:p></li>

                        </ol>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Up

                          to now, I’ve not had any issues with this and

                          the community has been very good at fixing

                          issues with builds and tests when I point them

                          out, or more often than not, without me having

                          to do anything but the occasional test re-run

                          and software update (like this one, for

                          example,

                          <a

href="https://reviews.llvm.org/D114639"

                            target="_blank" moz-do-not-send="true">

                            <span style="font-family:'Segoe UI Emoji',sans-serif">⚙</span> D114639

                            Raise the minimum Visual Studio version to

                            VS2019 (llvm.org)</a>). lldb has some tests

                          that are flaky because of the nature of the

                          product, so there is some noise, but mostly

                          things work well and everyone seems happy.<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’ve

                          recently run into a situation that makes me

                          wonder whether there are other expectations of

                          a buildbot owner that are not explicitly

                          listed in the llvm documentation. Someone

                          reached out to me some time ago to let me know

                          their unhappiness at the flakiness of some of

                          the lldb tests and demanded that I either fix

                          them or disable them. I let them know that

                          there are some tests that are known to be

                          flaky, that my expectation is that it is not

                          my responsibility to fix all such issues and

                          that the community would be very happy to have

                          their contribution in the form of a fix or a

                          change to disable the tests. I didn’t get a

                          response from this person, but I did disable a

                          couple of particularly flaky tests since it

                          seemed like the nice thing to do.<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The

                          real excitement happened yesterday when I

                          received an email that *<b>the build bot had

                            been turned off</b>*. This same person

                          reached out to the powers that be (without

                          letting me know) and asked them explicitly to

                          silence it *<b>without my active involvement</b>*

                          because of the flakiness.<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I

                          have a couple of issues with this approach but

                          perhaps I’ve misunderstood what my

                          responsibilities are as the buildbot owner. I

                          know it is frustrating to see a bot fail

                          because of flaky tests and it is nice to have

                          someone to ask to resolve them all – is that

                          really the expectation of a buildbot owner?

                          Where is the line between maintenance of the

                          bot and fixing build and test issues for the

                          community?<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’d

                          like to understand what the general

                          expectations are and if there are things

                          missing from the documentation, I propose that

                          we add them, so that it is clear for everyone

                          what is required.<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Thanks,<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">-Stella<o:p></o:p></p>

                        <p class="MsoNormal"

                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>

                      </div>

                      <p class="MsoNormal"><o:p> </o:p></p>

                      <pre>_______________________________________________<o:p></o:p></pre>

                      <pre>LLVM Developers mailing list<o:p></o:p></pre>

                      <pre><a href="mailto:llvm-dev@lists.llvm.org" target="_blank" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>

                      <pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>

                    </blockquote>

                  </div>


                </blockquote>

              </div>

            </div>


          </blockquote>

        </div>


      </div>

    </blockquote>

  </body>

</html>