<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 1/13/22 1:41 PM, Stella Stamenova
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style>@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}@font-face
        {font-family:"Segoe UI Emoji";
        panose-1:2 11 5 2 4 2 4 2 2 3;}@font-face
        {font-family:Consolas;
        panose-1:2 11 6 9 2 2 4 3 2 4;}p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}pre
        {mso-style-priority:99;
        mso-style-link:"HTML Preformatted Char";
        margin:0in;
        font-size:10.0pt;
        font-family:"Courier New";}p.MsoListParagraph, li.MsoListParagraph, div.MsoListParagraph
        {mso-style-priority:34;
        margin-top:0in;
        margin-right:0in;
        margin-bottom:0in;
        margin-left:.5in;
        font-size:11.0pt;
        font-family:"Calibri",sans-serif;}span.HTMLPreformattedChar
        {mso-style-name:"HTML Preformatted Char";
        mso-style-priority:99;
        mso-style-link:"HTML Preformatted";
        font-family:Consolas;}span.EmailStyle21
        {mso-style-type:personal-reply;
        font-family:"Calibri",sans-serif;
        color:windowtext;}.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;
        font-family:"Calibri",sans-serif;}div.WordSection1
        {page:WordSection1;}ol
        {margin-bottom:0in;}ul
        {margin-bottom:0in;}</style>
      <div class="WordSection1">
        <p class="MsoNormal">There are a couple of things on this thread
          that sound nice in general, but have not been clarified either
          in the discussion or in the documentation. Since the devil is
          in the details, I’d like to see us agree on the details and
          then have them added to the documentation.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal"><b><i>At the end of the day, there should
              be no surprises in the process and everything that can be
              should be quantified.<o:p></o:p></i></b></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">We want to encourage people to be
          responsible code and buildbot owners, not discourage them from
          contributing at all.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">> It is expected that buildbot owners
          own bots which are reliable, informative and helpful to the
          community.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">In my experience, every buildbot has
          occasional “flakiness” – be it because of code failures that
          don’t happen every time or because of connectivity issues,
          etc. Some bots are also often broken not because of any
          flakiness, but because with the large number of commits, there
          are bound to be failures.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">So what makes a bot not reliable enough?
          Some percentage of builds failing? Some percentage of false
          positives? Does it vary per project or is there a single
          expectation for all of llvm?<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">I think it makes sense to say that false
          positives above a certain threshold make a buildbot not
          reliable enough and the threshold should be documented. It
          also makes sense to say that failures above a certain
          threshold make a bot not reliable enough – if the codebase is
          fragile enough that most commits cause breaks, it is possible
          that a reliable buildbot for it cannot exist.</p>
      </div>
    </blockquote>
    <p>This is a hard thing to specify, but I'm going to take a shot at
      some draft wording.</p>
    <p>We generally expect that publicly notifying builders are stable -
      meaning they do not report failures unless those failures are
      related to the commit being built.  Note that our requirement here
      is specific to notification, not the existence of the builder on
      the waterfall.  <br>
    </p>
    <p>In general, we expect a buildbot to report, on average, no more
      than one false positive failure per day.  We will
      sometimes allow bots with higher failure rates due to special
      circumstances - e.g. unstable hardware combined with limited
      hardware availability for a platform - but these exceptions are
      just that: exceptions.  They need to be widely discussed before
      such a bot is allowed to notify, and the build config must make it
      apparent to casual users that the bot may be unstable.  <br>
    </p>
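    <p>As a rough illustration of how the draft expectation above could
      be checked, the following minimal sketch computes an average
      false-positive rate over a time window.  The function names, the
      input format (a list of timestamps of failures already judged
      unrelated to the commits built) and the default limit parameter
      are assumptions made for this example; nothing like this exists
      in the buildbot configuration as far as this thread states.</p>

```python
from datetime import datetime


def false_positives_per_day(false_positive_times, window_start, window_end):
    """Average number of false-positive failure reports per day in a window.

    false_positive_times: datetimes of failures judged unrelated to the
    commits being built (how that judgement is made is out of scope here).
    """
    days = (window_end - window_start).total_seconds() / 86400.0
    if days <= 0:
        raise ValueError("window must have positive length")
    in_window = [t for t in false_positive_times
                 if window_start <= t and t < window_end]
    return len(in_window) / days


def meets_draft_expectation(rate_per_day, limit_per_day=1.0):
    """True if the rate is within the drafted 'one per day on average'."""
    return rate_per_day <= limit_per_day
```

    <p>Averaging over a window, rather than counting per calendar day,
      matches the "average of no more than one" phrasing: a bot with
      two false positives on one day and none for the rest of the week
      would still be within the expectation.</p>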
    <p><br>
    </p>
    <blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
      <div class="WordSection1">
        <p class="MsoNormal"><o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">>  "someone" needs to take the
          responsibility to backstop the normal revert to green process.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">As Mehdi pointed out earlier, the root
          cause of the failure might mean that the buildbot owner or
          that a code owner is better suited to addressing it. Philip’s
          argument is that at the end of the day, it is always the
          buildbot owner if a code owner hasn’t come forward. It makes
          sense to have someone who is ultimately responsible and it
          also makes sense that everyone needs to be given time and
          notice to act on the failures.
          <o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">There has also been some mention of
          different ways to “silence” a buildbot – either by turning it
          off entirely and waiting for a bot owner to reconnect it to
          staging or production, or by tagging it as “silent”. In my
          experience, there’s a huge difference between using the
          “silent” tag and turning a bot off. In the first case, the bot
          owners will continue to receive notifications and the builds
          will continue to run. Even if the bot is red already, there’s
          some chance that new commits that add breaks will be possible
          to figure out by looking at the logs either by other
          interested parties, or by the bot owners themselves. When a
          bot is turned off for any period of time, there’s nothing that
          can be used to determine when new failures were checked in
          (aside from local builds, so many local builds) and it can be
          incredibly painful to track down<b><i>. I think bots should
              only be forcefully turned off very rarely and when nothing
              else can be done and with plenty of notice.</i></b></p>
      </div>
    </blockquote>
    <p>I completely agree.  Up until this thread, I was not aware of an
      option to silence a buildbot on the main builder.  In fact, it
      looks like that mechanism <a moz-do-not-send="true"
href="https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826">only
        exists as of the 8th of this month</a>.  Now that we have it, we
      should definitely use it in favor of disabling a bot entirely.  <br>
    </p>
    <p>This needs to be integrated into the docs.  I'll take that action
      item.  <br>
    </p>
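    <p>To make the difference from disabling concrete, here is a small
      sketch of the idea behind the "silent" tag as described in the
      quoted messages: a tagged builder keeps building, but the blame
      list is not notified.  This is not the code from the linked
      commit.  The builder names, the dictionary layout and the
      function names below are hypothetical and chosen only for
      illustration.</p>

```python
# Hypothetical builder entries; names and fields are illustrative only.
builders = [
    {"name": "example-builder-a", "tags": ["clang"]},
    {"name": "example-builder-b", "tags": ["lldb", "silent"]},
]


def notifies_blamelist(builder):
    """A builder tagged "silent" still builds but does not email the blame list."""
    return "silent" not in builder.get("tags", [])


def builders_notifying_blamelist(all_builders):
    """Names of the builders whose failures would reach the blame list."""
    return [b["name"] for b in all_builders if notifies_blamelist(b)]
```

    <p>In this sketch, <code>builders_notifying_blamelist(builders)</code>
      returns only <code>example-builder-a</code>; the second builder
      stays in the list of builders and its build history remains
      available.</p>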
    <p><br>
    </p>
    <blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
      <div class="WordSection1">
        <p class="MsoNormal"><o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">So then, what is the flow when a bot starts
          having issues? I would propose that it be something like this:<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <ol style="margin-top:0in" type="1" start="1">
          <li class="MsoListParagraph"
            style="margin-left:0in;mso-list:l4 level1 lfo3">Code owners
            have to address issues in X amount of time.<o:p></o:p></li>
          <li class="MsoListParagraph"
            style="margin-left:0in;mso-list:l4 level1 lfo3">If the code
            owners have failed to address the situation, it falls to the
            buildbot owners. Perhaps at the beginning or in the middle
            of this period, the bot owners get an email that says: “Hey,
            so and so, we’re close to tagging the bot “silent”, can you
            have a look?”<o:p></o:p></li>
          <li class="MsoListParagraph"
            style="margin-left:0in;mso-list:l4 level1 lfo3">If both the
            code owners and the buildbot owners have failed to address
            the situation, the bot gets tagged “silent”. The buildbot
            owner gets notified that this happened and the notification
            spells out how much longer they have before the bot gets
            turned off.<o:p></o:p></li>
          <li class="MsoListParagraph"
            style="margin-left:0in;mso-list:l4 level1 lfo3">If both the
            code owners and the buildbot owners have failed to address
            the situation for some time longer, the bot gets turned off.<o:p></o:p></li>
        </ol>
        <p class="MsoListParagraph"><o:p> </o:p></p>
        <p class="MsoNormal">Each of these steps should be allowed a
          pre-determined amount of time. A few hours? A few days?
          Ideally, each of the transitions (but definitely
          2->3->4) come with notifications. If it was possible for
          a bot to be moved to staging automatically, we could even have
          an extra step where it gets moved to staging before it gets
          turned off. I don’t think that’s currently possible though.</p>
      </div>
    </blockquote>
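    <p>For reference, the four-step flow proposed in the quoted text
      can be sketched as a simple escalation ladder.  The time limits
      are deliberately left as parameters, since the thread has not
      settled on values; the class and function names are assumptions
      made for this example.</p>

```python
from enum import Enum


class Step(Enum):
    CODE_OWNERS = 1      # code owners have time to address the issue
    BUILDBOT_OWNERS = 2  # responsibility falls to the buildbot owners
    SILENT = 3           # bot is tagged "silent"
    TURNED_OFF = 4       # bot is turned off


def escalate(step):
    """Step that follows when the current step's time limit expires unresolved."""
    if step is Step.TURNED_OFF:
        return Step.TURNED_OFF
    return Step(step.value + 1)


def current_step(hours_unresolved, limits_hours):
    """Step reached after an issue has stayed unresolved for the given time.

    limits_hours: time allowed for steps 1, 2 and 3, in that order.
    The values are placeholders to be agreed on, not a proposal.
    """
    step = Step.CODE_OWNERS
    remaining = hours_unresolved
    for limit in limits_hours:
        if remaining < limit:
            break
        remaining -= limit
        step = escalate(step)
    return step
```

    <p>Each call to <code>escalate</code> is the natural place to send
      the notification the proposal asks for, in particular on the
      transitions from step 2 to 3 and from step 3 to 4.</p>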
    <p>Now that we have a silence mechanism, I think we can split our
      policy into two pieces.</p>
    <p>Part 1 - When do we silence a bot</p>
    <p>Part 2 - When do we disable a bot</p>
    <p>I think we can afford to have a long and involved process for
      part 2.  Once a bot is silenced, it doesn't have much cost to keep
      around, and we basically only need to handle the abandoned bot
      problem.</p>
    <p>The majority of our focus can be on when we silence a bot.  Here
      I would argue pretty strongly for a different default: we should
      silence and un-silence bots cheaply.</p>
    <p>Here's some suggested wording:</p>
    <p>If you believe a bot to be unstable, please file a GitHub issue
      describing the situation.  Please either add the bot owner as the
      assignee or email the bot owner directly.  If the instability is
      frequent - say more than 1 build in 10 - please send a change for
      review which silences the builder.</p>
    <p>As a bot owner, you are expected to address reported
      instability.  If you can't do so promptly, please silence the
      bot.  Once you're ready to unsilence the bot, post a change for
      review which does so and describes the action taken to stabilize
      the bot.  <br>
    </p>
    <p>(Obviously, this needs to be expanded a bit.)<br>
    </p>
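    <p>The "more than 1 build in 10" guideline in the suggested wording
      could be evaluated along these lines.  This is only a sketch:
      the input is assumed to be a list of booleans, one per recent
      build, marking builds that failed for reasons unrelated to the
      commits they built, and the function names are made up for the
      example.</p>

```python
def unstable_fraction(build_results):
    """Fraction of builds flagged as unstable.

    build_results: list of booleans, True where a build failed for
    reasons unrelated to the commits it built.
    """
    if not build_results:
        return 0.0
    return sum(build_results) / len(build_results)


def should_propose_silencing(build_results, threshold=0.1):
    """True when instability exceeds the 'more than 1 build in 10' guideline."""
    return unstable_fraction(build_results) > threshold
```

    <p>With exactly one unstable build in ten, the check does not
      trigger; the guideline as worded applies only above that
      ratio.</p>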
    <blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
      <div class="WordSection1">
        <p class="MsoNormal"><o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">> The main problem with flaky tests is
          random false blames. People get annoyed and stop paying
          attention to failures on a particular builder, and other
          builders as well, arguing that build bot in general is not
          reliable.<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Galina made a good point to me that people
          get annoyed by failures and stop paying attention to all
          buildbots. </p>
      </div>
    </blockquote>
    More immediately, people set up mail rules to ignore bots.  I know
    of multiple people who have these, and was on the edge of doing so
    myself.  This means that a bot which is spammy effectively only
    harasses new contributors which is, ah, less than ideal.  <br>
    <blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
      <div class="WordSection1">
        <p class="MsoNormal">I can see how flaky tests/bots contribute
          to the general ignoring of the buildbots, but I would argue
          that the root cause is the sheer volume of build breaks that
          are not the fault of a committer. The few times I’ve made
          commits to llvm, for example, I’ve
          <b>always</b> gotten at least one email about a break that was
          unrelated to my change (because my changes are perfect, thank
          you very much). This larger problem of build breaks is much
          harder to address than flaky bots or tests, but I think would
          improve the health of llvm & friends significantly more
          (and in the meantime, we could tolerate some “flakiness”).</p>
      </div>
    </blockquote>
    <p>I will note that I and Galina have been actively working on
      attempts to stabilize our existing infrastructure.  There's active
      work on trying to add mechanisms (e.g. silencing, staged builders,
      and maximum batch sizes) to cut down on the problem.  Please don't
      let "it's hard" become an argument that we should ignore the
      problem.</p>
    <p>Also, while yes many of our failures are bad changes, I think
      this makes up a minority of all failure notices.  I haven't checked
      anything other than my own trash folder, but that's certainly what
      I see.  The biggest contributors are blatantly unstable bots and
      unreasonably slow batched builders.  <br>
    </p>
    <p><br>
    </p>
    <blockquote type="cite"
cite="mid:MWHPR21MB1562BB29A1A51FC91890CF2CD9539@MWHPR21MB1562.namprd21.prod.outlook.com">
      <div class="WordSection1">
        <p class="MsoNormal"><o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Thanks,<o:p></o:p></p>
        <p class="MsoNormal">-Stella<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div style="border:none;border-top:solid #E1E1E1
          1.0pt;padding:3.0pt 0in 0in 0in">
          <p class="MsoNormal"><b>From:</b> llvm-dev
            <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev-bounces@lists.llvm.org">&lt;llvm-dev-bounces@lists.llvm.org&gt;</a> <b>On Behalf Of
            </b>Galina Kistanova via llvm-dev<br>
            <b>Sent:</b> Wednesday, January 12, 2022 11:24 PM<br>
            <b>To:</b> Mehdi AMINI <a class="moz-txt-link-rfc2396E" href="mailto:joker.eph@gmail.com">&lt;joker.eph@gmail.com&gt;</a><br>
            <b>Cc:</b> llvm-dev <a class="moz-txt-link-rfc2396E" href="mailto:llvm-dev@lists.llvm.org">&lt;llvm-dev@lists.llvm.org&gt;</a><br>
            <b>Subject:</b> [EXTERNAL] Re: [llvm-dev] Responsibilities
            of a buildbot owner<o:p></o:p></p>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <div>
            <p class="MsoNormal">> We may also use this on flaky bots
              in the future?<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Yes, we may.<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Or we may try to do our best to fix
              them. :)<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Moving workers to the staging
              temporarily to investigate and address an issue is fine.
              Gives a bit more elbow room for experimenting, as we can
              apply experimental patches there, restart the staging as
              needed and often, and so on. Which is not the case with
              the production. It does not take much effort to move a
              worker between the staging and the production areas - a
              simple edit of the buildbot.tac file and a worker restart.<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Tagging a builder "silent" means there
              is a designated person or a team who is actively fixing
              the detected issues or acting as a proxy to handle the
              blame list. This could be a way to deal with flaky bots,
              indeed, assuming there is somebody taking care of those
              builders, not just a way to skip the annoyance and keep
              the status quo.<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">By the way, thanks everyone for the
              constructive and polite discussion! It seems we are going
              to have a more stable and informative Windows LLDB
              builder.<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Galina<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <p class="MsoNormal">On Wed, Jan 12, 2022 at 9:19 PM Mehdi
            AMINI <<a href="mailto:joker.eph@gmail.com"
              target="_blank" moz-do-not-send="true">joker.eph@gmail.com</a>>
            wrote:<o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal"><o:p> </o:p></p>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <p class="MsoNormal">On Wed, Jan 12, 2022 at 7:33 PM Galina
            Kistanova <<a href="mailto:gkistanova@gmail.com"
              target="_blank" moz-do-not-send="true">gkistanova@gmail.com</a>>
            wrote:<o:p></o:p></p>
        </div>
        <blockquote style="border:none;border-left:solid #CCCCCC
          1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
          <div>
            <div>
              <p class="MsoNormal">Hello everyone,<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">In continuation of the
                Responsibilities of a buildbot owner thread.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">First of all, thank you very much for
                being buildbot owners! This is much appreciated.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal">Thank you for bringing good points to
                the discussion.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">It is expected that buildbot owners
                own bots which are reliable, informative and helpful to
                the community.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">Effectively that means if a problem
                is detected by a builder and it is hard to pinpoint the
                reason of the issue and a commit to blame, a buildbot
                owner is naturally on the escalation path. Someone has to
                get to the root of the problem and fix it one way or
                another (by reverting the commit, or by proposing a
                patch, or by working with the author of the commit which
                introduced the issue). In the majority of the cases
                someone takes care of an issue. But sometimes it takes a
                buildbot owner to push. Every buildbot owner does this
                from time to time.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">Hi Mehdi,<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">> Something quite annoying with
                staging is that it does not have (as far as I know) a
                way<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal">> to continue to notify the
                buildbot owner.<o:p></o:p></p>
            </div>
            <div>
              <p class="MsoNormal"><o:p> </o:p></p>
            </div>
            <div>
              <p class="MsoNormal">You mentioned this recently in one of
                the reviews. With <a
href="https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826"
                  target="_blank" moz-do-not-send="true">
https://github.com/llvm/llvm-zorg/commit/3c5b8f5bbc37076036997b3dd8b0137252bcb826</a>
                in place, you can add the tag "silent" to your
                production builder, and it will not send notifications
                to the blame list. You can set the exact notifications
                you want in the master/config/status.py for that
                builder. Hope this helps you.<o:p></o:p></p>
            </div>
          </div>
        </blockquote>
        <div>
          <p class="MsoNormal"><o:p> </o:p></p>
        </div>
        <div>
          <p class="MsoNormal">Fantastic! I'll use this for the next
            steps for my bots (when I get back to it, I slacked on this
            recently...) :)<o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal"><o:p> </o:p></p>
        </div>
        <div>
          <p class="MsoNormal">We may also use this on flaky bots in the
            future?<o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal"><o:p> </o:p></p>
        </div>
        <div>
          <p class="MsoNormal">Thanks,<o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal"><o:p> </o:p></p>
        </div>
        <div>
          <p class="MsoNormal">-- <o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal">Mehdi <o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal"><o:p> </o:p></p>
        </div>
        <div>
          <div>
            <p class="MsoNormal">I do not want to have the staging even
              able to send emails. We debug and test many things there,
              including notifications, and there is always a risk of
              spam.<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Thanks<o:p></o:p></p>
          </div>
          <div>
            <p class="MsoNormal"><o:p> </o:p></p>
          </div>
          <div>
            <p class="MsoNormal">Galina<o:p></o:p></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <p class="MsoNormal">On Sun, Jan 9, 2022 at 6:07 PM David
            Blaikie via llvm-dev <<a
              href="mailto:llvm-dev@lists.llvm.org" target="_blank"
              moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
            wrote:<o:p></o:p></p>
        </div>
        <div>
          <p class="MsoNormal">+1 to most of what Mehdi's said here -
            I'd love to see improvements in stability, though probably
            having some rigid delegation of responsibility (rather than
            relying on developers to judge whether it's a flaky test or
            flaky bot - that isn't always obvious, maybe it's only flaky
            on a particular configuration that that buildbot happens to
            test and the developer doesn't have access to - then which
            is it?) might help (eg: if it's at all unclear, then the
            assumption is that it's always the test or always the
            buildbot owner - and an expectation that the author or owner
            then takes responsibility for working with the other party
            to address the issue, etc).<br>
            <br>
            That all said, disabling individual tests may risk no one
            caring enough to re-enable them, especially when the
            flakiness is found long after the change is made that
            introduced the test or flakiness (usually the case with
            flakiness - it takes a while to become apparent) - I don't
            really know how to address that issue. The "convenience"
            with disabling a buildbot is that there's other value to the
            buildbot (other than the flaky test that was providing
            negative value), so buildbot owners have more motivation to
            get the bot back online - though I don't want to burden
            buildbot owners unduly either (because they'd eventually
            give up on them) :/ <br>
            <br>
            - Dave<o:p></o:p></p>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <div>
            <p class="MsoNormal">On Sat, Jan 8, 2022 at 5:15 PM Mehdi
              AMINI via llvm-dev <<a
                href="mailto:llvm-dev@lists.llvm.org" target="_blank"
                moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
              wrote:<o:p></o:p></p>
          </div>
          <blockquote style="border:none;border-left:solid #CCCCCC
            1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
            <div>
              <div>
                <p class="MsoNormal">Hi,<o:p></o:p></p>
              </div>
              <div>
                <p class="MsoNormal"><o:p> </o:p></p>
              </div>
              <div>
                <p class="MsoNormal">First: thanks a lot Stella for
                  being a bot owner and providing valuable resources to
                  the community. The sequence of events is really
                  unfortunate here, and thank you for bringing it up to
                  everyone's attention, let's try to improve our
                  processes.<o:p></o:p></p>
              </div>
              <p class="MsoNormal"><o:p> </o:p></p>
              <div>
                <div>
                  <p class="MsoNormal">On Sat, Jan 8, 2022 at 1:01 PM
                    Philip Reames via llvm-dev <<a
                      href="mailto:llvm-dev@lists.llvm.org"
                      target="_blank" moz-do-not-send="true">llvm-dev@lists.llvm.org</a>>
                    wrote:<o:p></o:p></p>
                </div>
                <blockquote style="border:none;border-left:solid #CCCCCC
                  1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                  <div>
                    <p>Stella,<o:p></o:p></p>
                    <p>Thank you for raising the question.  This is a
                      great discussion for us to have publicly.<o:p></o:p></p>
                    <p>So folks know, I am the individual Stella
                      mentioned below.  I'll start with a bit of history
                      so that everyone's on the same page, then dive
                      into the policy question.<o:p></o:p></p>
                    <p>My general take is that buildbots are only useful
                      if failure notifications are generally
                      actionable.  A couple months back, I was on the
                      edge of setting up mail filter rules to
                      auto-delete a bunch of bots because they were
                      regularly broken, and decided I should try to be
                      constructive first.  In the first wave of that, I
                      emailed a couple of bot owners about things which
                      seemed like false positives. 
                      <o:p></o:p></p>
                    <p>At the time, I thought it was the bot owner's
                      responsibility to not be testing a flaky
                      configuration.  I got a bit of push back on that
                      from a couple sources - Stella was one - and put
                      that question on hold.  This thread is a great
                      opportunity to decide what our policy actually is,
                      and document it.  <o:p></o:p></p>
                    <p>In the meantime, I've been working with Galina to
                      document existing practice where we could, and to
                      try to identify best practices on setting up
                      bots.  These changes have been posted publicly,
                      and reviewed through the normal process.  We've
                      been deliberately trying to stick to
                      non-controversial stuff as we got the docs
                      improved.  I've been actively reaching out to bot
                      owners to gather feedback in this process, but
                      Stella had not, yet, been one.
                      <o:p></o:p></p>
                    <p>Separately, this week I noticed a bot which was
                      repeatedly toggling between red and green.  I
                      forget the exact ratio, but in the recent build
                      history, there were multiple transitions,
                      seemingly unrelated to the changes being
                      committed.  I emailed Galina asking her to
                      address it, and she removed the buildbot until it
                      could be moved to the staging buildmaster,
                      addressed, and then restored.  I left Stella off
                      the initial email.  Sorry about that, no ill
                      intent, just written in a hurry. 
                      <o:p></o:p></p>
                    <p>Now, transitioning into a bit of policy
                      discussion...<o:p></o:p></p>
                    <p>From my conversations with existing bot owners,
                      there is a general agreement that bots should only
                      be notifying the community if they are stable
                      enough.  There's honest disagreement on what the
                      bar for stable enough is, and disagreement about
                      exactly whose responsibility addressing new
                      instability is.  (To be clear, I'd separate
                      instability from a clear deterministic breakage
                      caused by a commit - we have a lot more agreement
                      on that.)<o:p></o:p></p>
                    <p>My personal take is that for a bot to be publicly
                      notifying, "someone" needs to take the
                      responsibility to backstop the normal
                      revert-to-green process.  This "someone" can be developers
                      who work in a particular area, the bot owner, or
                      some combination thereof.  I view the
                      responsibility of the bot config owner as being
                      the person responsible for making sure that
                      backstopping is happening.  Not necessarily by
                      doing it themselves, but by having the contacts
                      with developers who can, and following up when the
                      normal flow is not working.<o:p></o:p></p>
                    <p>In this particular example, we appear to have a
                      bunch of flaky lldb tests.  I personally know
                      absolutely nothing about lldb.  I have no idea
                      whether the tests are badly designed, the system
                      they're being run on isn't yet supported by lldb,
                      or if there's some recent code bug introduced
                      which causes the failure.  "Someone" needs to take
                      the responsibility of figuring that out, and in
                      the meantime spamming developers with unactionable
                      failure notices seems undesirable. 
                      <o:p></o:p></p>
                  </div>
                </blockquote>
                <div>
                  <p class="MsoNormal"><o:p> </o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">I generally agree with the
                    overall sentiment. I would add that something worth
                    differentiating is that the source of flakiness can
                    be coming from the bot itself (flaky hardware /
                    fragile setup), or from the test/codebase itself (a
                    flaky bot may just be a deterministic ASAN failure).<o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">Of course from Philip's point of
                    view it does not matter: the effect on the developer
                    is similar, we get undesirable and unactionable
                    notifications. From the maintenance flow however, it
                    matters in that the "someone" who has to take
                    responsibility is often not the same group of folks.<o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">Also when encountering flaky
                    tests, the best action may not be to disable the bot
                    itself but instead to disable the test itself! (and
                    file a bug against the test owner...).<o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"><o:p> </o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">One more dimension that seems to
                    surface here may be different practices or
                    expectations across subprojects, for example here
                    the LLDB folks may be used to having some flaky
                    tests, but they trigger on changes to LLVM itself,
                    where we may not expect any flakiness (or so).<o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                </div>
                <blockquote style="border:none;border-left:solid #CCCCCC
                  1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                  <div>
                    <p>For context, the bot was disabled until it could
                      be moved to the staging buildmaster.  Moving to
                      staging is required (currently) to disable
                      developer notification.  In the email from Galina,
                      it seems clear that the bot would be fine to move
                      back to production once the issue was triaged. 
                      This seems entirely reasonable to me.  <o:p></o:p></p>
                  </div>
                </blockquote>
                <div>
                  <p class="MsoNormal"><o:p> </o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">Something quite annoying with
                    staging is that it does not have (as far as I know)
                    a way to continue to notify the buildbot owner. I
                    don't really care about staging vs prod as much as
                    having a mode to just "not notify the blame list" /
                    "only notify the owner".<o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"><o:p> </o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">-- <o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal">Mehdi<o:p></o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"><o:p> </o:p></p>
                </div>
                <div>
                  <p class="MsoNormal"> <o:p></o:p></p>
                </div>
                <blockquote style="border:none;border-left:solid #CCCCCC
                  1.0pt;padding:0in 0in 0in
6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">
                  <div>
                    <p>Philip<o:p></o:p></p>
                    <p>p.s. One thing I'll note as a definite problem
                      with the current system is that a lot of this
                      happens in private email, and it's hard to share
                      so that everyone has a good picture of what's
                      going on.  It makes miscommunications all too
                      easy.  Last time I spoke with Galina, we were
                      tentatively planning to start using GitHub issues
                      for bot operation matters to address that, but as
                      that was in the middle of the transition from
                      Bugzilla, we deferred and haven't gotten back to
                      that yet.<o:p></o:p></p>
                    <p>p.p.s. The bot in question is <a
href="https://lab.llvm.org/buildbot/#/builders/83"
                        target="_blank" moz-do-not-send="true">
                        https://lab.llvm.org/buildbot/#/builders/83</a>
                      if folks want to examine the history themselves. 
                      <o:p></o:p></p>
                    <div>
                      <p class="MsoNormal">On 1/8/22 12:06 PM, Stella
                        Stamenova via llvm-dev wrote:<o:p></o:p></p>
                    </div>
                    <blockquote
                      style="margin-top:5.0pt;margin-bottom:5.0pt">
                      <div>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Hey
                          all,<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I
                          have a couple of questions about what the
                          responsibilities of a buildbot owner are. I’ve
                          been maintaining a couple of buildbots for
                          lldb and mlir for some time now and I thought
                          I had a pretty good idea of what is required
                          based on the documentation here: <a
href="https://www.llvm.org/docs/HowToAddABuilder.html"
                            target="_blank" moz-do-not-send="true">
                            How To Add Your Build Configuration To LLVM
                            Buildbot Infrastructure — LLVM 13
                            documentation</a><o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">My
                          understanding was that there are some things
                          that are *<b>expected</b>* of the owner.
                          Namely:<o:p></o:p></p>
                        <ol type="1" start="1">
                          <li class="MsoNormal"
                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l1
                            level1 lfo6">
                            Make sure that the buildbot is connected and
                            has the right infrastructure (e.g. the right
                            version of Python, or tools, etc.). Update
                            as needed.<o:p></o:p></li>
                          <li class="MsoNormal"
                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l1
                            level1 lfo6">
                            Make sure that the build configuration is
                            one that is supported (e.g. supported flavor
                            or cmake variables). Update as needed.<o:p></o:p></li>
                        </ol>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">There
                          are also a couple of things that are *<b>optional</b>*,
                          but nice to have:<o:p></o:p></p>
                        <ol type="1" start="3">
                          <li class="MsoNormal"
                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l5
                            level1 lfo9">
                            If the buildbot stays red for a while (where
                            “a while” is completely subjective), figure
                            out the patch or patches that are causing an
                            issue and either revert them or notify the
                            authors, so they can take action.<o:p></o:p></li>
                          <li class="MsoNormal"
                            style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l5
                            level1 lfo9">
                            If someone is having trouble investigating a
                            failure that only happens on the buildbot
                            (or the buildbot is a rare configuration),
                            help them out (e.g. collect logs if
                            possible).<o:p></o:p></li>
                        </ol>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Up
                          to now, I’ve not had any issues with this and
                          the community has been very good at fixing
                          issues with builds and tests when I point them
                          out, or more often than not, without me having
                          to do anything but the occasional test re-run
                          and software update (like this one, for
                          example,
                          <a
href="https://reviews.llvm.org/D114639"
                            target="_blank" moz-do-not-send="true">
                            <span style="font-family:"Segoe UI
                              Emoji",sans-serif">⚙</span> D114639
                            Raise the minimum Visual Studio version to
                            VS2019 (llvm.org)</a>). lldb has some tests
                          that are flaky because of the nature of the
                          product, so there is some noise, but mostly
                          things work well and everyone seems happy.<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’ve
                          recently run into a situation that makes me
                          wonder whether there are other expectations of
                          a buildbot owner that are not explicitly
                          listed in the llvm documentation. Someone
                          reached out to me some time ago to let me know
                          their unhappiness at the flakiness of some of
                          the lldb tests and demanded that I either fix
                          them or disable them. I let them know that
                          there are some tests that are known to be
                          flaky, that my expectation is that it is not
                          my responsibility to fix all such issues and
                          that the community would be very happy to have
                          their contribution in the form of a fix or a
                          change to disable the tests. I didn’t get a
                          response from this person, but I did disable a
                          couple of particularly flaky tests since it
                          seemed like the nice thing to do.<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">The
                          real excitement happened yesterday when I
                          received an email that *<b>the build bot had
                            been turned off</b>*. This same person
                          reached out to the powers that be (without
                          letting me know) and asked them explicitly to
                          silence it *<b>without my active involvement</b>*
                          because of the flakiness.<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I
                          have a couple of issues with this approach but
                          perhaps I’ve misunderstood what my
                          responsibilities are as the buildbot owner. I
                          know it is frustrating to see a bot fail
                          because of flaky tests and it is nice to have
                          someone to ask to resolve them all – is that
                          really the expectation of a buildbot owner?
                          Where is the line between maintenance of the
                          bot and fixing build and test issues for the
                          community?<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">I’d
                          like to understand what the general
                          expectations are and if there are things
                          missing from the documentation, I propose that
                          we add them, so that it is clear for everyone
                          what is required.<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">Thanks,<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto">-Stella<o:p></o:p></p>
                        <p class="MsoNormal"
                          style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto"> <o:p></o:p></p>
                      </div>
                      <p class="MsoNormal"><o:p> </o:p></p>
                      <pre>_______________________________________________<o:p></o:p></pre>
                      <pre>LLVM Developers mailing list<o:p></o:p></pre>
                      <pre><a href="mailto:llvm-dev@lists.llvm.org" target="_blank" moz-do-not-send="true">llvm-dev@lists.llvm.org</a><o:p></o:p></pre>
                      <pre><a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank" moz-do-not-send="true">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><o:p></o:p></pre>
                    </blockquote>
                  </div>
                </blockquote>
              </div>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
  </body>
</html>