<div dir="ltr"><div>Hi,</div><div><br></div><div>First: thanks a lot, Stella, for being a bot owner and providing valuable resources to the community. The sequence of events is really unfortunate here, and thank you for bringing it to everyone's attention. Let's try to improve our processes.</div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 8, 2022 at 1:01 PM Philip Reames via llvm-dev &lt;<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex">

  <div>

    <p>Stella,</p>

    <p>Thank you for raising the question.  This is a great discussion

      for us to have publicly.</p>

    <p>So folks know, I am the individual Stella mentioned below.  I'll

      start with a bit of history so that everyone's on the same page,

      then dive into the policy question.</p>

    <p>My general take is that buildbots are only useful if failure

      notifications are generally actionable.  A couple months back, I

      was on the edge of setting up mail filter rules to auto-delete a

      bunch of bots because they were regularly broken, and decided I

      should try to be constructive first.  In the first wave of that, I

      emailed a couple of bot owners about things which seemed like

      false positives.  <br>

    </p>

    <p>At the time, I thought it was the bot owner's responsibility to

      not be testing a flaky configuration.  I got a bit of pushback on

      that from a couple sources - Stella was one - and put that

      question on hold.  This thread is a great opportunity to decide

      what our policy actually is, and document it.  <br>

    </p>

    <p>In the meantime, I've been working with Galina to document

      existing practice where we could, and to try to identify best

      practices on setting up bots.  These changes have been posted

      publicly, and reviewed through the normal process.  We've been

      deliberately trying to stick to non-controversial stuff as we got

      the docs improved.  I've been actively reaching out to bot owners

      to gather feedback in this process, but Stella had not, yet, been

      one. <br>

    </p>

    <p>Separately, this week I noticed a bot which was repeatedly

      toggling between red and green.  I forget the exact ratio, but in

      the recent build history, there were multiple transitions,

      seemingly unrelated to the changes being committed.  I emailed

      Galina asking her to address it, and she removed the buildbot until

      it could be moved to the staging buildmaster, addressed, and then

      restored.  I left Stella off the initial email.  Sorry about that,

      no ill intent, just written in a hurry.  <br>

    </p>

    <p>Now, transitioning into a bit of policy discussion...</p>

    <p>From my conversations with existing bot owners, there is a

      general agreement that bots should only be notifying the community

      if they are stable enough.  There's honest disagreement on what

      the bar for stable enough is, and disagreement about exactly whose

      responsibility addressing new instability is.  (To be clear, I'd

      separate instability from a clear deterministic breakage caused by

      a commit - we have a lot more agreement on that.)</p>

    <p>My personal take is that for a bot to be publicly notifying,

      "someone" needs to take the responsibility to backstop the normal

      revert to green process.  This "someone" can be developers who

      work in a particular area, the bot owner, or some combination

      thereof.  I view the responsibility of the bot config owner as

      being the person responsible for making sure that backstopping is

      happening.  Not necessarily by doing it themselves, but by having

      the contacts with developers who can, and following up when the

      normal flow is not working.</p>

    <p>In this particular example, we appear to have a bunch of flaky

      lldb tests.  I personally know absolutely nothing about lldb.  I

      have no idea whether the tests are badly designed, the system

      they're being run on isn't yet supported by lldb, or if there's

      some recent code bug introduced which causes the failure. 

      "Someone" needs to take the responsibility of figuring that out,

      and in the meantime spamming developers with inactionable failure

      notices seems undesirable.  <br></p></div></blockquote><div><br></div><div>I generally agree with the overall sentiment. I would add that something worth differentiating is that the source of flakiness can be the bot itself (flaky hardware / fragile setup), or the test/codebase itself (a flaky bot may just be a deterministic ASAN failure).</div><div>Of course, from Philip's point of view it does not matter: the effect on the developer is similar, we get undesirable and unactionable notifications. From a maintenance standpoint, however, it matters because the "someone" who has to take responsibility is often not the same group of folks.</div><div>Also, when encountering flaky tests, the best action may be not to disable the bot but to disable the test instead (and file a bug against the test owner...).</div><div><br></div><div>One more dimension that seems to surface here is that practices or expectations may differ across subprojects: for example, the LLDB folks may be used to having some flaky tests, but those tests trigger on changes to LLVM itself, where we may not expect any flakiness.</div><div> <br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div><p>

    </p>

    <p>For context, the bot was disabled until it could be moved to the

      staging buildmaster.  Moving to staging is required (currently) to

      disable developer notification.  In the email from Galina, it

      seems clear that the bot would be fine to move back to production

      once the issue was triaged.  This seems entirely reasonable to

      me.  <br></p></div></blockquote><div><br></div><div>Something quite annoying with staging is that it does not have (as far as I know) a way to continue to notify the buildbot owner. I don't really care about staging vs prod as much as having a mode to just "not notify the blame list" / "only notify the owner".</div><div><br></div><div>-- </div><div>Mehdi</div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div><p>

    </p>

    <p>Philip</p>

    <p>p.s. One thing I'll note as a definite problem with the current

      system is that a lot of this happens in private email, and it's

      hard to share so that everyone has a good picture of what's going

      on.  It makes miscommunications all too easy.  Last time I spoke

      with Galina, we were tentatively planning to start using GitHub

      issues for bot operation matters to address that, but as that was

      in the middle of the transition from Bugzilla, we deferred and

      haven't gotten back to that yet.</p>

    <p>p.p.s. The bot in question is

      <a href="https://lab.llvm.org/buildbot/#/builders/83" target="_blank">https://lab.llvm.org/buildbot/#/builders/83</a> if folks want to

      examine the history themselves.  <br>

    </p>

    <div>On 1/8/22 12:06 PM, Stella Stamenova

      via llvm-dev wrote:<br>

    </div>

    <blockquote type="cite">

      <div>

        <p class="MsoNormal">Hey all,</p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">I have a couple of questions about what the

          responsibilities of a buildbot owner are. I’ve been

          maintaining a couple of buildbots for lldb and mlir for some

          time now and I thought I had a pretty good idea of what is

          required based on the documentation here: <a href="https://www.llvm.org/docs/HowToAddABuilder.html" target="_blank">How To Add Your Build Configuration

            To LLVM Buildbot Infrastructure — LLVM 13 documentation</a></p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">My understanding was that there are some

          things that are *<b>expected</b>* of the owner. Namely:</p>

        <ol style="margin-top:0in" type="1" start="1">

          <li style="margin-left:0in">Make sure

            that the buildbot is connected and has the right

            infrastructure (e.g. the right version of Python, or tools,

            etc.). Update as needed.</li>

          <li style="margin-left:0in">Make sure

            that the build configuration is one that is supported (e.g.

            supported flavor or cmake variables). Update as needed.</li>

        </ol>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">There are also a couple of things that are

          *<b>optional</b>*, but nice to have:</p>

        <ol style="margin-top:0in" type="1" start="3">

          <li style="margin-left:0in">If the

            buildbot stays red for a while (where “a while” is

            completely subjective), figure out the patch or patches that

            are causing an issue and either revert them or notify the

            authors, so they can take action.</li>

          <li style="margin-left:0in">If someone

            is having trouble investigating a failure that only happens

            on the buildbot (or the buildbot is a rare configuration),

            help them out (e.g. collect logs if possible).</li>

        </ol>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">Up to now, I’ve not had any issues with

          this and the community has been very good at fixing issues

          with builds and tests when I point them out, or more often

          than not, without me having to do anything but the occasional

          test re-run and software update (like this one, for example, <a href="https://reviews.llvm.org/D114639" target="_blank">

            <span>⚙</span> D114639 Raise the minimum

            Visual Studio version to VS2019 (llvm.org)</a>). lldb has

          some tests that are flaky because of the nature of the

          product, so there is some noise, but mostly things work well

          and everyone seems happy.</p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">I’ve recently run into a situation that

          makes me wonder whether there are other expectations of a

          buildbot owner that are not explicitly listed in the LLVM

          documentation. Someone reached out to me some time ago to let

          me know their unhappiness at the flakiness of some of the lldb

          tests and demanded that I either fix them or disable them. I

          let them know that there are some tests that are known to be

          flaky, that my expectation is that it is not my responsibility

          to fix all such issues and that the community would be very

          happy to have their contribution in the form of a fix or a

          change to disable the tests. I didn’t get a response from this

          person, but I did disable a couple of particularly flaky tests

          since it seemed like the nice thing to do.</p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">The real excitement happened yesterday when

          I received an email that *<b>the build bot had been turned off</b>*.

          This same person reached out to the powers that be (without

          letting me know) and asked them explicitly to silence it *<b>without

            my active involvement</b>* because of the flakiness.</p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">I have a couple of issues with this

          approach but perhaps I’ve misunderstood what my

          responsibilities are as the buildbot owner. I know it is

          frustrating to see a bot fail because of flaky tests and it is

          nice to have someone to ask to resolve them all – is that

          really the expectation of a buildbot owner? Where is the line

          between maintenance of the bot and fixing build and test

          issues for the community?</p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">I’d like to understand what the general

          expectations are and if there are things missing from the

          documentation, I propose that we add them, so that it is clear

          for everyone what is required.</p>

        <p class="MsoNormal">&nbsp;</p>

        <p class="MsoNormal">Thanks,</p>

        <p class="MsoNormal">-Stella</p>

        <p class="MsoNormal">&nbsp;</p>

      </div>

      <br>

      <fieldset></fieldset>

      <pre>_______________________________________________

LLVM Developers mailing list

<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>

<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>

</pre>

    </blockquote>

  </div>


</blockquote></div></div>