[PATCH] D114325: Add a best practice section on how to configure a fast builder

Mon Nov 22 10:59:14 PST 2021

reames marked an inline comment as done.
reames added a comment.

In D114325#3146783 <https://reviews.llvm.org/D114325#3146783>, @dblaikie wrote:

> Generally seems like a good direction to go in, if it's feasible - but it seems like a pretty big step up in resource expectations compared with the past/current buildbots/workers, so I'm not sure how feasible it is to get those resources.

Er, no.  This is very much documenting current status.  While we have a bunch of batch builders, we also have a bunch which do build every commit.

See every builder with collapseRequests: false in https://github.com/llvm/llvm-zorg/blob/main/buildbot/osuosl/master/config/builders.py.  Admittedly, this is only 12 out of 142 registered builders, but that doesn't count all the ones which keep up in practice and simply haven't committed to it.
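
To make the count above concrete, here is a minimal sketch of how one might tally such builders.  The entries are hypothetical stand-ins, not real builders from builders.py; the only field taken from this discussion is `collapseRequests`, and treating a missing key as "may batch" is an assumption of the sketch.

```python
# Hypothetical builder entries; the names are illustrative only.
builders = [
    {"name": "example-fast-builder", "collapseRequests": False},
    {"name": "example-batch-builder", "collapseRequests": True},
    {"name": "example-default-builder"},  # key omitted
]


def builds_every_commit(builder):
    """True only if the builder explicitly opts out of collapsing requests."""
    # Assumption: a missing key means requests may be collapsed (batched).
    return builder.get("collapseRequests", True) is False


every_commit = [b["name"] for b in builders if builds_every_commit(b)]
print(f"{len(every_commit)} of {len(builders)} builders build every commit")
```

A tally like this only finds builders that have explicitly committed to the setting, so it misses the ones that keep up in practice.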

================
Comment at: llvm/docs/HowToAddABuilder.rst:154-155
+
+As mentioned above, we generally have a strong preference for
+builders which can build every commit as they come in.  This section
+includes best practices and some recommendations as to how to achieve
----------------
rengolin wrote:
> dblaikie wrote:
> > do we have any builders that achieve this consistently (I wouldn't think so, given the resources required)? Maybe worth rephrasing if it's  not actually achievable/achieved generally to something more in line with the practical reality?
> > 
> > If this document is more aspirational/trying to set a fairly new (albeit good, but perhaps not feasible?) direction - maybe it'd be more suitable in a different form/forum?
> I don't think we have many, if any, but I interpreted it as "preference" and "best practices", not that we don't accept others. I agree we shouldn't be discouraging people from setting up buildbots if they can't follow these guidelines.
As noted in the top level comment, we have a bunch of builders which do keep up building every commit.  

This is aspirational, but only in the sense that a new builder which can't meet this bar has to explain why it's still worthwhile having as a notifying builder.  We may accept it, but the burden of justification is definitely on the bot owner.  

The main glide path I see - which we need better infrastructure for - is allowing "small" (2-3) commit batches as a graceful fallback when fully keeping up isn't practical.  
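
As a rough illustration of what I mean by small batches, here is a sketch in plain Python.  This is not an existing buildbot feature or API; `batch_commits` and the commit identifiers are hypothetical, and the sketch only shows the intended behaviour of capping the blame list.

```python
def batch_commits(pending, max_batch=3):
    """Split pending commits (oldest first) into batches of at most max_batch."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    return [pending[i:i + max_batch] for i in range(0, len(pending), max_batch)]


# Hypothetical commit identifiers, oldest first.
pending = ["c1", "c2", "c3", "c4", "c5", "c6", "c7"]
for batch in batch_commits(pending):
    # Each batch would become one build; its blame list has at most max_batch entries.
    print(batch)
```

With a cap like this, a failure notification still names at most three candidate commits, which is the property that keeps it actionable.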

================
Comment at: llvm/docs/HowToAddABuilder.rst:212-216
+  Using ccache materially improves average build times.  Incremental builds
+  can be slightly faster, but introduce the risk of build corruption due to
+  e.g. state changes.  At this point, the recommendation is not to
+  use incremental builds and instead use ccache as the latter captures the
+  majority of the benefit with less risk of false positives.
----------------
rengolin wrote:
> dblaikie wrote:
> > Seems like we should figure out how to make incremental builds more reliable - to benefit developers (& then have buildbots using incremental builds to ensure they do keep working so developers can benefit from them being reliable). But, yeah, if it's just not practical today, so be it.
> For a number of years I used incremental builds on Arm with very little trouble. I had to clean the build directory perhaps a couple of times a year when something (that I don't remember) happened, but otherwise it was way better than full builds and ccache (due to using SSD or USB2 disks on dev boards).
I agree with the goal, but this document's purpose is to describe best practice for the current reality.  If we get incremental builds working reliably, we can change the recommendation.
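
For reference, a minimal sketch of what the current recommendation looks like as a configure step: a clean build directory with ccache enabled through LLVM's `LLVM_CCACHE_BUILD` CMake option.  The helper name, the paths, the Ninja generator, and the Release plus assertions build type are assumptions made for illustration, not something the patch prescribes.

```python
import shlex


def cmake_configure_command(source_dir, build_dir, use_ccache=True):
    """Return a CMake configure command for a clean (non-incremental) build."""
    cmd = [
        "cmake", "-G", "Ninja",
        "-S", source_dir,
        "-B", build_dir,
        # Assumed build type for this sketch.
        "-DCMAKE_BUILD_TYPE=Release",
        "-DLLVM_ENABLE_ASSERTIONS=ON",
    ]
    if use_ccache:
        # LLVM's CMake option that routes compiles through ccache.
        cmd.append("-DLLVM_CCACHE_BUILD=ON")
    return cmd


# Hypothetical source and build paths.
print(shlex.join(cmake_configure_command("llvm-project/llvm", "build")))
```

The build directory is assumed to be wiped between builds; the ccache cache lives outside it, which is what lets a clean build recover most of the incremental speedup.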

================
Comment at: llvm/docs/HowToAddABuilder.rst:224-227
+  With multiple workers, it is tempting to try to configure a shared cache
+  between the workers.  Experience to date indicates this is difficult to
+  do well, and that having local per-worker caches gets most of the benefit
+  anyway.  We don't currently recommend shared caches.
----------------
rengolin wrote:
> dblaikie wrote:
> > Is this about multiple workers on the same machine, or some kind of network shared cache? Presumably if we're suggesting people have multiple workers per builder (to get fast enough cycle time/short enough blame list) - that's multiple machines (since generally we could get enough parallelism to saturate a machine in the build - I guess not all the time, so maybe there's some parallelism benefit to multiple workers on the same machine?)
> I interpreted it as a network cache. I suppose it could be the same machine, too, though it would use the cache in a similar way if you use containers, for example.
> 
> Low memory machines have to restrict link parallelism, so running two builds at the same time could still OOM-kill builds. High memory machines (using LLD in release mode) have the linking phase fast enough that multiple builds tend to not help much. GCC builds used to be much less parallel than LLVM, so it worked well for them.
> 
> For a while, for Arm64, we didn't have a lot of machines, so we put multiple (different) builders on the same machine, but that couldn't use the same cache anyway.
I don't know.  The little bit of discussion I've had with existing bot owners is mostly around the problems introduced by multiple workers on the same machine.  I don't currently have any guidelines to suggest on this balancing act and thus left it out.

If you have experience with this, let's talk offline.  

================
Comment at: llvm/docs/HowToAddABuilder.rst:236-238
+  As a last resort, you can configure your builder to batch build requests.
+  This makes the build failure notifications markedly less actionable, and
+  should only be done once all other reasonable measures have been taken.
----------------
dblaikie wrote:
> That's the default/what most (all?) of the buildbots are doing today, though, yeah?
See top level comments.

================
Comment at: llvm/docs/HowToAddABuilder.rst:198
+  generally provides a good balance between build times and bug detection for
+  most buildbots.
+
----------------
dblaikie wrote:
> reames wrote:
> > mehdi_amini wrote:
> > > rengolin wrote:
> > > > `RelWithDebInfo` is perhaps even more helpful, because you test the optimisation pipeline, get smaller objects to link, and still, in case of stack traces, you can see from the logs directly where to begin looking.
> > > RelWithDebInfo seems a bit heavy to me: the objects get ~10x larger IIRC.
> > > If what you're after is better stack traces in case of crashes, then `-gmlt` (line tables only) gets you that without blowing up the disk size / link time.
> > I think I managed to address this with the revised wording, let me know if further tweaking is warranted.
> Split DWARF can reduce linker input size/time too, for what that's worth - but maybe enough in the noise/weeds/details to be omitted here.
If someone with experience can provide guidance here, I'm happy to write it up.  I just haven't heard of any successful deployments yet.
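
For anyone who wants to experiment, the options raised in this thread map onto CMake arguments roughly as sketched below.  This is untested on a buildbot; the helper and the mode names are hypothetical, and the pairing of build types with flags is my assumption.  `-gmlt`, `RelWithDebInfo`, and `LLVM_USE_SPLIT_DWARF` are the existing compiler flag, CMake build type, and LLVM CMake option respectively.

```python
def debug_info_flags(mode):
    """Map a debug-info mode discussed in this thread to extra CMake arguments."""
    options = {
        # Plain optimized build, no debug info.
        "none": [],
        # Line tables only: usable stack traces at a small size cost.
        "line-tables": ["-DCMAKE_C_FLAGS=-gmlt", "-DCMAKE_CXX_FLAGS=-gmlt"],
        # Full debug info: much larger objects and longer links.
        "full": ["-DCMAKE_BUILD_TYPE=RelWithDebInfo"],
        # Full debug info, with split DWARF to shrink linker inputs.
        "split-dwarf": [
            "-DCMAKE_BUILD_TYPE=RelWithDebInfo",
            "-DLLVM_USE_SPLIT_DWARF=ON",
        ],
    }
    if mode not in options:
        raise ValueError(f"unknown debug-info mode: {mode!r}")
    return options[mode]


print(debug_info_flags("line-tables"))
```

Which of these is the right default for a fast builder is exactly the guidance I'm missing.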

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114325/new/

https://reviews.llvm.org/D114325