[PATCH] D84258: [buildbot] Added config files for CUDA build bots

Artem Belevich via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Fri Jul 24 11:18:26 PDT 2020


tra added a comment.

In D84258#2172337 <https://reviews.llvm.org/D84258#2172337>, @kuhnel wrote:

> I would not run multiple containers on one VM. As you said, k8s cannot share one GPU across containers. I would rather create one "build slave" per VM (or a group of "build slaves", each in a separate VM) in buildbot and then have that VM(s) execute a set of "builders". We could have an m:n mapping of "build slaves" and "builders".
>
> My mlir-nvidia builder is not very picky. It would probably run on any of your machines as long as it has an Nvidia card. Sorry about the non-inclusive wording here, but that's what buildbot calls them in the UI.


That may be doable. 
At the moment, all CUDA bots do their own builds of the test-suite tests. I'm planning to figure out how to build them once and have the GPU-enabled machines only run the tests. 
GPU tests are relatively fast compared to building them, so there will be plenty of time to share. 
If MLIR can also split build from test, then sharing a GPU-enabled VM among multiple builders makes sense. 
However, as long as each builder is expected to compile something substantial, sharing will come at the expense of higher latency for bot results -- one of the issues we want to fix here.

>> Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping MLIR and CUDA clusters separate. We could keep config in the same file, but considering that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup as they both run within the same GCP project.
> 
> But yes, having a tighter coupling would increase the number of conflicts of parallel edits. We would also somehow have to make sure we're not trying to deploy two different things in parallel...

Interlocking two builders is relatively easy if we use annotated builder scripts. We can just add `flock /builder/global-build-lock` to the build scripts at strategic points.
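As a rough sketch, a build step could serialize its GPU-using phase like this (the lock path and phase contents below are illustrative placeholders, not the actual builder script):

```shell
#!/bin/sh
# Sketch: serialize the GPU-using phase of a build script with flock(1).
# LOCKFILE stands in for /builder/global-build-lock on the real bot;
# the echo commands are placeholders for the real build/test steps.
LOCKFILE=/tmp/global-build-lock

# Build phase: safe to run in parallel with other builders on the VM,
# so no lock is taken here.
echo "compiling test-suite..."

# Test phase: take an exclusive lock on fd 9 so only one builder
# drives the GPU at a time; flock blocks until the lock is free.
(
  flock -x 9
  echo "running GPU tests"
) 9>"$LOCKFILE"
```

Since `flock` blocks rather than fails by default, builders queue up on the lock instead of stepping on each other's GPU runs.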

Let's keep the clusters & pools separate for now. We'll revisit the issue once I evolve the setup to have separate build/test machines. Then we can consider consolidating things.


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D84258/new/

https://reviews.llvm.org/D84258
