[PATCH] D84258: [buildbot] Added config files for CUDA build bots
Christian Kühnel via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Fri Jul 24 07:52:50 PDT 2020
kuhnel added a comment.
In D84258#2167464 <https://reviews.llvm.org/D84258#2167464>, @tra wrote:
>
>> Why create another cluster and another node pool?
>
> I don't want to nuke an already-running mlir cluster when I'm changing something in my setup. Many terraform operations result in 'tear down everything and create it from scratch'.
> Also, MLIR's requirements are not exactly identical to my CUDA bot requirements. If you've noticed, the VMs in the cudabot clusters have a notably different configuration from the MLIR ones.
The buildbots get restarted every 24h anyway, so I suppose they can handle 1-2 more restarts. I also would not expect many re-deployments of the cluster or of the node pools; at least my setup has become relatively stable. Minor changes can even be done on the fly. I only re-deployed the cluster today to move from 16 to 32 cores, and that change forces a re-deployment anyway.
>> We can share these across our machines.
>
> I'm not sure about that. Both MLIR and CUDA need a GPU to work with, and GPUs are not shareable. So, if a machine already runs a pod which requested a GPU, no further GPU-requesting pod can be scheduled on it. In the end you will need the same number of VMs w/ GPUs. You could share the controller, but that's a negligible cost compared to everything else. In addition, the VM configuration for the CUDA bots is tweaked for the CUDA buildbot workload. One of the pools has 24 cores, while the other two run with only 8 (and I may further reduce that). That arrangement may not be the right one for MLIR.
I would not run multiple containers on one VM. As you said, k8s cannot share one GPU across containers. I would rather create one "build slave" per VM in buildbot (or a group of "build slaves", each in a separate VM) and then have that VM (or those VMs) execute a set of "builders". We could have an m:n mapping of "build slaves" and "builders"; a rough sketch follows.
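To make the m:n idea concrete, here is a minimal sketch of such a mapping in a buildbot master.cfg, using the current buildbot plugin API. The worker names, passwords, and build step are made up for illustration, not taken from the actual CUDA bot config:

  # Minimal sketch of an m:n worker/builder mapping in master.cfg.
  # Worker names, passwords and the build step are hypothetical.
  from buildbot.plugins import steps, util, worker

  c = BuildmasterConfig = {}

  # One buildbot worker per GPU VM ("build slave" in the older UI).
  c['workers'] = [
      worker.Worker('cuda-vm-1', 'secret1'),
      worker.Worker('cuda-vm-2', 'secret2'),
  ]

  f = util.BuildFactory()
  f.addStep(steps.ShellCommand(command=['ninja', 'check-all']))

  # m builders x n workers: each builder may run on any GPU VM,
  # and each VM serves several builders.
  gpu_vms = ['cuda-vm-1', 'cuda-vm-2']
  c['builders'] = [
      util.BuilderConfig(name='mlir-nvidia', workernames=gpu_vms, factory=f),
      util.BuilderConfig(name='clang-cuda',  workernames=gpu_vms, factory=f),
  ]

Buildbot then picks an idle worker for each pending build, so the GPU VMs stay busy without ever running two GPU jobs on the same card.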
My mlir-nvidia builder is not very picky. It would probably run on any of your machines, as long as they have an Nvidia card. Sorry about the non-inclusive wording above, but "build slave" is what buildbot calls them in the UI.
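For reference, this is roughly what the GPU exclusivity looks like from the k8s side. A sketch using the kubernetes Python client, with hypothetical pod and image names: since nvidia.com/gpu resources are requested as whole devices and cannot be overcommitted, a node with a single GPU can host only one such pod at a time.

  # Sketch: a pod that claims a whole GPU (kubernetes Python client).
  # Pod name, image and namespace are hypothetical.
  from kubernetes import client, config

  config.load_kube_config()  # assumes a local kubeconfig for the cluster

  pod = client.V1Pod(
      metadata=client.V1ObjectMeta(name='cuda-buildbot-worker'),
      spec=client.V1PodSpec(
          restart_policy='Never',
          containers=[client.V1Container(
              name='worker',
              image='example.com/cuda-buildbot:latest',
              resources=client.V1ResourceRequirements(
                  # GPUs are requested as whole devices; k8s will not
                  # schedule a second GPU pod onto a one-GPU node.
                  limits={'nvidia.com/gpu': '1'},
              ),
          )],
      ),
  )
  client.CoreV1Api().create_namespaced_pod(namespace='default', body=pod)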
> Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping the MLIR and CUDA clusters separate. We could keep the config in the same file, but given that the bots are substantially different, I don't see that buying us much, while it increases the risk of accidentally changing something in the wrong setup, since both run within the same GCP project.
But yes, tighter coupling would increase the number of conflicts from parallel edits. We would also somehow have to make sure we're not trying to deploy two different things in parallel...
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D84258/new/
https://reviews.llvm.org/D84258