[PATCH] D84258: [buildbot] Added config files for CUDA build bots

Wed Jul 22 11:13:47 PDT 2020

tra added a comment.

In D84258#2166069 <https://reviews.llvm.org/D84258#2166069>, @kuhnel wrote:

> Why do you want to double the config files and scripts?

The terraform script can be merged, but...

> Why create another cluster and another node pool?

I don't want to nuke an already-running MLIR cluster when I'm changing something in my setup. Many terraform operations result in 'tear down everything and create it from scratch'.
Also, MLIR's requirements are not identical to my CUDA bot requirements. As you may have noticed, the VMs in the cudabot clusters are configured notably differently from the MLIR ones.

> We can share these across our machines.

I'm not sure about that. Both MLIR and CUDA need a GPU to work with, and GPUs are not shareable. So, if a machine already runs a pod which requested a GPU, no other pod can use that GPU. In the end you will need the same number of VMs w/ GPUs. You could share the controller, but that's a negligible cost compared to everything else. In addition to that, the VM configuration for CUDA bots is tweaked for the CUDA buildbot workload. One of the pools has 24 cores, while the other two run with only 8 (and I may reduce that further). That arrangement may not be the right one for MLIR.

Granted, we could create a pool for each possible configuration we may need, but considering that the cluster itself may need to be torn down to reconfigure, I believe we're currently better off keeping MLIR and CUDA clusters separate. We could keep config in the same file, but considering that the bots are substantially different, I don't see it buying us much, while it increases the risk of accidentally changing something in the wrong setup as they both run within the same GCP project.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D84258/new/

https://reviews.llvm.org/D84258