[clang] [Clang] Add timeout for GPU detection utilities (PR #94751)

Fri Jun 7 11:32:14 PDT 2024

Artem-B wrote:

Ooh... I think I know exactly what may be causing this.

On machines where NVIDIA GPUs are used for compute only (e.g. a headless server), the NVIDIA drivers are not always loaded by default and may not have driver persistence enabled. The drivers get loaded when a GPU is accessed, and are released and unloaded once no GPU users remain. A parallel compilation with `--offload-arch=native` is the worst-case stress test for the driver init/deinit machinery, as each GPU probe is short-lived and the probing is done repeatedly.

Adding a timeout here would help, sort of, but it would be much better if we could figure out a way to either detect that GPU probing takes too long (and likely causes the driver to load/unload), or cache probing results somehow, so we do not have to run the same detection over and over again. This is a point towards pushing the detection out of clang into the build system, which would be the better place to do it.
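
To make the caching idea concrete, here is a minimal sketch of what a build-system-side probe could look like. It is only an illustration under assumptions: `probe_gpu_archs`, the cache file location and the JSON format are all hypothetical, and the detection command is passed in by the caller (e.g. `nvptx-arch`) rather than hard-coded. It runs the probe once with a timeout and reuses the stored result until the cache expires.

```python
import json
import os
import subprocess
import tempfile
import time


def probe_gpu_archs(cmd, cache_path, timeout_s=10.0, max_age_s=3600.0):
    """Return the architectures printed by `cmd`, one per output line.

    Hypothetical helper: the result is cached in `cache_path`, so repeated
    calls within `max_age_s` seconds do not touch the GPU driver again.
    """
    try:
        if time.time() - os.path.getmtime(cache_path) < max_age_s:
            with open(cache_path) as f:
                return json.load(f)
    except (OSError, ValueError):
        pass  # No usable cache; fall through and probe.

    # subprocess.run raises TimeoutExpired if the probe hangs, e.g. while
    # the driver is being loaded or unloaded, and CalledProcessError if
    # the probe exits with a non-zero status.
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout_s, check=True)
    archs = [line.strip() for line in result.stdout.splitlines()
             if line.strip()]

    # Write to a temporary file and rename it, so that parallel build
    # jobs never read a partially written cache.
    cache_dir = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(archs, f)
    os.replace(tmp, cache_path)
    return archs
```

A build system could call something like `probe_gpu_archs(["nvptx-arch"], "gpu-archs.json")` once at configure time and then pass the result to every compile job as an explicit `--offload-arch=` value instead of `native`, so the individual clang invocations never probe the GPU at all.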

For the GPU detection, we may be able to work around the issue by leaving the detection app running for the duration of the compilation, which would prevent the driver from unloading, but it's a rather gross hack.

https://github.com/llvm/llvm-project/pull/94751