[PATCH] D140433: [Clang] Add `nvptx-arch` tool to query installed NVIDIA GPUs

Tue Jan 3 16:35:27 PST 2023

jhuber6 added inline comments.

================
Comment at: clang/tools/nvptx-arch/CMakeLists.txt:19
+if (NOT CUDA_FOUND OR NOT cuda-library)
+  message(STATUS "Not building nvptx-arch: cuda runtime not found")
+  return()
----------------
tra wrote:
> Nit: libcuda.so is part of the NVIDIA driver which provides NVIDIA driver API , It has nothing to do with the CUDA runtime.
> Here, it's actually not even the libcuda.so itself that's not found, but it's stub. 
> I think a sensible error here should say "Failed to find stubs/libcuda.so in CUDA_LIBDIR"
Good point. Never thought about the difference because they're both called `cuda` somewhere.

================
Comment at: clang/tools/nvptx-arch/CMakeLists.txt:25
+
+set_target_properties(nvptx-arch PROPERTIES INSTALL_RPATH_USE_LINK_PATH ON)
+target_include_directories(nvptx-arch PRIVATE ${CUDA_INCLUDE_DIRS})
----------------
tra wrote:
> Does it mean that the executable will have RPATH pointing to CUDA_LIBDIR/stubs?
> 
> This should not be necessary. The stub shipped with CUDA comes as "libcuda.so" only. It's SONAME is libcuda.so.1, but there's no symlink with that name in stubs, so RPATH pointing there will do nothing. At runtime, dynamic linker will attempt to open libcuda.so.1 and it will only be found among the actual libraries installed by NVIDIA drivers.
> 
> 
Interesting, I can probably delete it. Another thing I mostly just copied from the existing tool.

================
Comment at: clang/tools/nvptx-arch/NVPTXArch.cpp:26
+#if !CUDA_HEADER_FOUND
+int main() { return 1; }
+#else
----------------
tra wrote:
> How do we distinguish "we didn't have CUDA at build time" reported here from "some driver API failed with CUDA_ERROR_INVALID_VALUE=1" ?
> 
I guess the latter would print an error message. We do the same thing with the `amdgpu-arch` so I just copied it.

================
Comment at: clang/tools/nvptx-arch/NVPTXArch.cpp:34
+  const char *ErrStr = nullptr;
+  CUresult Result = cuGetErrorString(Err, &ErrStr);
+  if (Result != CUDA_SUCCESS)
----------------
tra wrote:
> One problem with this approach is that `nvptx-arch` will fail to run on a machine without NVIDIA drivers installed because dynamic linker will not find `libcuda.so.1`.
> 
> Ideally we want it to run on any machine and fail the way we want.
> 
> A typical way to achieve that is to dlopen("libcuda.so.1"), and obtain the pointers to the functions we need via `dlsym()`.
> 
> 
We do this in the OpenMP runtime. I mostly copied this approach from the existing `amdgpu-arch` but we could change both to use this method.

================
Comment at: clang/tools/nvptx-arch/NVPTXArch.cpp:63
+
+    printf("sm_%d%d\n", Major, Minor);
+  }
----------------
tra wrote:
> jhuber6 wrote:
> > tianshilei1992 wrote:
> > > Do we want to include device number here?
> > For `amdgpu-arch` and here we just have it implicitly in the order, so the n-th line is the n-th device, i.e.
> > ```
> > sm_70 // device 0
> > sm_80 // device 1
> > sm_70 // device 2
> > ```
> NVIDIA GPU enumeration order is more or less arbitrary. By default it's arranged by "sort of fastest GPU first", but can be rearranged in order of PCI(e) bus IDs or in an arbitrary user-specified order using `CUDA_VISIBLE_DEVICES`. Printing compute capability in the enumeration order is pretty much all the user needs.  If we want to print something uniquely identifying the device, we would need to pring the device UUID, similarly to what `nvidia-smi -L` does. Or PCIe bus IDs. In other words -- we can uniquely identify devices, but there's no such thing as inherent canonical order among the devices.
I think it's mostly just important that it prints a valid GPU. Most of the uses for this tool will just be "Give me a valid GPU I can run on this machine".

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D140433/new/

https://reviews.llvm.org/D140433