[clang] [NVPTX] Add support for -march=native in standalone NVPTX (PR #79373)

Thu Jan 25 05:35:41 PST 2024

jhuber6 wrote:

> I think I'm with Art on this one.
> 
> > > Problem #2 [...] The arch=native will create a working configuration, but would build more than necessary.
> > 
> > 
> > It will target the first GPU it finds. We could maybe change the behavior to detect the newest, but the idea is just to target the user's system.
> 
> OK, but I think this is worse.
> 
> Now it's basically always incorrect to ship a build system which uses arch=native, because the people running the build might very reasonably have multiple GPUs in their system, and which GPU clang picks is unspecified.

It's not unspecified per se; it picks the GPU the CUDA driver assigned to ID zero, so it matches the device a typical user gets by default when running under CUDA.

The AMDGPU version has a warning when multiple GPUs are found. I should probably add the same thing here as it would make this explicit.
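
To make the selection rule concrete, here is a rough sketch of what I have in mind. This is not the code in the patch: `NativeArchChoice` and `pickNativeArch` are hypothetical names, and I'm assuming the detected architectures arrive ordered by CUDA device ID and that the warning fires whenever more than one GPU is found.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical result type for illustration; not part of clang.
struct NativeArchChoice {
  std::string arch;   // architecture that would be used for -march=native
  bool warnMultiple;  // true when more than one GPU was detected
};

// `detected` is assumed to be ordered by CUDA device ID, so the first
// entry corresponds to device 0.
std::optional<NativeArchChoice>
pickNativeArch(const std::vector<std::string> &detected) {
  if (detected.empty())
    return std::nullopt;  // no GPU found, nothing to resolve 'native' to
  // Always take device 0; flag the case where other GPUs were ignored.
  return NativeArchChoice{detected.front(), detected.size() > 1};
}
```

With `{"sm_80", "sm_61"}` as input this would pick `sm_80` and set the warning flag, so the user at least learns that another device was skipped.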

> But we all know people are going to do it anyway.
> 
> Given that this feature cannot correctly be used with a build system, and given that 99.99% of invocations of clang are from a build system that the user running the build did not write, it seems to me that we should not add a feature that is such a footgun when used with a build system.
> 
> (A non-CUDA C++ file compiled with march=native will almost surely run on your computer, whereas this won't, and it's unpredictable whether or not it will, depending on the order the nvidia driver returns GPUs in. So there's no good analogy here.)
> 
> If we were going to add this, I think we should compile for all the GPUs in your system, like Art had assumed. I think that's better, but it has other problems, like slow builds and also the fact that your graphics GPU is likely less powerful than your compute GPU, so now compilation is going to fail because you're e.g. using tensorcores and compiling for a GPU that doesn't have them. So again you can't really use arch=native in a build system, even if you say "requires an sm80 GPU", because really the requirement is "has an sm80 GPU and no others in the machine".

We already do this for CUDA with `--offload-arch=native`. This handling is for targeting NVPTX directly, similar to OpenCL. That means there is no concept of multiple device passes; there can only be a single target architecture, just like when compiling standard C++ code. I'd like to have `-march=native` because it makes it easier to build something that works for testing purposes, and it's consistent with the rest of the native handling, since to my knowledge NVPTX is the only target that doesn't support it.

https://github.com/llvm/llvm-project/pull/79373