[clang] [NVPTX] Add support for -march=native in standalone NVPTX (PR #79373)

Wed Jan 24 17:51:34 PST 2024

jhuber6 wrote:

Some interesting points, I'll try to clarify some things.

> This option may not work as well as one would hope.
> 
> Problem #1 is that it will drastically slow down compilation for some users. NVIDIA GPU drivers are loaded on demand, and the process takes a while (O(second), depending on the kind and number of GPUs). If you build on a headless machine, they will get loaded during the GPU probing step, and they will get unloaded after that. For each compilation. This will also affect folks who use AMD GPUs to run graphics, but use NVIDIA GPUs for compute (my current machine is set up that way). It can be worked around by enabling driver persistence, but there would be no obvious cues for the user that they would need to do so.

On my machine, with the GPUs already loaded, calling `nvptx-arch` takes about 15ms. For the headless situation, I've noticed that if I have not started Xorg on my server it will take up to 250ms, which is what I'm assuming you're referring to. I think this latency is reasonable, but we'd probably want to document what it does under the hood.
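
For anyone who wants to check the latency on their own setup, here is a minimal sketch that times repeated invocations of a command. The helper name `time_command` is made up for illustration, and it assumes `nvptx-arch` is on `PATH` when used that way. On a headless machine without driver persistence every run may pay the driver load cost, so the individual timings are more informative than an average.

```python
import subprocess
import time


def time_command(argv, runs=3):
    """Return a list of wall-clock times (seconds), one per invocation of argv."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the tool's output; only the elapsed time matters here.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        times.append(time.perf_counter() - start)
    return times


# Example use (assumes nvptx-arch is installed and on PATH):
#   for t in time_command(["nvptx-arch"]):
#       print(f"{t * 1000:.1f} ms")
```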

> Problem #2 is that it will likely result in unnecessary compilation for a nontrivial subset of users who have separate GPUs dedicated to compute and do not care to compile for a separate GPU they use for graphics. The `arch=native` will create a working configuration, but would build more than necessary. Again, the end user would not be aware of that.

It will target the first GPU it finds. We could maybe change the behavior to detect the newest, but the idea is just to target the user's system. I suppose this is somewhat different from the existing `--offload-arch=native`, which will correctly compile for all supported GPUs.
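
To make the "first GPU" versus "newest GPU" distinction concrete, here is a rough sketch of the two selection policies. It is not the code in this patch; it assumes the detection tool prints one architecture name such as `sm_70` per line, and the function names are hypothetical.

```python
def parse_archs(output):
    """Split tool output into a list of architecture names, skipping blank lines."""
    return [line.strip() for line in output.splitlines() if line.strip()]


def sm_number(arch):
    """Extract the numeric part of an arch name, e.g. 'sm_90a' -> 90."""
    suffix = arch.split("_", 1)[1]
    return int("".join(ch for ch in suffix if ch.isdigit()))


def first_gpu(output):
    """Policy described above: target the first GPU reported."""
    archs = parse_archs(output)
    return archs[0] if archs else None


def newest_gpu(output):
    """Alternative policy: target the GPU with the highest sm number."""
    archs = parse_archs(output)
    return max(archs, key=sm_number) if archs else None
```

With a sample output of `"sm_70\nsm_89\n"`, `first_gpu` returns `sm_70` while `newest_gpu` returns `sm_89`, so the two policies only differ on mixed-GPU machines.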

> Problem #3 -- it adds an extra step to the reproducibility/debugging process. If/when someone reports an issue with a compilation done with `-mnative`, we'll inevitably have to start with clarifying questions -- what exactly was the hardware configuration of the machine where the compilation was done.

I'm not so sure; the actual architecture will show up when doing `-v` or with an LLVM stack dump, so unless the bug report is really unhelpful it should be visible somewhere. But I suppose it's possible. I think the current behavior is much less intuitive: we just default to `sm_52` and then don't execute anything when that fails to load. Either that or JIT the PTX we may or may not include.

> With my "GPU support dude for nontrivial number of users" hat on, I personally would really like not to open this can of worms. It's not a very big deal, but my gut is telling me that I will see all three cases once the option makes it into the tribal knowledge (hi, reddit & stack overflow!).
> 
> So, in short, the source code changes are OK, but I'm not a huge fan of `-mnative` in principle (both CPU and GPU variants). If others find it useful, I'm OK with adding the option, but it should probably come with documented caveats so affected users have a chance to find the answer if/when they run into trouble.

There's some argument against the `native` options that users are accustomed to, but because the CPU does it I feel like it's helpful to make the GPU do it for cases where the user just wants something that's guaranteed to work. This not working is weird considering that `-mcpu=native` works for AMDGPU and `--offload-arch=native` works for CUDA, HIP, and OpenMP currently.

https://github.com/llvm/llvm-project/pull/79373