[PATCH] D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Thu May 31 15:23:23 PDT 2018


tra added a comment.

In https://reviews.llvm.org/D47394#1118223, @gtbercea wrote:

> I tried this example (https://devblogs.nvidia.com/separate-compilation-linking-cuda-device-code/). It worked with NVCC but not with clang++. I can produce the main.o, particle.o, and v.o objects as relocatable (-fcuda-rdc), but the final step fails with a missing-reference error.


It's not clear what exactly you mean by the "final step" or what exactly the error was. Could you give me more details?

> This leads me to believe that embedding the CUDA fatbin code in the host object comes with limitations. If I were to change the OpenMP NVPTX toolchain to do the same then I would run into similar problems.

It's a two-part problem.

First, in the end we need to place the GPU-side binary (whether it's an object or an executable) in a way that the CUDA tools can recognize. You should end up with pretty much the same set of bits either way. If clang currently does not do that well enough, we should fix it.

The second part is what to do about GPU-side object files. NVCC has some under-the-hood magic that invokes nvlink. If we invoke clang for the final linking phase, it has no idea that some of the .o files may contain GPU code that needs extra steps before we can pass everything to the linker to produce the host executable. IMO the linking of GPU-side objects should be done outside of clang. I.e. one could do it with an extra build rule which invokes `nvcc --device-link ...` to link all GPU-side objects into a GPU executable, still wrapped in a host .o, which can then be linked into the host executable.
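A hypothetical sketch of such an extra build rule, using the file names from the NVIDIA blog example above (the GPU arch, output names, and library paths are assumptions, not from this review):

```shell
# Sketch only: assumes a CUDA toolchain on PATH and the main/particle/v
# sources from the separate-compilation blog example; not from this patch.

# 1. Compile each translation unit with relocatable device code embedded
#    in its host object (clang's -fcuda-rdc).
clang++ -fcuda-rdc --cuda-gpu-arch=sm_35 -c main.cu particle.cu v.cu

# 2. Extra build rule, outside of clang: device-link the GPU-side objects
#    into a GPU "executable", still wrapped in a host .o.
nvcc --device-link -arch=sm_35 main.o particle.o v.o -o device_link.o

# 3. The final host link now only sees fully linked GPU code.
clang++ main.o particle.o v.o device_link.o -o app \
    -L/usr/local/cuda/lib64 -lcudart
```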

> On the other hand, the example, ported to use OpenMP declare target regions (instead of __device__), compiles, links, and runs correctly.
> 
> In general, I feel that if we go the way you propose then the solution is truly confined to NVPTX. If we instead implement a scheme like the one in this patch then we give other toolchains a chance to perhaps fill the nvlink "gap" and eventually be able to handle offloading in a similar manner and support static linking.

I'm not sure how "fatbin + clang -fcuda-gpubinary" is any more confining to NVPTX than "fatbin + clang + ld -r" -- either way you rely on an NVIDIA-specific tool. If at some point you find it too confining, changing either of those will require pretty much the same amount of work.
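For reference, a minimal sketch of the "ld -r" half of that comparison (file names assumed): a partial link merges several host objects into one relocatable object, leaving the embedded device-code sections intact for a later device-link step rather than resolving them up front.

```shell
# Sketch only: file names are assumptions. "ld -r" produces a single
# relocatable .o rather than an executable, so the device code embedded
# in the inputs survives for a subsequent nvlink/device-link pass.
ld -r main.o particle.o v.o -o combined.o
```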


Repository:
  rC Clang

https://reviews.llvm.org/D47394




