[PATCH] D47394: [OpenMP][Clang][NVPTX] Replace bundling with partial linking for the OpenMP NVPTX device offloading toolchain

Wed May 30 03:30:52 PDT 2018

sfantao added a comment.

> In a discussion off-list I proposed adding constructor functions to all object files and handling them the way shared libraries are already handled today (i.e. register separately and let the runtime figure out how to relocate symbols in different translation units). I don't have an implementation of that approach so I can't claim that it works and doesn't have a huge performance impact (which we don't want either), but it should be agnostic of the offloading target so it may be worth investigating.

I don't understand how this would work. Doing something like that would require reimplementing the GPU-code linker, which in turn requires proprietary knowledge of the GPU binary format. I wouldn't know how to resolve all the relocations in the device code. In my view, that solution would only work (or at least be more easily implemented) if we didn't have relocatable code.
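
To make the quoted proposal concrete, here is a minimal sketch of per-object constructor registration. Every name in it (DeviceImage, registeredImages, registerDeviceImage, ImageRegistrar, and the blob contents) is hypothetical and chosen for illustration; none of it is an existing libomptarget or CUDA API, and both "object files" are simulated inside one translation unit.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical descriptor for the device code carried by one object file.
struct DeviceImage {
  std::string unit;   // name of the originating object (illustrative only)
  const void *begin;  // start of the embedded device image
  std::size_t size;   // size of the image in bytes
};

// Hypothetical runtime-side registry. A function-local static sidesteps
// static-initialization-order problems between translation units.
std::vector<DeviceImage> &registeredImages() {
  static std::vector<DeviceImage> images;
  return images;
}

// Hypothetical registration entry point (an assumption, not a real API).
void registerDeviceImage(const DeviceImage &image) {
  registeredImages().push_back(image);
}

// Stand-in for what the compiler would emit into each object file:
// a static object whose constructor runs before main().
struct ImageRegistrar {
  ImageRegistrar(const char *unit, const void *begin, std::size_t size) {
    registerDeviceImage({unit, begin, size});
  }
};

// Two simulated object files, each registering its own (fake) device image.
static const char unitABlob[] = "device code for a.o";
static ImageRegistrar registrarA("a.o", unitABlob, sizeof(unitABlob));

static const char unitBBlob[] = "device code for b.o";
static ImageRegistrar registrarB("b.o", unitBBlob, sizeof(unitBBlob));
```

The sketch only covers the registration half. Once the runtime holds the two images it still has to resolve references between them, and that is the step that amounts to reimplementing the device linker.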

> Assuming we do proceed with the back-to-CUDA approach, one thing I'd consider would be using clang's -fcuda-include-gpubinary option which CUDA uses to include GPU code into the host object. You may be able to use it to avoid compiling and partially linking .fatbin and host .o.

Cool, I agree this is worth investigating.

================
Comment at: lib/Driver/ToolChains/Cuda.cpp:536
+  }
 }

----------------
gtbercea wrote:
> sfantao wrote:
> > What prevents all this from being done in the bundler? If I understand it correctly, if the bundler implemented this wrapping, all the checks for libraries wouldn't be required and only two changes would be required in the driver:
> > 
> > - generate fatbin instead of cubin. This is straightforward to do by changing the device assembling job. In terms of the loading of the kernels by the device API, doing it through fatbin or cubin should be equivalent except that fatbin enables storing the PTX format and JIT for newer GPUs.
> > - Use NVIDIA linker as host linker.
> > 
> > This last requirement could be problematic if we get two targets attempting to use different (incompatible) linkers. If we get this kind of incompatibility we should get the appropriate diagnostic.
> What prevents it is the fact that the bundler is called AFTER the HOST and DEVICE object files have been produced. The creation of the fatbin (FATBINARY + CLANG++) needs to happen within the NVPTX toolchain.
> 
Why does it have to happen in the NVPTX toolchain? You are making the NVPTX toolchain generate an ELF object from another toolchain, right? What I'm suggesting is to do the work that mixes two (or more) toolchains in the bundler. Your inputs are still a fatbin and a host file.

================
Comment at: test/Driver/openmp-offload.c:497
 // RUN:   %clang -###  -fopenmp=libomp -o %t.out -lsomelib -target powerpc64le-linux -fopenmp-targets=powerpc64le-ibm-linux-gnu,x86_64-pc-linux-gnu %t.i -no-canonical-prefixes 2>&1 \
 // RUN:   | FileCheck -check-prefix=CHK-UBJOBS %s
 // RUN:   %clang -### -fopenmp=libomp -o %t.out -lsomelib -target powerpc64le-linux -fopenmp-targets=powerpc64le-ibm-linux-gnu,x86_64-pc-linux-gnu %t.i -save-temps -no-canonical-prefixes 2>&1 \
----------------
gtbercea wrote:
> gtbercea wrote:
> > sfantao wrote:
> > > We need a test for the static linking. The host linker has to be nvcc in that case, right?
> > The host linker is "ld". The "bundling" step is replaced (in the case of OpenMP NVPTX device offloading only) by a call to "ld -r" to partially link the 2 object files: the object file produced by the HOST toolchain and the object file produced by the OpenMP NVPTX device offloading toolchain (because we want to produce a single output).
> nvcc is not called at all in this patch.
OK, so how do you link device code? I.e., if you have two compilation units that depend on each other (some definition in one unit is used in the other), where are they linked together? Something has to understand the two files resulting from your "ld -r" step. My understanding is that this something is nvcc, which calls nvlink behind the scenes, right? So nvcc will do the unbundling and linking, right?

Repository:
  rC Clang

https://reviews.llvm.org/D47394