[PATCH] D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls

Wed Aug 22 09:58:46 PDT 2018

gregrodgers added a comment.

I like the idea of using an automatic include as a cc1 option (-include).  However, I would prefer a more general automatic include for OpenMP, not just one for math functions (__clang_cuda_device_functions.h).  Clang CUDA automatically includes __clang_cuda_runtime_wrapper.h, which in turn includes other files as needed, such as __clang_cuda_device_functions.h.  Let's hypothetically call my proposed automatic include for OpenMP __clang_openmp_runtime_wrapper.h.

Just because clang cuda defines functions in __clang_cuda_device_functions.h and automatically includes them does not make it right for OpenMP.  In general, function definitions in headers should be avoided.  The current function definitions in __clang_cuda_device_functions.h only work for hostile nv GPUs :).  Here is how we can avoid function definitions in the headers.  In a new openmp build process, we can build libm-nvptx.bc.  This can be done by compiling __clang_cuda_device_functions.h as a device-only compile.  Assuming current naming conventions, the resulting bc files would be installed in the same directory as libomptarget.so (.../lib).

How do we tell clang cc1 to use this bc library? Use -mlink-builtin-bitcode.   AddMathDeviceFunctions would then look something like this.

if (this is for device cc1) {

  CC1Args.push_back("-mlink-builtin-bitcode");
  if ( getTriple().isNVPTX())
    CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
  if ( getTriple().getArch() == llvm::Triple::amdgcn)
    CC1Args.push_back(DriverArgs.MakeArgString("libm-amdgcn.bc"));

}

You can think of a libm-<arch>.bc file as the device library equivalent of the host libm.so or libm.a.  This concept of "host-consistent" library definitions can go beyond math libraries.  In fact, I believe we should co-opt the -l (--library) option.  The driver toolchain should look for device bc libraries for any -lX command line option.  This gives us a strategy for adding user-defined device libraries.
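
To make the -lX idea concrete, here is a minimal sketch of the name mapping only.  The helper deviceBitcodeLibName is hypothetical and does not exist in clang; it just assumes the lib<X>-<arch>.bc pattern used in the file names above.  How the driver would search for the file is left out.

```cpp
#include <string>

// Hypothetical helper, not part of clang: maps the X from a "-lX" option and
// a device architecture tag (e.g. "nvptx", "amdgcn") to a bitcode file name,
// assuming the lib<X>-<arch>.bc pattern used in this comment.
std::string deviceBitcodeLibName(const std::string &lib,
                                 const std::string &arch) {
  return "lib" + lib + "-" + arch + ".bc";
}
```

With this mapping, -lm on an nvptx device pass would look for libm-nvptx.bc, and a user library -lfoo would look for libfoo-nvptx.bc.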

The above code hints at the idea of architecture-specific bc files (nvptx vs amdgcn).  The nvptx version would call into the cuda libdevice.  For radeon processors, we may want processor-optimized versions of the libraries, just like there are sub-architecture optimized versions of the cuda libdevice.  If we build --cuda-gpu-arch optimized versions of the math bc libs, then the above code will get a bit more complex, depending on the naming convention of the bc lib and the value of --cuda-gpu-arch (which should have an alias --offload-arch).
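
As one possible shape for that extra complexity, the sketch below builds an ordered list of candidate file names: a processor-optimized name first, then the generic one as a fallback.  The function candidateBitcodeLibs and the lib<X>-<arch>-<gpu>.bc naming are assumptions made for illustration; no naming convention has been decided here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of clang. Assumed naming convention:
//   lib<X>-<arch>-<gpu>.bc  processor-optimized build (e.g. gpu = "sm_60")
//   lib<X>-<arch>.bc        generic build for the architecture
// Returns the names in the order a driver might probe for them.
std::vector<std::string> candidateBitcodeLibs(const std::string &lib,
                                              const std::string &arch,
                                              const std::string &gpuArch) {
  std::vector<std::string> names;
  if (!gpuArch.empty())
    names.push_back("lib" + lib + "-" + arch + "-" + gpuArch + ".bc");
  names.push_back("lib" + lib + "-" + arch + ".bc");
  return names;
}
```

The driver would use the first candidate that exists in the library directory, so a toolchain that ships only the generic file keeps working.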

Using a bc lib significantly reduces the complexity of __clang_openmp_runtime_wrapper.h.  We do not need or see math device function definitions or the nv headers that they need.  However, the wrapper does need to correct the behaviour of rogue system headers that define host-optimized functions.  We can fix this by adding the following to __clang_openmp_runtime_wrapper.h so that only the device passes suppress them and host passes still get host-optimized functions.

#if defined(__AMDGCN__) || defined(__NVPTX__)
#define __NO_INLINE__ 1
#endif

There is a tradeoff to using pre-compiled bc libs.  It makes compile-time macro logic hard to implement.  For example, we can't do this:

#if defined(__CLANG_CUDA_APPROX_TRANSCENDENTALS__)
#define __FAST_OR_SLOW(fast, slow) fast
#else
#define __FAST_OR_SLOW(fast, slow) slow
#endif

The openmp build process would either need to build alternative bc libraries for each option or a supplemental bc library to address these types of options.
If some option is turned on, then an alternative lib or particular ordering of libs would be used to build the clang cc1 command.
For example, the above code for AddMathDeviceFunctions would contain this:

  ...
  if ( getTriple().isNVPTX()) {
     if (LangOpts.CUDADeviceApproxTranscendentals || LangOpts.FastMath) {
       CC1Args.push_back("-mlink-builtin-bitcode");
       CC1Args.push_back(DriverArgs.MakeArgString("libm-fast-nvptx.bc"));
     }
     CC1Args.push_back("-mlink-builtin-bitcode");
     CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
  }
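
The ordering logic in that fragment can be tried outside the driver with a small standalone sketch.  The function mathBitcodeArgs is hypothetical, and libm-fast-<arch>.bc is just the illustrative file name from the fragment above; the sketch only shows that the fast library is listed ahead of the base library when the option is on.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the driver logic above, not clang code.
// Returns the cc1 arguments for the math bitcode libraries of one device
// architecture. When approximate/fast math is requested, the fast library
// is emitted before the base library.
std::vector<std::string> mathBitcodeArgs(const std::string &arch,
                                         bool approxOrFastMath) {
  std::vector<std::string> args;
  if (approxOrFastMath) {
    args.push_back("-mlink-builtin-bitcode");
    args.push_back("libm-fast-" + arch + ".bc");
  }
  args.push_back("-mlink-builtin-bitcode");
  args.push_back("libm-" + arch + ".bc");
  return args;
}
```

Each library gets its own -mlink-builtin-bitcode flag, so the flag is never emitted without a file name after it.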

I personally believe that pre-built bc libraries with some consistency with their host-equivalent libraries are a saner approach for device libraries than complex header logic that is customized for each architecture.

Repository:
  rC Clang

https://reviews.llvm.org/D47849