[PATCH] D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls

Artem Belevich via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Jul 20 15:59:18 PDT 2018


tra added a comment.

In https://reviews.llvm.org/D47849#1126925, @gtbercea wrote:

> I just stumbled upon a very interesting situation.
>
> I noticed that, for OpenMP, the use of device math functions happens as I expected for -O0. For -O1 or higher, math functions such as "sqrt" resolve to llvm builtins/intrinsics:
>
>   call double @llvm.sqrt.f64(double %1)
>
>
> instead of the nvvm variant.


I believe we do have a pass that attempts to replace some nvvm intrinsics with their LLVM equivalents; it allows us to optimize the code better. My guess would be that the replacement simply does not happen at -O0.
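
To illustrate what I mean, here is a minimal sketch (CUDA-style; the function name and the exact pass doing the rewrite are assumptions on my part):

  // Minimal sketch only.  Whether the rewrite happens likely depends on
  // the optimization level -- my guess is that it simply does not run at
  // -O0 -- and possibly on fast-math settings.
  __device__ double f(double x) {
    // Emitted as llvm.nvvm.sqrt.rn.d; the optimizer may canonicalize that
    // to llvm.sqrt.f64, which matches what you observed at -O1 and above.
    return __nvvm_sqrt_rn_d(x);
  }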

> The surprising part (at least to me) is that the same llvm intrinsic is used when I use Clang to compile CUDA kernel code calling the "sqrt" function. I would have expected that the NVVM variant would be called for CUDA code.

What we may end up generating for any given standard library call from the device side depends on a number of factors and may vary.
Here's what typically happens:

- clang parses the CUDA headers and pulls in the 'standard' C math functions and bits of the C++ overloads. These usually call __something() (see the sketch after this list).
- CUDA versions up to 8.0 provided those __something() functions, which *usually* called __nv_something() in libdevice.
- As of CUDA-9, __something() became NVCC compiler builtins, so clang has to provide its own implementation -- __clang_cuda_device_functions.h. That implementation may use whatever does the job: any of __builtin.../__nvvm.../__nv_... are fair game, as long as it works.
- clang's CUDA wrapper headers do some magic to make the math parts of the standard C++ library work by providing a handful of functions that do the right thing. Usually those forward to the C math functions, but that may not always be the case.
- LLVM may replace some __nvvm* intrinsics with their generic LLVM equivalents.
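
As a rough illustration of that chain (a sketch only -- apart from __nv_sqrt, the names are made up, and the real headers are considerably messier and change between CUDA versions):

  // libdevice, shipped by CUDA as bitcode.
  __device__ double __nv_sqrt(double);

  // The "__something()" layer: CUDA's own headers before CUDA-9, or
  // clang's __clang_cuda_device_functions.h for CUDA-9 and newer.  It may
  // just as well use a __builtin_... or __nvvm_... form here instead.
  __device__ double __something_sqrt(double x) { return __nv_sqrt(x); }

  // The 'standard' C math function that user code and the C++ overloads
  // end up calling.
  __device__ double sqrt(double x) { return __something_sqrt(x); }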

In the end you may end up with somewhat different IR depending on the function and the CUDA version clang used.

> Is it ok for CUDA kernels to call llvm intrinsics instead of the device specific math library functions?

It depends. We cannot lower all LLVM intrinsics. Generally, you can't use intrinsics that are lowered to an external library call.
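
A rough sketch of the difference (assuming -fno-math-errno so that the builtins below become LLVM intrinsics; the function names are just for illustration):

  __device__ double fine(double x) {
    // llvm.sqrt.f64 has a native NVPTX lowering (PTX sqrt.rn.f64), so no
    // external library is needed.
    return __builtin_sqrt(x);
  }

  __device__ double problematic(double x) {
    // llvm.sin.f64 generally has no native NVPTX lowering and would fall
    // back to an external call to "sin", which does not exist on the
    // device, so this is likely to fail at instruction selection or link
    // time unless something maps it to __nv_sin() first.
    return __builtin_sin(x);
  }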

> If it's ok for CUDA can this be ok for OpenMP NVPTX too?
>  If not we probably need to fix it for both toolchains.

I don't have an answer for these. OpenMP seems to have somewhat different requirements compared to the C++ we assume for CUDA.

One thing you do need to consider, though, is that the wrapper headers are rather unstable. Their goal is to provide glue between the half-broken CUDA headers and the user's code. They are not intended to provide any sort of stability to anyone else. Every new CUDA version brings new and exciting changes to its headers, which requires a fair amount of changes in the wrappers.

If all you need is the C math functions, it *may* be OK, but perhaps there is a better approach.
Why not compile a real math library to bitcode and avoid all this weirdness with gluing together half-broken pieces of CUDA that are broken by design? Unlike real CUDA compilation, you don't have the constraint that you have to match NVCC 1:1. If you have your own device-side math library, you could use the regular math headers and link a real libm.bc instead of CUDA's libdevice. The rumors of "high performance" functions in libdevice are somewhat exaggerated, IMO. If you take a look at the IR in the libdevice of a recent CUDA version, you will see that a lot of the functions just call their llvm counterparts. If it turns out that in some case llvm generates slower code than what nvidia provides, I'm sure it will be possible to implement a reasonably fast replacement.
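
For instance, such a device-side libm could contain entries along these lines (purely a sketch of the idea, not an existing library):

  // Hypothetical fragment of a libm.bc built for the NVPTX target and
  // linked in place of libdevice.
  extern "C" __device__ double sqrt(double x) {
    // Forward to the LLVM intrinsic, which NVPTX lowers natively
    // (PTX sqrt.rn.f64).  Functions without a native lowering would need
    // a real implementation here instead.
    return __builtin_sqrt(x);
  }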


Repository:
  rC Clang

https://reviews.llvm.org/D47849




