[PATCH] D60907: [OpenMP] Add math functions support in OpenMP offloading

Johannes Doerfert via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Tue Apr 30 11:29:52 PDT 2019


jdoerfert added a comment.

In D60907#1484529 <https://reviews.llvm.org/D60907#1484529>, @hfinkel wrote:

> In D60907#1483660 <https://reviews.llvm.org/D60907#1483660>, @jdoerfert wrote:
>
> > In D60907#1483615 <https://reviews.llvm.org/D60907#1483615>, @hfinkel wrote:
> >
> > > In D60907#1479370 <https://reviews.llvm.org/D60907#1479370>, @gtbercea wrote:
> > >
> > > > In D60907#1479142 <https://reviews.llvm.org/D60907#1479142>, @hfinkel wrote:
> > > >
> > > > > In D60907#1479118 <https://reviews.llvm.org/D60907#1479118>, @gtbercea wrote:
> > > > >
> > > > > > Ping @hfinkel @tra
> > > > >
> > > > >
> > > > > The last two comments in D47849 <https://reviews.llvm.org/D47849> indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?
> > > >
> > > >
> > > > ...
> > > >
> > > > Hal, as far as I can tell, this solution is similar to yours but with a slightly different implementation. If there are particular aspects of this patch you would like to discuss or give feedback on, please let me know.
> > >
> > >
> > > The solution I suggested had the advantages of:
> > >
> > > 1. Being able to directly reuse the code in `__clang_cuda_device_functions.h`. On the other hand, using this solution we need to implement a wrapper function for every math function. When `__clang_cuda_device_functions.h` is updated, we need to update the OpenMP wrapper as well.
> >
> >
> > I'd even go as far as to argue that `__clang_cuda_device_functions.h` should include the internal math.h wrapper to get all math functions. See also the next comment.
> >
> > > 2. Providing access to wrappers for other CUDA intrinsics in a natural way (e.g., rnorm3d) [it looks a bit nicer to provide a host version of rnorm3d than __nv_rnorm3d in user code].
> >
> > @hfinkel 
> >  I don't see why you want to mix CUDA intrinsics with math.h overloads.
>
>
> What I had in mind was matching non-standard functions in a standard way. For example, let's just say that I have a CUDA kernel that uses the rnorm3d function, or I otherwise have a function that I'd like to write in OpenMP that will make good use of this CUDA function (because it happens to have an efficient device implementation). This is a function that CUDA provides, in the global namespace, although it's not standard.
>
> Then I can do something like this (depending on how we set up the implementation):
>
>   double rnorm3d(double a,  double b, double c) {
>     return sqrt(a*a + b*b + c*c);
>   }
>   
>   ...
>   
>   #pragma omp target
>   {
>     double a = ..., b = ..., c = ...;
>     double r = rnorm3d(a, b, c);
>   }
>   
>
> and, if we use the CUDA math headers for CUDA math-function support, then this might "just work." To be clear, I can see an argument that having this work is a bad idea ;) -- but it has the advantage of providing a way to take advantage of system-specific functions while still writing completely portable code.


Matching `rnorm3d` and replacing it with some nvvm "intrinsic" is not something I would like to depend on whether `math.h` was included or not. As you say, that is not how it works in Cuda either. I'm in favor of reusing the built-in recognition mechanism:
that is, if the target is nvptx and the name is `rnorm3d`, we match that name and use the appropriate intrinsic, as we already do for other functions on other targets.
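A minimal illustration of that recognition path, using `sqrt` as it works today (the `rnorm3d` handling would be a hypothetical extension of the same mechanism):

  #include <math.h>

  /* With -fno-math-errno, Clang recognizes the libm name and emits the
     llvm.sqrt.f64 intrinsic, which every target then lowers to its own
     instruction or library call.  Recognizing rnorm3d by name on nvptx
     (hypothetical here) would work the same way and would not depend on
     which header was included. */
  double scale(double x) { return sqrt(x); }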

>>   I added a rough outline of how I imagine the internal math.h header could look as a comment in D47849. Could you elaborate on how that differs from what you have in mind and how the other intrinsics come in?
> 
> That looks like what I had in mind (including `__clang_cuda_device_functions.h` to get the device functions).
> 
>> 
>> 
>>> 3. Being similar to the "declare variant" functionality from OpenMP 5, and thus, I suspect, closer to the solution we'll eventually be able to apply in a standard way to all targets.
>> 
>> I can see this.
>> 
>>>> This solution follows Alexey's suggestions. It allows the optimization of math calls where they apply (example: pow(x,2) => x*x), which was one of the issues with the previous solution I implemented.
>>> 
>>> So we're also missing that optimization for CUDA code when compiling with Clang? Isn't this something that should be fixed regardless?
>> 
>> Maybe through general built-in recognition and lowering into target-specific implementations/intrinsics late in the pipeline again?
> 
> I suspect that we need to match the intrinsics and perform the optimizations in LLVM at that level in order to get the optimizations for CUDA.

That seems reasonable to me. We could also match other intrinsics, e.g., `rnorm3d`, here as well, both by name and by the computation pattern.
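As a small illustration of the optimization gap mentioned above (the function below is just an example, and whether the folding fires can depend on fast-math settings): if a header rewrites `pow` into `__nv_pow` before the optimizer ever sees the call, the simplification keyed on the standard name can no longer apply.

  #include <math.h>

  /* LLVM can simplify pow(x, 2.0) into x * x when it still sees the
     standard libm call (or the llvm.pow intrinsic).  If the front end has
     already replaced the call with __nv_pow, that folding is missed. */
  double square(double x) { return pow(x, 2.0); }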

In D60907#1484643 <https://reviews.llvm.org/D60907#1484643>, @tra wrote:

> +1 to Hal's comments.
>
> @jdoerfert :
>
> > I'd even go as far as to argue that __clang_cuda_device_functions.h should include the internal math.h wrapper to get all math functions. See also the next comment.
>
> I'd argue the other way around -- include __clang_cuda_device_functions.h from math.h and do not preinclude anything.
>  If the user does not include math.h, their namespace should not be polluted by some random stuff. NVCC did this, but that's one of the most annoying 'features' we have to stay compatible with for the sake of keeping existing nvcc-compilable CUDA code happy.
>
> If users do include math.h, it should do the right thing, for both sides of the compilation.
>  IMO it's math.h that should trigger pulling the device functions in.


I actually don't want to preinclude anything, and my arguments are (mostly) about the OpenMP offloading code path, not necessarily Cuda.
Maybe to clarify, what I want is the following (a rough sketch follows the list):

1. Make sure `clang/Headers/math.h` is found first when `math.h` is included.
2. Use a scheme similar to the one described in https://reviews.llvm.org/D47849#1483653 in `clang/Headers/math.h`.
3. Only add `math.h` function overloads in our `math.h`. **<- This is debatable**
4. Include `clang/Headers/math.h` from `__clang_cuda_device_functions.h` to avoid duplication of math function declarations.
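
A rough, C-only sketch of one possible shape for such a wrapper (the guard names, the macro-based redirection, and the selection of functions are illustrative assumptions, not the outline from D47849 or a final header):

  #ifndef __CLANG_OPENMP_MATH_H__
  #define __CLANG_OPENMP_MATH_H__

  /* Always pull in the host math.h first (point 1). */
  #include_next <math.h>

  /* On the NVPTX device pass only, forward selected math calls to the
     libdevice implementations (point 2). */
  #if defined(__NVPTX__) && defined(_OPENMP)
  double __nv_pow(double, double);
  double __nv_sin(double);

  static inline double __omp_pow(double __x, double __y) { return __nv_pow(__x, __y); }
  static inline double __omp_sin(double __x) { return __nv_sin(__x); }

  #define pow(__x, __y) __omp_pow(__x, __y)
  #define sin(__x) __omp_sin(__x)
  #endif

  #endif /* __CLANG_OPENMP_MATH_H__ */

The duplication concern in point 4 would then be addressed by having `__clang_cuda_device_functions.h` include this wrapper rather than re-declaring the same math functions itself.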


Repository:
  rC Clang

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D60907/new/

https://reviews.llvm.org/D60907




