[PATCH] D47849: [OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls

Jonas Hahnfeld via Phabricator via cfe-commits cfe-commits at lists.llvm.org
Fri Aug 10 02:53:51 PDT 2018


Hahnfeld removed a reviewer: Hahnfeld.
Hahnfeld added a comment.

I feel like there is no progress in the discussion (here and off-list), partly because we might still not be talking about the same things. So I'm stepping down from this revision to unblock review from somebody else.

Here's my current understanding of the issue(s):

- `math.h` (or transitively included files) on both PowerPC and x86 contain inline assembly.
  - On x86, Clang bails out directly because the code uses the `x` input constraint, which doesn't exist for NVPTX (-> `invalid input constraint 'x' in asm`).
  - From my understanding the header passes Sema analysis on PowerPC, but fails during CodeGen because the assembly instructions are invalid on NVPTX?
- This problem can be avoided (for testing purposes; including `math.h` should be fixed as well some day!) by explicitly declaring all needed math functions (like `extern double exp(double);`).
  - Without additional flags this makes Clang emit Intrinsic Functions <https://llvm.org/docs/LangRef.html#intrinsic-functions> like `@llvm.exp.f64` for NVPTX.
  - That's because `IsMathErrnoDefault()` returns `false` for the Cuda ToolChain. This behaviour can be overridden using `-fmath-errno` (the test case `nvptx_device_math_functions.c` uses this flag; I'm not sure why?)
- That at least looks to be producing correct IR in both cases which is then passed to the backend:
  1. For intrinsic functions (with some notable exceptions) the backend complains `Cannot select: [...] ExternalSymbol'exp'`.
    - Some exceptions are `sqrt.f32`, `sqrt.f64`, `sin.f32` and `cos.f32`: The backend will directly lower them to the corresponding PTX instruction. Unfortunately there is none for `exp`...
  2. For "real" function calls (like `call double @exp(double %3)`) `nvlink` will throw `Undefined reference` errors.

This patch takes the following approach:

1. Avoid intrinsics for math builtins by passing `-fno-math-builtin` for device compilation.
2. Use the CUDA header to redirect math functions to their libdevice equivalents in the frontend, mostly just prefixed by `__nv_` (for example `exp(a)` -> `__nv_exp(a)`).

The downside of this approach is that LLVM doesn't recognize these function calls and therefore doesn't apply its libcall-folding optimizations. For example, `pow(a, 2)` is normally transformed into a multiplication, but `__nv_pow(a, 2)` is not.

In https://reviews.llvm.org/D47849#1124638, @Hahnfeld wrote:

> IMO this goes into the right direction, we should use the fast implementation in libdevice.


So yeah, my comment seems to be outdated if these simple optimizations don't happen anymore with this patch: I don't want to use a fast `pow(a, 2)`, I don't want to call a library function for that at all.

We could of course make LLVM recognize the calls to libdevice and handle them the same way. But that's adding more workarounds just to keep this patch from regressing on easy cases (in terms of transformations).
Another approach would be to make the NVPTX backend lower remaining calls of math functions to libdevice equivalents. I came across https://reviews.llvm.org/D34708 which seems to go into that direction (but doesn't work out-of-the-box after fixing some build errors, complaining about `Undefined external symbol`s because libdevice is optimized away since it wasn't needed before)...


Repository:
  rC Clang

https://reviews.llvm.org/D47849




