[PATCH] D50845: [CUDA/OpenMP] Define only some host macros during device compilation

Thu Aug 16 11:13:00 PDT 2018

tra added a subscriber: pcc.
tra added a comment.

In https://reviews.llvm.org/D50845#1202551, @ABataev wrote:

> In https://reviews.llvm.org/D50845#1202550, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D50845#1202540, @ABataev wrote:
> >
> > > Maybe for device compilation we also should define `__NO_MATH_INLINES` and `__NO_STRING_INLINES` macros to disable inline assembly in glibc?
> >
> >
> > The problem is that `__NO_MATH_INLINES` doesn't even avoid all inline assembly from `bits/mathinline.h` :-( incidentally Clang already defines `__NO_MATH_INLINES` for x86 (due to an old bug which has been fixed long ago) - and on CentOS we still have problems as described in PR38464.
> >
> > As a second thought: This might be valid for NVPTX, but I don't think it's a good idea for x86-like offloading targets - they might well profit from inline assembly code.
>
>
> I'm not saying that we should define those macros for all targets, only for NVPTX. But still, it may disable some inline assembly for other architectures.

IMO, trying to avoid inline assembly by defining(or not) some macros and hoping for the best is rather fragile as we'll have to chase *all* patches that host's math.h may have on any given system.

If I understand it correctly, the root cause of this exercise is that we want to compile for GPU using plain C. CUDA avoids this issue by separating device and host code via target attributes and clang has few special cases to ignore inline assembly errors in the host code if we're compiling for device. For OpenMP there's no such separation, not in the system headers, at least.

Perhaps we can just add another special case for inline assembly & OpenMP. If there's an error in inline assembly during device compilation and we see that the function came from the system headers, then ignore the error, poison the function, etc. That said, I don't know enough about OpenMP to tell whether that's feasible or whether that's sufficient.

Another option would be to implement some sort of attribute-based overloading. Then OpenMP can provide its own version of the device-side library function without clashing with system headers.

<start of general handwaving>
On a side note, I did spend about a year and got 3 almost-but-not-quite-working 'solutions' of exactly this problem during early days of adding CUDA support to clang. I'm very thoroughly convinced that verbatim use of headers from platform A and making them work on platform B is not feasible unless you have control of both sets of headers. Considering that system headers are *not* under our control, we do need to have a way for them to coexist without clashing. Preprocessor magic may work in limited circumstances (e.g. we only need to deal with two variants of headers that never change), but the cases where that approach is going to fall apart are rather easy to come by. Clang's __clang_cuda_runtime_wrapper.h is a horrible example of that -- it sort of works, but every CUDA release I cross my fingers and hope that they didn't decide to change *anything* in their headers.

================
Comment at: test/SemaCUDA/builtins.cu:15-17
+#if !defined(__x86_64__)
+#error "Expected to see preprocessor macros from the host."
 #endif
----------------
Hahnfeld wrote:
> @tra I'm not sure here: Do we want `__PTX__` to be defined during host compilation? I can't think of a valid use case, but you have more experience with user code.
I'm not sure what was the reason for adding this macro. @pcc did it long time ago in rL157173. Perhaps he has better idea about the purpose.

AFAICT, It's not used by CUDA headers, nor can I find any uses in any of CUDA sources we have (excluding clang's own tests). The only use case I see is in cl_kernel.h in Apple's xcode SDK. 
System/Library/Frameworks/OpenCL.framework/Versions/A/lib/clang/2.0/include/cl_kernel.h
It's used there to #define a lot of convert_FOO to __builtin_convert(FOO...).

Based on that single use case, not defining __PTX__ for the host compilation should probably be OK.

Repository:
  rC Clang

https://reviews.llvm.org/D50845