[PATCH] D80897: [OpenMP] Initial support for std::complex in target regions

Mon Jun 1 12:27:58 PDT 2020

tra added a comment.

In D80897#2066952 <https://reviews.llvm.org/D80897#2066952>, @jdoerfert wrote:

> In D80897#2066723 <https://reviews.llvm.org/D80897#2066723>, @tra wrote:
>
> > Hmm. I'm pretty sure tensorflow is using std::complex for various types. I'm surprised that we haven't seen these functions missing.
>
>
> Which functions and missing from where? In CUDA-mode we did provide `__XXXXc3` already.

I mean the `__XXXXc3` functions added by the patch. I've tried with clang as it is now, before your patch.

> 
> 
>> Plain CUDA (e.g. https://godbolt.org/z/Us6oXC) code appears to have no references to `__mul*` or `__div*`, at least for optimized builds, but they do popup in unoptimized ones. Curiously enough, unoptimized code compiled with `-stdlib=libc++ --std=c++11` does not need the soft-float functions. That would explain why we don't see the build breaks.
> 
> Its not that simple, and tbh, I don't have the full picture yet. Plain (clang) CUDA uses these functions (https://godbolt.org/z/dp_FY2), they just disappear after inlining because of the linkage. If you however enable `-fast-math` they are not used (https://godbolt.org/z/_N-STh). I couldn't run with stdlib=libc++ locally and godbold cuts of the output so I'm not sure if they are used and inlined or not used.

I've checked it locally and verified that adding `--stdlib=libc++ -std=c++11` to your first example shows that `__*c3` functions do not appear in IR regardless of inlining or opt level.
I wonder what is that that libstdc++ does that makes those functions show up in IR. AFAICT, it's not invoked directly by the library, so it must be something clang has generated. Perhaps something should be fixed there.

>> These differences suggest that these changes may need to be more nuanced with regard to the standard c++ library version and, possibly, the C++ standard used.
>>  If possible, I would prefer to limit interference with the standard libraries only to the cases where it's necessary.
> 
> The way I understand this is that we can always provide correct weak versions of `__XXXXc3` without any correctness issues. They will be stripped if they are not needed anyway. That said, this patch should not modify the CUDA behavior (except minor float vs double corrections in the `__XXXXc3` methods). Could you elaborate what interference you expect?

One example would be if/when we grow a better libm support for GPUs. Granted, it's just few functions and we could just remove these instances then.
I agree that adding these functions now will probably not interfere with anything we have now -- they are device-side overloads and nobody calls them directly.
The suggestion was based on a general principle of minimizing the changes that overlap with the standard libraries -- there are quite a few versions out there and I can't predict what quirks of theirs I'm not aware of. I've been burned too many times by that to be wary.

================
Comment at: clang/lib/Headers/__clang_cuda_complex_builtins.h:63
+
+__DEVICE__ double _Complex __muldc3(double __a, double __b, double __c,
+                                    double __d) {
----------------
jdoerfert wrote:
> tra wrote:
> > Soft-float library has bunch of other functions. https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html
> > 
> > I wonder why only the complex variants of the soft-float support functions are missing. 
> > Does it mean that x86 code also does rely on the library to do complex multiplication?
> > If x86 can do complex ops, why can't nvptx?
> > If x86 can't, would make sense to teach it?
> > I wonder why only the complex variants of the soft-float support functions are missing.
> 
> I would guess others are conceptually missing too, the question is if we need them. I did grep the clang source for 7 non-complex soft-float support functions from the different categories listed in the gcc docs, none was found.
> 
> > Does it mean that x86 code also does rely on the library to do complex multiplication?
> 
> I think so, yes. Some system library will provide the implementation of `__muldc3` for the slow path of a complex multiplication.
> 
> > If x86 can do complex ops, why can't nvptx?
> > If x86 can't, would make sense to teach it?
> 
> I think I don't understand this (and maybe the question above). What we do in CUDA right now, and with this patch in OpenMP, is to provide the `__XXXXc3` functions on the device. Usually they are in some system library that we just not have on the device so we have to add them somehow. 
I'm OK with providing device-side equivalents of the host standard library.

What' I'm trying to figure out if why we don't need to do it in some cases.
In case whe we do rely on these functions, but don't have them, we have at least two choices -- provide the missing functions (this patch) or ensure we never need these functions (what I'm trying to figure out). If there's a way to reliably ensure that we don't need these functions, I'd prefer that.

Right now the observation is that libc++ somehow avoids it. If we can improve clang that libstdc++ would also work without falling back on the `__*c3` functions, that may be a better fix for this. That said, I don't understand yet why/how the standard c++ libraries end up with different code in this case.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80897/new/

https://reviews.llvm.org/D80897