r283680 - [CUDA] Support <complex> and std::min/max on the device.

Mon Oct 10 18:54:45 PDT 2016

----- Original Message -----
> From: "Justin Lebar" <jlebar at google.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "Clang Commits" <cfe-commits at lists.llvm.org>
> Sent: Saturday, October 8, 2016 10:56:37 PM
> Subject: Re: r283680 - [CUDA] Support <complex> and std::min/max on the device.
> 
> > > The fix is not as simple as simply changing our implementation of
> > e.g.
> > std::isnan to call __builtin_isnanf, because we also would want to
> > fix
> > ::isnanf,
> >
> > No, if I understand what you're saying, you specifically wouldn't.
> 
> I understand how this is feasible on the CPU, because ::isnan is a
> library function that can never be inlined.  But on the GPU, these
> library functions are (at the moment) always declared inline.  That
> seems to complicate this idea.
> 
> Right now ::isnan(x) is going to call __nv_isnan(x), which computes
> abs(x).  If we pass -ffast-math, the compiler will be able to assume
> that abs(x) is not nan.  I guess you're saying that we would need to
> special-case __nv_isnan so that -ffast-math is always off
> (essentially).  But, what if it gets inlined?
> 
> It looks like libstdc++'s std::isnan calls __builtin_isnan (same for
> its std::isinf), and its ::isnan is an alias for std::isnan.  So
> libstdc++'s isnan is going to return false with -ffast-math (or
> anyway
> it will do the same thing as the builtin functions, which aiui is
> what
> you're proposing libc++'s isnan *not* do).

This was not my first choice, but was the direction that Marshall preferred based on our conversations up to that point. I had not noticed this aspect of libstdc++'s behavior. It is indeed the case that, with libstdc++, std::isnan gets optimized away with -ffast-math, but ::isnan does not. That might be desirable, or it might just be weird given that I'd expect std::isnan and ::isnan to essentally do the same thing for POD FP types.

> 
> > This is important for use cases where, for example, even though the
> > user might want fast math, they still need to check their inputs
> > for NaNs.
> 
> Since this isn't going to work with libstdc++, and it relies on not
> doing anything that the compiler might construe as "arithmetic" on
> the
> value, this seems pretty dicey to me.  One could instead compile a
> separate TU without -ffast-math and do all their validation there?
> I'd have a lot more confidence in that working today, continuing to
> work tomorrow, and being portable across compilers and standard
> libraries.

I certainly agree that I have a higher confidence in the multiple TU approach.

> 
> I don't mean to relitigate https://reviews.llvm.org/D18639, but I am
> not convinced that libc++'s isnan should have a path that returns
> true
> with -ffast-math, given that
> 
>  * libstdc++'s isnan will always return false with -ffast-math,
>  * it's at best complicated for us to make this work if you can
>  inline
> the body of isnan (as we can on the GPU),
>  * it's at best complicated for users to write "correct" C++ that
> calls isnan with -ffast-math, especially if they want their code to
> continue to work in the future in the face of changing compilers
> (-ffast-math is not specified anywhere, so who knows what it means),
> and
>  * there's a relatively simple workaround (use a separate TU) that
> sidesteps all these problems.
> 
> I'm not saying we should go in and change libc++'s CPU implementation
> of isnan to call the builtin.  I'll leave that up to people who care
> about CPU code.  But at least on the GPU, it still makes sense to me
> to fix the problem you originally identified by making
> std::/::isnan/isinf always return false/true with -ffast-math.  Which
> I think we should be able to do with the intrinsic upgrade I
> originally suggested.
> 
> On a separate note: Can we make __libcpp_isnan and __libcpp_isinf
> constexpr?  This will make them implicitly host+device functions,
> solving the problem on the GPU.  Otherwise I may have to reimplement
> these functions in a header, and that's lame.  Although I am clearly
> not above that.  :)

I think this makes sense ;) We should check with Eric or Marshall.

 -Hal

> 
> On Sat, Oct 8, 2016 at 6:50 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> > ----- Original Message -----
> >> From: "Justin Lebar" <jlebar at google.com>
> >> To: "Hal Finkel" <hfinkel at anl.gov>
> >> Cc: "Clang Commits" <cfe-commits at lists.llvm.org>
> >> Sent: Saturday, October 8, 2016 6:16:12 PM
> >> Subject: Re: r283680 - [CUDA] Support <complex> and std::min/max
> >> on the device.
> >>
> >> Hal,
> >>
> >> On NVPTX, these functions eventually get resolved to function
> >> calls
> >> in
> >> libdevice, e.g. __nv_isinff and __nv_isnanf.
> >>
> >> llvm does not do a good job understanding the body of e.g.
> >> __nvvm_isnanf, because it uses nvptx-specific intrinsic functions,
> >> notably @llvm.nvvm.fabs.f.  These are opaque to the LLVM
> >> optimizer.
> >>
> >> The fix is not as simple as simply changing our implementation of
> >> e.g.
> >> std::isnan to call __builtin_isnanf, because we also would want to
> >> fix
> >> ::isnanf,
> >
> > No, if I understand what you're saying, you specifically wouldn't.
> > We had a discussion about this on the review thread(s) that led to
> > r283051, and while we want to elide the checks inside the
> > mathematical functions, we don't want to replace isnan itself with
> > something that will get optimized away. We want to keep the
> > ability for the user to explicitly check for NaNs, etc. even if we
> > don't want those checks to appear inside of mathematical
> > operations. This is important for use cases where, for example,
> > even though the user might want fast math, they still need to
> > check their inputs for NaNs.
> >
> >  -Hal
> >
> >> but we can't override that implementation without some
> >> major
> >> surgery on the nvptx headers.
> >>
> >> David Majnemer and I talked about one way to fix this, namely by
> >> using
> >> IR intrinsic upgrades to replace the opaque nvptx intrinsics with
> >> LLVM
> >> intrinsics.  LLVM would then be able to understand these
> >> intrinsics
> >> and optimize them.  We would reap benefits not just for
> >> std::isnan,
> >> but also e.g. constant-folding calls like std::abs that also
> >> eventually end up in libnvvm.
> >>
> >> I did the first half of this work, by adding lowerings for the
> >> various
> >> LLVM intrinsics to the NVPTX backend [1].  But David is now busy
> >> with
> >> other things and hasn't been able to help with the second half,
> >> namely
> >> using IR upgrades to replace the nvptx target-specific intrinsics
> >> with
> >> generalized LLVM intrinsics.  Perhaps this is something you'd be
> >> able
> >> to help with?
> >>
> >> In any case, using builtins here without fixing std::isnan and
> >> ::isnan
> >> feels to me to be the wrong solution.  It seems to me that we
> >> should
> >> be able to rely on std::isnan and friends being fast, and if
> >> they're
> >> not, we should fix that.  Using builtins here would be "cheating"
> >> to
> >> make our implementation faster than user code.
> >>
> >> I'll note, separately, that on x86, clang does not seem to
> >> constant-fold std::isinf or __builtin_isinff to false with
> >> -ffast-math
> >> -ffinite-math-only.  GCC can do it.  Clang gets std::isnan.
> >> https://godbolt.org/g/vZB55a
> >>
> >> By the way, the changes you made to libc++ unfortunately break
> >> this
> >> patch with libc++, because e.g. __libcpp_isnan is not a device
> >> function.  I'll have to think about how to fix that -- I may send
> >> you
> >> a patch.
> >>
> >> Regards,
> >> -Justin
> >>
> >> [1] https://reviews.llvm.org/D24300
> >>
> >> On Sat, Oct 8, 2016 at 3:36 PM, Hal Finkel <hfinkel at anl.gov>
> >> wrote:
> >> > Hi Justin,
> >> >
> >> > This is neat!
> >> >
> >> > I see a bunch of uses of std::isinf, etc. here. It tends to be
> >> > important that, when using -ffast-math (or -ffinite-math-only)
> >> > these checks get optimized away. Can you please check that they
> >> > do? If not, you might mirror what I've done in r283051 for
> >> > libc++,
> >> > which is similar to what libstdc++ ends up doing, so that we use
> >> > __builtin_isnan/isinf/isfinite.
> >> >
> >> > Thanks again,
> >> > Hal
> >> >
> >> > ----- Original Message -----
> >> >> From: "Justin Lebar via cfe-commits"
> >> >> <cfe-commits at lists.llvm.org>
> >> >> To: cfe-commits at lists.llvm.org
> >> >> Sent: Saturday, October 8, 2016 5:16:13 PM
> >> >> Subject: r283680 - [CUDA] Support <complex> and std::min/max on
> >> >> the device.
> >> >>
> >> >> Author: jlebar
> >> >> Date: Sat Oct  8 17:16:12 2016
> >> >> New Revision: 283680
> >> >>
> >> >> URL: http://llvm.org/viewvc/llvm-project?rev=283680&view=rev
> >> >> Log:
> >> >> [CUDA] Support <complex> and std::min/max on the device.
> >> >>
> >> >> Summary:
> >> >> We do this by wrapping <complex> and <algorithm>.
> >> >>
> >> >> Tests are in the test-suite.
> >> >>
> >> >> Reviewers: tra
> >> >>
> >> >> Subscribers: jhen, beanz, cfe-commits, mgorny
> >> >>
> >> >> Differential Revision: https://reviews.llvm.org/D24979
> >> >>
> >> >> Added:
> >> >>     cfe/trunk/lib/Headers/__clang_cuda_complex_builtins.h
> >> >>     cfe/trunk/lib/Headers/cuda_wrappers/
> >> >>     cfe/trunk/lib/Headers/cuda_wrappers/algorithm
> >> >>     cfe/trunk/lib/Headers/cuda_wrappers/complex
> >> >> Modified:
> >> >>     cfe/trunk/lib/Driver/ToolChains.cpp
> >> >>     cfe/trunk/lib/Headers/CMakeLists.txt
> >> >>     cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
> >> >>
> >> >> Modified: cfe/trunk/lib/Driver/ToolChains.cpp
> >> >> URL:
> >> >> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Driver/ToolChains.cpp?rev=283680&r1=283679&r2=283680&view=diff
> >> >> ==============================================================================
> >> >> --- cfe/trunk/lib/Driver/ToolChains.cpp (original)
> >> >> +++ cfe/trunk/lib/Driver/ToolChains.cpp Sat Oct  8 17:16:12
> >> >> 2016
> >> >> @@ -4694,6 +4694,15 @@ void Linux::AddClangCXXStdlibIncludeArgs
> >> >>
> >> >>  void Linux::AddCudaIncludeArgs(const ArgList &DriverArgs,
> >> >>                                 ArgStringList &CC1Args) const {
> >> >> +  if (!DriverArgs.hasArg(options::OPT_nobuiltininc)) {
> >> >> +    // Add cuda_wrappers/* to our system include path.  This
> >> >> lets
> >> >> us
> >> >> wrap
> >> >> +    // standard library headers.
> >> >> +    SmallString<128> P(getDriver().ResourceDir);
> >> >> +    llvm::sys::path::append(P, "include");
> >> >> +    llvm::sys::path::append(P, "cuda_wrappers");
> >> >> +    addSystemInclude(DriverArgs, CC1Args, P);
> >> >> +  }
> >> >> +
> >> >>    if (DriverArgs.hasArg(options::OPT_nocudainc))
> >> >>      return;
> >> >>
> >> >>
> >> >> Modified: cfe/trunk/lib/Headers/CMakeLists.txt
> >> >> URL:
> >> >> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Headers/CMakeLists.txt?rev=283680&r1=283679&r2=283680&view=diff
> >> >> ==============================================================================
> >> >> --- cfe/trunk/lib/Headers/CMakeLists.txt (original)
> >> >> +++ cfe/trunk/lib/Headers/CMakeLists.txt Sat Oct  8 17:16:12
> >> >> 2016
> >> >> @@ -24,10 +24,13 @@ set(files
> >> >>    bmiintrin.h
> >> >>    __clang_cuda_builtin_vars.h
> >> >>    __clang_cuda_cmath.h
> >> >> +  __clang_cuda_complex_builtins.h
> >> >>    __clang_cuda_intrinsics.h
> >> >>    __clang_cuda_math_forward_declares.h
> >> >>    __clang_cuda_runtime_wrapper.h
> >> >>    cpuid.h
> >> >> +  cuda_wrappers/algorithm
> >> >> +  cuda_wrappers/complex
> >> >>    clflushoptintrin.h
> >> >>    emmintrin.h
> >> >>    f16cintrin.h
> >> >>
> >> >> Added: cfe/trunk/lib/Headers/__clang_cuda_complex_builtins.h
> >> >> URL:
> >> >> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Headers/__clang_cuda_complex_builtins.h?rev=283680&view=auto
> >> >> ==============================================================================
> >> >> --- cfe/trunk/lib/Headers/__clang_cuda_complex_builtins.h
> >> >> (added)
> >> >> +++ cfe/trunk/lib/Headers/__clang_cuda_complex_builtins.h Sat
> >> >> Oct
> >> >>  8
> >> >> 17:16:12 2016
> >> >> @@ -0,0 +1,203 @@
> >> >> +/*===-- __clang_cuda_complex_builtins - CUDA impls of runtime
> >> >> complex fns ---===
> >> >> + *
> >> >> + * Permission is hereby granted, free of charge, to any person
> >> >> obtaining a copy
> >> >> + * of this software and associated documentation files (the
> >> >> "Software"), to deal
> >> >> + * in the Software without restriction, including without
> >> >> limitation
> >> >> the rights
> >> >> + * to use, copy, modify, merge, publish, distribute,
> >> >> sublicense,
> >> >> and/or sell
> >> >> + * copies of the Software, and to permit persons to whom the
> >> >> Software is
> >> >> + * furnished to do so, subject to the following conditions:
> >> >> + *
> >> >> + * The above copyright notice and this permission notice shall
> >> >> be
> >> >> included in
> >> >> + * all copies or substantial portions of the Software.
> >> >> + *
> >> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
> >> >> KIND,
> >> >> EXPRESS OR
> >> >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> >> >> MERCHANTABILITY,
> >> >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO
> >> >> EVENT
> >> >> SHALL THE
> >> >> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
> >> >> DAMAGES
> >> >> OR
> >> >> OTHER
> >> >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
> >> >> OTHERWISE,
> >> >> ARISING FROM,
> >> >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> >> >> OTHER
> >> >> DEALINGS IN
> >> >> + * THE SOFTWARE.
> >> >> + *
> >> >> +
> >> >> *===-----------------------------------------------------------------------===
> >> >> + */
> >> >> +
> >> >> +#ifndef __CLANG_CUDA_COMPLEX_BUILTINS
> >> >> +#define __CLANG_CUDA_COMPLEX_BUILTINS
> >> >> +
> >> >> +// This header defines __muldc3, __mulsc3, __divdc3, and
> >> >> __divsc3.
> >> >>  These are
> >> >> +// libgcc functions that clang assumes are available when
> >> >> compiling
> >> >> c99 complex
> >> >> +// operations.  (These implementations come from libc++, and
> >> >> have
> >> >> been modified
> >> >> +// to work with CUDA.)
> >> >> +
> >> >> +extern "C" inline __device__ double _Complex __muldc3(double
> >> >> __a,
> >> >> double __b,
> >> >> +                                                      double
> >> >> __c,
> >> >> double __d) {
> >> >> +  double __ac = __a * __c;
> >> >> +  double __bd = __b * __d;
> >> >> +  double __ad = __a * __d;
> >> >> +  double __bc = __b * __c;
> >> >> +  double _Complex z;
> >> >> +  __real__(z) = __ac - __bd;
> >> >> +  __imag__(z) = __ad + __bc;
> >> >> +  if (std::isnan(__real__(z)) && std::isnan(__imag__(z))) {
> >> >> +    int __recalc = 0;
> >> >> +    if (std::isinf(__a) || std::isinf(__b)) {
> >> >> +      __a = std::copysign(std::isinf(__a) ? 1 : 0, __a);
> >> >> +      __b = std::copysign(std::isinf(__b) ? 1 : 0, __b);
> >> >> +      if (std::isnan(__c))
> >> >> +        __c = std::copysign(0, __c);
> >> >> +      if (std::isnan(__d))
> >> >> +        __d = std::copysign(0, __d);
> >> >> +      __recalc = 1;
> >> >> +    }
> >> >> +    if (std::isinf(__c) || std::isinf(__d)) {
> >> >> +      __c = std::copysign(std::isinf(__c) ? 1 : 0, __c);
> >> >> +      __d = std::copysign(std::isinf(__d) ? 1 : 0, __d);
> >> >> +      if (std::isnan(__a))
> >> >> +        __a = std::copysign(0, __a);
> >> >> +      if (std::isnan(__b))
> >> >> +        __b = std::copysign(0, __b);
> >> >> +      __recalc = 1;
> >> >> +    }
> >> >> +    if (!__recalc && (std::isinf(__ac) || std::isinf(__bd) ||
> >> >> +                      std::isinf(__ad) || std::isinf(__bc))) {
> >> >> +      if (std::isnan(__a))
> >> >> +        __a = std::copysign(0, __a);
> >> >> +      if (std::isnan(__b))
> >> >> +        __b = std::copysign(0, __b);
> >> >> +      if (std::isnan(__c))
> >> >> +        __c = std::copysign(0, __c);
> >> >> +      if (std::isnan(__d))
> >> >> +        __d = std::copysign(0, __d);
> >> >> +      __recalc = 1;
> >> >> +    }
> >> >> +    if (__recalc) {
> >> >> +      // Can't use std::numeric_limits<double>::infinity() --
> >> >> that
> >> >> doesn't have
> >> >> +      // a device overload (and isn't constexpr before C++11,
> >> >> naturally).
> >> >> +      __real__(z) = __builtin_huge_valf() * (__a * __c - __b *
> >> >> __d);
> >> >> +      __imag__(z) = __builtin_huge_valf() * (__a * __d + __b *
> >> >> __c);
> >> >> +    }
> >> >> +  }
> >> >> +  return z;
> >> >> +}
> >> >> +
> >> >> +extern "C" inline __device__ float _Complex __mulsc3(float
> >> >> __a,
> >> >> float __b,
> >> >> +                                                     float
> >> >> __c,
> >> >> float __d) {
> >> >> +  float __ac = __a * __c;
> >> >> +  float __bd = __b * __d;
> >> >> +  float __ad = __a * __d;
> >> >> +  float __bc = __b * __c;
> >> >> +  float _Complex z;
> >> >> +  __real__(z) = __ac - __bd;
> >> >> +  __imag__(z) = __ad + __bc;
> >> >> +  if (std::isnan(__real__(z)) && std::isnan(__imag__(z))) {
> >> >> +    int __recalc = 0;
> >> >> +    if (std::isinf(__a) || std::isinf(__b)) {
> >> >> +      __a = std::copysign(std::isinf(__a) ? 1 : 0, __a);
> >> >> +      __b = std::copysign(std::isinf(__b) ? 1 : 0, __b);
> >> >> +      if (std::isnan(__c))
> >> >> +        __c = std::copysign(0, __c);
> >> >> +      if (std::isnan(__d))
> >> >> +        __d = std::copysign(0, __d);
> >> >> +      __recalc = 1;
> >> >> +    }
> >> >> +    if (std::isinf(__c) || std::isinf(__d)) {
> >> >> +      __c = std::copysign(std::isinf(__c) ? 1 : 0, __c);
> >> >> +      __d = std::copysign(std::isinf(__d) ? 1 : 0, __d);
> >> >> +      if (std::isnan(__a))
> >> >> +        __a = std::copysign(0, __a);
> >> >> +      if (std::isnan(__b))
> >> >> +        __b = std::copysign(0, __b);
> >> >> +      __recalc = 1;
> >> >> +    }
> >> >> +    if (!__recalc && (std::isinf(__ac) || std::isinf(__bd) ||
> >> >> +                      std::isinf(__ad) || std::isinf(__bc))) {
> >> >> +      if (std::isnan(__a))
> >> >> +        __a = std::copysign(0, __a);
> >> >> +      if (std::isnan(__b))
> >> >> +        __b = std::copysign(0, __b);
> >> >> +      if (std::isnan(__c))
> >> >> +        __c = std::copysign(0, __c);
> >> >> +      if (std::isnan(__d))
> >> >> +        __d = std::copysign(0, __d);
> >> >> +      __recalc = 1;
> >> >> +    }
> >> >> +    if (__recalc) {
> >> >> +      __real__(z) = __builtin_huge_valf() * (__a * __c - __b *
> >> >> __d);
> >> >> +      __imag__(z) = __builtin_huge_valf() * (__a * __d + __b *
> >> >> __c);
> >> >> +    }
> >> >> +  }
> >> >> +  return z;
> >> >> +}
> >> >> +
> >> >> +extern "C" inline __device__ double _Complex __divdc3(double
> >> >> __a,
> >> >> double __b,
> >> >> +                                                      double
> >> >> __c,
> >> >> double __d) {
> >> >> +  int __ilogbw = 0;
> >> >> +  // Can't use std::max, because that's defined in
> >> >> <algorithm>,
> >> >> and
> >> >> we don't
> >> >> +  // want to pull that in for every compile.  The CUDA headers
> >> >> define
> >> >> +  // ::max(float, float) and ::max(double, double), which is
> >> >> sufficient for us.
> >> >> +  double __logbw = std::logb(max(std::abs(__c),
> >> >> std::abs(__d)));
> >> >> +  if (std::isfinite(__logbw)) {
> >> >> +    __ilogbw = (int)__logbw;
> >> >> +    __c = std::scalbn(__c, -__ilogbw);
> >> >> +    __d = std::scalbn(__d, -__ilogbw);
> >> >> +  }
> >> >> +  double __denom = __c * __c + __d * __d;
> >> >> +  double _Complex z;
> >> >> +  __real__(z) = std::scalbn((__a * __c + __b * __d) / __denom,
> >> >> -__ilogbw);
> >> >> +  __imag__(z) = std::scalbn((__b * __c - __a * __d) / __denom,
> >> >> -__ilogbw);
> >> >> +  if (std::isnan(__real__(z)) && std::isnan(__imag__(z))) {
> >> >> +    if ((__denom == 0.0) && (!std::isnan(__a) ||
> >> >> !std::isnan(__b)))
> >> >> {
> >> >> +      __real__(z) = std::copysign(__builtin_huge_valf(), __c)
> >> >> *
> >> >> __a;
> >> >> +      __imag__(z) = std::copysign(__builtin_huge_valf(), __c)
> >> >> *
> >> >> __b;
> >> >> +    } else if ((std::isinf(__a) || std::isinf(__b)) &&
> >> >> std::isfinite(__c) &&
> >> >> +               std::isfinite(__d)) {
> >> >> +      __a = std::copysign(std::isinf(__a) ? 1.0 : 0.0, __a);
> >> >> +      __b = std::copysign(std::isinf(__b) ? 1.0 : 0.0, __b);
> >> >> +      __real__(z) = __builtin_huge_valf() * (__a * __c + __b *
> >> >> __d);
> >> >> +      __imag__(z) = __builtin_huge_valf() * (__b * __c - __a *
> >> >> __d);
> >> >> +    } else if (std::isinf(__logbw) && __logbw > 0.0 &&
> >> >> std::isfinite(__a) &&
> >> >> +               std::isfinite(__b)) {
> >> >> +      __c = std::copysign(std::isinf(__c) ? 1.0 : 0.0, __c);
> >> >> +      __d = std::copysign(std::isinf(__d) ? 1.0 : 0.0, __d);
> >> >> +      __real__(z) = 0.0 * (__a * __c + __b * __d);
> >> >> +      __imag__(z) = 0.0 * (__b * __c - __a * __d);
> >> >> +    }
> >> >> +  }
> >> >> +  return z;
> >> >> +}
> >> >> +
> >> >> +extern "C" inline __device__ float _Complex __divsc3(float
> >> >> __a,
> >> >> float __b,
> >> >> +                                                     float
> >> >> __c,
> >> >> float __d) {
> >> >> +  int __ilogbw = 0;
> >> >> +  float __logbw = std::logb(max(std::abs(__c),
> >> >> std::abs(__d)));
> >> >> +  if (std::isfinite(__logbw)) {
> >> >> +    __ilogbw = (int)__logbw;
> >> >> +    __c = std::scalbn(__c, -__ilogbw);
> >> >> +    __d = std::scalbn(__d, -__ilogbw);
> >> >> +  }
> >> >> +  float __denom = __c * __c + __d * __d;
> >> >> +  float _Complex z;
> >> >> +  __real__(z) = std::scalbn((__a * __c + __b * __d) / __denom,
> >> >> -__ilogbw);
> >> >> +  __imag__(z) = std::scalbn((__b * __c - __a * __d) / __denom,
> >> >> -__ilogbw);
> >> >> +  if (std::isnan(__real__(z)) && std::isnan(__imag__(z))) {
> >> >> +    if ((__denom == 0) && (!std::isnan(__a) ||
> >> >> !std::isnan(__b)))
> >> >> {
> >> >> +      __real__(z) = std::copysign(__builtin_huge_valf(), __c)
> >> >> *
> >> >> __a;
> >> >> +      __imag__(z) = std::copysign(__builtin_huge_valf(), __c)
> >> >> *
> >> >> __b;
> >> >> +    } else if ((std::isinf(__a) || std::isinf(__b)) &&
> >> >> std::isfinite(__c) &&
> >> >> +               std::isfinite(__d)) {
> >> >> +      __a = std::copysign(std::isinf(__a) ? 1 : 0, __a);
> >> >> +      __b = std::copysign(std::isinf(__b) ? 1 : 0, __b);
> >> >> +      __real__(z) = __builtin_huge_valf() * (__a * __c + __b *
> >> >> __d);
> >> >> +      __imag__(z) = __builtin_huge_valf() * (__b * __c - __a *
> >> >> __d);
> >> >> +    } else if (std::isinf(__logbw) && __logbw > 0 &&
> >> >> std::isfinite(__a) &&
> >> >> +               std::isfinite(__b)) {
> >> >> +      __c = std::copysign(std::isinf(__c) ? 1 : 0, __c);
> >> >> +      __d = std::copysign(std::isinf(__d) ? 1 : 0, __d);
> >> >> +      __real__(z) = 0 * (__a * __c + __b * __d);
> >> >> +      __imag__(z) = 0 * (__b * __c - __a * __d);
> >> >> +    }
> >> >> +  }
> >> >> +  return z;
> >> >> +}
> >> >> +
> >> >> +#endif // __CLANG_CUDA_COMPLEX_BUILTINS
> >> >>
> >> >> Modified: cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
> >> >> URL:
> >> >> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h?rev=283680&r1=283679&r2=283680&view=diff
> >> >> ==============================================================================
> >> >> --- cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h
> >> >> (original)
> >> >> +++ cfe/trunk/lib/Headers/__clang_cuda_runtime_wrapper.h Sat
> >> >> Oct
> >> >>  8
> >> >> 17:16:12 2016
> >> >> @@ -312,6 +312,7 @@ __device__ inline __cuda_builtin_gridDim
> >> >>
> >> >>  #include <__clang_cuda_cmath.h>
> >> >>  #include <__clang_cuda_intrinsics.h>
> >> >> +#include <__clang_cuda_complex_builtins.h>
> >> >>
> >> >>  // curand_mtgp32_kernel helpfully redeclares blockDim and
> >> >>  threadIdx
> >> >>  in host
> >> >>  // mode, giving them their "proper" types of dim3 and uint3.
> >> >>   This
> >> >>  is
> >> >>
> >> >> Added: cfe/trunk/lib/Headers/cuda_wrappers/algorithm
> >> >> URL:
> >> >> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Headers/cuda_wrappers/algorithm?rev=283680&view=auto
> >> >> ==============================================================================
> >> >> --- cfe/trunk/lib/Headers/cuda_wrappers/algorithm (added)
> >> >> +++ cfe/trunk/lib/Headers/cuda_wrappers/algorithm Sat Oct  8
> >> >> 17:16:12
> >> >> 2016
> >> >> @@ -0,0 +1,96 @@
> >> >> +/*===---- complex - CUDA wrapper for <algorithm>
> >> >> ----------------------------===
> >> >> + *
> >> >> + * Permission is hereby granted, free of charge, to any person
> >> >> obtaining a copy
> >> >> + * of this software and associated documentation files (the
> >> >> "Software"), to deal
> >> >> + * in the Software without restriction, including without
> >> >> limitation
> >> >> the rights
> >> >> + * to use, copy, modify, merge, publish, distribute,
> >> >> sublicense,
> >> >> and/or sell
> >> >> + * copies of the Software, and to permit persons to whom the
> >> >> Software is
> >> >> + * furnished to do so, subject to the following conditions:
> >> >> + *
> >> >> + * The above copyright notice and this permission notice shall
> >> >> be
> >> >> included in
> >> >> + * all copies or substantial portions of the Software.
> >> >> + *
> >> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
> >> >> KIND,
> >> >> EXPRESS OR
> >> >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> >> >> MERCHANTABILITY,
> >> >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO
> >> >> EVENT
> >> >> SHALL THE
> >> >> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
> >> >> DAMAGES
> >> >> OR
> >> >> OTHER
> >> >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
> >> >> OTHERWISE,
> >> >> ARISING FROM,
> >> >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> >> >> OTHER
> >> >> DEALINGS IN
> >> >> + * THE SOFTWARE.
> >> >> + *
> >> >> +
> >> >> *===-----------------------------------------------------------------------===
> >> >> + */
> >> >> +
> >> >> +#ifndef __CLANG_CUDA_WRAPPERS_ALGORITHM
> >> >> +#define __CLANG_CUDA_WRAPPERS_ALGORITHM
> >> >> +
> >> >> +// This header defines __device__ overloads of std::min/max,
> >> >> but
> >> >> only if we're
> >> >> +// <= C++11.  In C++14, these functions are constexpr, and so
> >> >> are
> >> >> implicitly
> >> >> +// __host__ __device__.
> >> >> +//
> >> >> +// We don't support the initializer_list overloads because
> >> >> +// initializer_list::begin() and end() are not __host__
> >> >> __device__
> >> >> functions.
> >> >> +//
> >> >> +// When compiling in C++14 mode, we could force std::min/max
> >> >> to
> >> >> have
> >> >> different
> >> >> +// implementations for host and device, by declaring the
> >> >> device
> >> >> overloads
> >> >> +// before the constexpr overloads appear.  We choose not to do
> >> >> this
> >> >> because
> >> >> +
> >> >> +//  a) why write our own implementation when we can use one
> >> >> from
> >> >> the
> >> >> standard
> >> >> +//     library? and
> >> >> +//  b) libstdc++ is evil and declares min/max inside a header
> >> >> that
> >> >> is included
> >> >> +//     *before* we include <algorithm>.  So we'd have to
> >> >> unconditionally
> >> >> +//     declare our __device__ overloads of min/max, but that
> >> >> would
> >> >> pollute
> >> >> +//     things for people who choose not to include
> >> >> <algorithm>.
> >> >> +
> >> >> +#include_next <algorithm>
> >> >> +
> >> >> +#if __cplusplus <= 201103L
> >> >> +
> >> >> +// We need to define these overloads in exactly the namespace
> >> >> our
> >> >> standard
> >> >> +// library uses (including the right inline namespace),
> >> >> otherwise
> >> >> they won't be
> >> >> +// picked up by other functions in the standard library (e.g.
> >> >> functions in
> >> >> +// <complex>).  Thus the ugliness below.
> >> >> +#ifdef _LIBCPP_BEGIN_NAMESPACE_STD
> >> >> +_LIBCPP_BEGIN_NAMESPACE_STD
> >> >> +#else
> >> >> +namespace std {
> >> >> +#ifdef _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> +_GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> +#endif
> >> >> +#endif
> >> >> +
> >> >> +template <class __T, class __Cmp>
> >> >> +inline __device__ const __T &
> >> >> +max(const __T &__a, const __T &__b, __Cmp __cmp) {
> >> >> +  return __cmp(__a, __b) ? __b : __a;
> >> >> +}
> >> >> +
> >> >> +template <class __T>
> >> >> +inline __device__ const __T &
> >> >> +max(const __T &__a, const __T &__b) {
> >> >> +  return __a < __b ? __b : __a;
> >> >> +}
> >> >> +
> >> >> +template <class __T, class __Cmp>
> >> >> +inline __device__ const __T &
> >> >> +min(const __T &__a, const __T &__b, __Cmp __cmp) {
> >> >> +  return __cmp(__b, __a) ? __b : __a;
> >> >> +}
> >> >> +
> >> >> +template <class __T>
> >> >> +inline __device__ const __T &
> >> >> +min(const __T &__a, const __T &__b) {
> >> >> +  return __a < __b ? __b : __a;
> >> >> +}
> >> >> +
> >> >> +#ifdef _LIBCPP_END_NAMESPACE_STD
> >> >> +_LIBCPP_END_NAMESPACE_STD
> >> >> +#else
> >> >> +#ifdef _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> +_GLIBCXX_END_NAMESPACE_VERSION
> >> >> +#endif
> >> >> +} // namespace std
> >> >> +#endif
> >> >> +
> >> >> +#endif // __cplusplus <= 201103L
> >> >> +#endif // __CLANG_CUDA_WRAPPERS_ALGORITHM
> >> >>
> >> >> Added: cfe/trunk/lib/Headers/cuda_wrappers/complex
> >> >> URL:
> >> >> http://llvm.org/viewvc/llvm-project/cfe/trunk/lib/Headers/cuda_wrappers/complex?rev=283680&view=auto
> >> >> ==============================================================================
> >> >> --- cfe/trunk/lib/Headers/cuda_wrappers/complex (added)
> >> >> +++ cfe/trunk/lib/Headers/cuda_wrappers/complex Sat Oct  8
> >> >> 17:16:12
> >> >> 2016
> >> >> @@ -0,0 +1,79 @@
> >> >> +/*===---- complex - CUDA wrapper for <complex>
> >> >> ------------------------------===
> >> >> + *
> >> >> + * Permission is hereby granted, free of charge, to any person
> >> >> obtaining a copy
> >> >> + * of this software and associated documentation files (the
> >> >> "Software"), to deal
> >> >> + * in the Software without restriction, including without
> >> >> limitation
> >> >> the rights
> >> >> + * to use, copy, modify, merge, publish, distribute,
> >> >> sublicense,
> >> >> and/or sell
> >> >> + * copies of the Software, and to permit persons to whom the
> >> >> Software is
> >> >> + * furnished to do so, subject to the following conditions:
> >> >> + *
> >> >> + * The above copyright notice and this permission notice shall
> >> >> be
> >> >> included in
> >> >> + * all copies or substantial portions of the Software.
> >> >> + *
> >> >> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
> >> >> KIND,
> >> >> EXPRESS OR
> >> >> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> >> >> MERCHANTABILITY,
> >> >> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO
> >> >> EVENT
> >> >> SHALL THE
> >> >> + * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
> >> >> DAMAGES
> >> >> OR
> >> >> OTHER
> >> >> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
> >> >> OTHERWISE,
> >> >> ARISING FROM,
> >> >> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
> >> >> OTHER
> >> >> DEALINGS IN
> >> >> + * THE SOFTWARE.
> >> >> + *
> >> >> +
> >> >> *===-----------------------------------------------------------------------===
> >> >> + */
> >> >> +
> >> >> +#pragma once
> >> >> +
> >> >> +// Wrapper around <complex> that forces its functions to be
> >> >> __host__
> >> >> +// __device__.
> >> >> +
> >> >> +// First, include host-only headers we think are likely to be
> >> >> included by
> >> >> +// <complex>, so that the pragma below only applies to
> >> >> <complex>
> >> >> itself.
> >> >> +#if __cplusplus >= 201103L
> >> >> +#include <type_traits>
> >> >> +#endif
> >> >> +#include <stdexcept>
> >> >> +#include <cmath>
> >> >> +#include <sstream>
> >> >> +
> >> >> +// Next, include our <algorithm> wrapper, to ensure that
> >> >> device
> >> >> overloads of
> >> >> +// std::min/max are available.
> >> >> +#include <algorithm>
> >> >> +
> >> >> +#pragma clang force_cuda_host_device begin
> >> >> +
> >> >> +// When compiling for device, ask libstdc++ to use its own
> >> >> implements of
> >> >> +// complex functions, rather than calling builtins (which
> >> >> resolve
> >> >> to
> >> >> library
> >> >> +// functions that don't exist when compiling CUDA device
> >> >> code).
> >> >> +//
> >> >> +// This is a little dicey, because it causes libstdc++ to
> >> >> define
> >> >> a
> >> >> different
> >> >> +// set of overloads on host and device.
> >> >> +//
> >> >> +//   // Present only when compiling for host.
> >> >> +//   __host__ __device__ void complex<float> sin(const
> >> >> complex<float>& x) {
> >> >> +//     return __builtin_csinf(x);
> >> >> +//   }
> >> >> +//
> >> >> +//   // Present when compiling for host and for device.
> >> >> +//   template <typename T>
> >> >> +//   void __host__ __device__ complex<T> sin(const complex<T>&
> >> >> x)
> >> >> {
> >> >> +//     return complex<T>(sin(x.real()) * cosh(x.imag()),
> >> >> +//                       cos(x.real()), sinh(x.imag()));
> >> >> +//   }
> >> >> +//
> >> >> +// This is safe because when compiling for device, all
> >> >> function
> >> >> calls in
> >> >> +// __host__ code to sin() will still resolve to *something*,
> >> >> even
> >> >> if
> >> >> they don't
> >> >> +// resolve to the same function as they resolve to when
> >> >> compiling
> >> >> for host.  We
> >> >> +// don't care that they don't resolve to the right function
> >> >> because
> >> >> we won't
> >> >> +// codegen this host code when compiling for device.
> >> >> +
> >> >> +#pragma push_macro("_GLIBCXX_USE_C99_COMPLEX")
> >> >> +#pragma push_macro("_GLIBCXX_USE_C99_COMPLEX_TR1")
> >> >> +#define _GLIBCXX_USE_C99_COMPLEX 0
> >> >> +#define _GLIBCXX_USE_C99_COMPLEX_TR1 0
> >> >> +
> >> >> +#include_next <complex>
> >> >> +
> >> >> +#pragma pop_macro("_GLIBCXX_USE_C99_COMPLEX_TR1")
> >> >> +#pragma pop_macro("_GLIBCXX_USE_C99_COMPLEX")
> >> >> +
> >> >> +#pragma clang force_cuda_host_device end
> >> >>
> >> >>
> >> >> _______________________________________________
> >> >> cfe-commits mailing list
> >> >> cfe-commits at lists.llvm.org
> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
> >> >>
> >> >
> >> > --
> >> > Hal Finkel
> >> > Lead, Compiler Technology and Programming Languages
> >> > Leadership Computing Facility
> >> > Argonne National Laboratory
> >>
> >
> > --
> > Hal Finkel
> > Lead, Compiler Technology and Programming Languages
> > Leadership Computing Facility
> > Argonne National Laboratory
> 

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory