[cfe-dev] CUDA and libm

Justin Lebar via cfe-dev cfe-dev at lists.llvm.org
Sun Jan 24 11:29:10 PST 2016


Hi, I'm looking for some advice on the right way to implement math
functions in device-side CUDA code.  I'm pretty new here, so please
forgive the likely many gross misunderstandings below.

There are three classes of things I'm concerned with:

 * Functions declared in math.h (e.g. sinf),
 * Builtin math functions (e.g. __builtin_sinf), and
 * LLVM intrinsic math instructions (e.g. llvm.sin.f32).
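
Concretely, all three can show up around the same translation unit.
Here's a hypothetical kernel for illustration (the third form has no
C-level spelling; it only appears in the IR):

  __global__ void kernel(float *out, float v) {
    out[0] = sinf(v);            // 1. math.h function
    out[1] = __builtin_sinf(v);  // 2. clang builtin
    // 3. the llvm intrinsic appears only in the IR, e.g. as
    //    "%r = call float @llvm.sin.f32(float %v)".
  }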

At the moment the way this works is that the CUDA headers provided by
nvidia define inline functions along the lines of

  __device__ float sinf(float v) {
  #ifdef FAST_MATH
    return __nv_fast_sinf(v);
  #else
    return __nv_sinf(v);
  #endif
  }

This function is C++, *not* extern "C".  __nv_fast_sinf and __nv_sinf
are defined in libdevice [1], a bitcode library provided by nvidia.
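
The extern "C"-ness matters for linkage.  Under the Itanium C++ ABI,
the header's sinf mangles to a different symbol than the C function
of the same name.  A sketch (the two declarations are alternatives,
not meant to coexist in one file):

  __device__ float sinf(float v);             // mangles to _Z4sinff
  extern "C" __device__ float sinf(float v);  // symbol is plain "sinf"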

Some functions inside libdevice, e.g. __nv_fast_sinf, do nothing more
than call the equivalent llvm nvvm intrinsic, e.g.
llvm.nvvm.sin.approx.f, which the NVPTX backend then lowers to the
equivalent PTX instruction.  Other functions in libdevice, such as
__nv_sinf, do nontrivial computation written out explicitly in the
llvm bitcode.
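
In C-like terms, the fast variant behaves roughly like the following
(a sketch of the bitcode's behavior, not nvidia's actual source;
__nvvm_sin_approx_f is clang's builtin spelling of the intrinsic):

  __device__ float __nv_fast_sinf(float v) {
    // Compiles to a call to llvm.nvvm.sin.approx.f, which the NVPTX
    // backend lowers to the PTX instruction sin.approx.f32.
    return __nvvm_sin_approx_f(v);
  }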

Following so far?  If so, you may note that the state of the world is
rather incomplete!  Here are the problems I see at the moment:

 * Many builtins don't work.  For example, clang emits __builtin_sinf
as a call to sinf [2].  But that function doesn't exist on the device;
nvidia's not-extern-"C" ::sinf mangles to a different symbol.  In
particular this means that libstdc++ is not going to work well, since
it implements e.g. std::sin(float) as a call to __builtin_sinf (see
the sketch after this list).

 * Many math optimizations aren't going to work (I think), because e.g.
SimplifyLibCalls checks for function names like "log" [3], but by the
time we get there, we're calling __nv_log.

 * (At least some) llvm intrinsics sort of work, but e.g. llvm.sin.f32
gets lowered to the PTX instruction sin.approx.f32, while ::sinf from
the CUDA headers only does this transformation if fast-math is
enabled.  Maybe this is sound if we only emit llvm.sin.f32 when
fast-math is enabled; I dunno.
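
To make the libstdc++ problem above concrete, its <cmath> defines
std::sin(float) roughly as follows (paraphrased), so on the device
every such call ends up targeting a symbol that nothing defines:

  namespace std {
    inline float sin(float __x) {
      // clang emits this builtin as a call to the C symbol "sinf",
      // which doesn't exist in device code.
      return __builtin_sinf(__x);
    }
  }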

My question for the list is about the right way to fix these problems.
It seems to me that since the optimizer explicitly uses knowledge of
the various math functions, we shouldn't define inline versions of
them.  Instead, we should leave them as plain calls to e.g. sinf until
we lower to PTX in llvm.  We have a header baked into clang that
already disables some CUDA headers; in theory we should be able to
disable CUDA's math_functions.h using the same mechanism.
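
A minimal sketch of that mechanism, assuming CUDA's math_functions.h
uses a conventional include guard (the exact macro name here is a
guess on my part):

  // In clang's baked-in CUDA wrapper header: pre-define the include
  // guard so that a later #include of math_functions.h is a no-op.
  #define __MATH_FUNCTIONS_H__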

When lowering to PTX, we can make the determination as to whether we
want to call a function defined in libdevice, invoke an nvvm
intrinsic, or whatever.  If we call into libdevice we'll want to rerun
some optimization passes, but I presume we can order these passes
appropriately.
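
A sketch of what that per-function decision could look like in the
backend (the table name, layout, and entries are hypothetical):

  struct MathLowering {
    const char *CName;      // name as clang emits it, e.g. "sinf"
    const char *Libdevice;  // precise libdevice implementation
    const char *FastIntrin; // approximate intrinsic, for fast-math
  };
  static const MathLowering LoweringTable[] = {
    {"sinf", "__nv_sinf", "llvm.nvvm.sin.approx.f"},
    {"cosf", "__nv_cosf", "llvm.nvvm.cos.approx.f"},
    // ...
  };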

In order to do this, we'll need libdevice to be available to llvm.
It's not clear to me whether it is at the moment; clang sees it, but
I'm not sure if clang passes both its generated IR and all of
libdevice to llvm, or if it just copies the relevant definitions from
libdevice into the IR it sends to llvm.  If it's the latter, we could
always copy *all* of libdevice into the generated IR.  But I hope we
could do better.

I think this proposal lets us solve all three problems above, but
again I'm not sure I'm not missing something, or if there's a more
canonical way to do this.  Any feedback is appreciated!

-Justin

[1] http://docs.nvidia.com/cuda/libdevice-users-guide/__nv_sin.html#__nv_sin
[2] https://github.com/llvm-mirror/clang/blob/e2636ac0bad65451c3eb6272d7ab3abbba96da17/lib/CodeGen/CGBuiltin.cpp#L1971
[3] https://github.com/llvm-mirror/llvm/blob/b3bc79d5556108307026be07e7eaa644cce041ab/lib/Transforms/Utils/SimplifyLibCalls.cpp#L1339


