[cfe-dev] CUDA and libm

Justin Lebar via cfe-dev cfe-dev at lists.llvm.org
Tue Jan 26 14:59:17 PST 2016


Thanks a lot, Hal.  I've sent a patch disabling the standard library
functions for NVPTX in TLI for now, since none of those functions is
guaranteed to be available on the device.  See D16604 for details
about why we couldn't just map e.g. __nv_sin to sin.
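
For the curious, a minimal sketch of the approach (against
lib/Analysis/TargetLibraryInfo.cpp; the actual patch in D16604 may
spell the helper differently):

  // In initialize(), when targeting NVPTX, mark every known library
  // function unavailable: device code has no libm/libc, so plain
  // symbols like sinf aren't guaranteed to resolve to anything.
  if (T.isNVPTX()) {
    TLI.disableAllFunctions();
    return;
  }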

-Justin

On Mon, Jan 25, 2016 at 10:00 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> ----- Original Message -----
>> From: "Justin Lebar via cfe-dev" <cfe-dev at lists.llvm.org>
>> To: cfe-dev at lists.llvm.org, c-gpu-team at google.com
>> Sent: Sunday, January 24, 2016 1:29:10 PM
>> Subject: [cfe-dev] CUDA and libm
>>
>> Hi, I'm looking for some advice on the right way to implement math
>> functions in device-side CUDA code.  I'm pretty new here, so please
>> forgive the likely many gross misunderstandings below.
>>
>> There are three classes of things I'm concerned with:
>>
>>  * Functions declared in math.h (e.g. sinf),
>>  * Builtin math functions (e.g. __builtin_sinf), and
>>  * LLVM intrinsic math instructions (e.g. llvm.sin.f32).
>>
>> At the moment the way this works is that the CUDA headers provided by
>> nvidia define inline functions along the lines of
>>
>>   __device__ float sinf(float v) {
>>   #ifdef FAST_MATH
>>     return __nv_fast_sinf(v);
>>   #else
>>     return __nv_sinf(v);
>>   #endif
>>   }
>>
>> This function is C++, *not* extern "C".  __nv_sinf and __nv_fast_sinf
>> are defined in libdevice [1], a bitcode library provided by nvidia.
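>>
>> For concreteness, the declarations involved look roughly like this
>> (signatures per the libdevice users guide [1]; the __device__
>> spelling is how the CUDA headers would see them, and is my
>> assumption):
>>
>>   extern "C" {
>>     __device__ float __nv_sinf(float x);       // full-precision
>>     __device__ float __nv_fast_sinf(float x);  // approximate
>>   }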
>>
>> Some functions inside libdevice, e.g. __nv_fast_sinf, do nothing more
>> than call the equivalent llvm nvvm intrinsic, e.g.
>> llvm.nvvm.sin.approx.f.  This then gets lowered to the equivalent
>> PTX machine instruction.  Other functions in libdevice, such as
>> __nv_sinf, do nontrivial computation explicitly written out in the
>> llvm bitcode.
>>
>> Following so far?  If so, you may note that the state of the world is
>> rather incomplete!  Here are the problems I see at the moment:
>>
>>  * Many builtins don't work.  For example, clang emits __builtin_sinf
>> as a call to sinf [2].  But no such function exists on the device;
>> nvidia's not-extern-"C" ::sinf is not the same symbol.  In particular
>> this means that libstdc++ is not going to work well, since it
>> implements e.g. std::sin(float) as a call to __builtin_sinf.  (A
>> short repro sketch follows this list.)
>>
>> * Many math optimizations aren't going to work (I think), because
>> e.g. SimplifyLibCalls checks for function names like "log" [3], but
>> by the time we get there, we're calling __nv_log.
>
> We already have a solution to this problem. When you instantiate TLI, you can call TLI::setAvailableWithName to set an alternate name for some of the library calls. We already do some of this in lib/Analysis/TargetLibraryInfo.cpp in initialize.
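>
> Roughly like this (a sketch only; whether each __nv_* function is
> actually a drop-in replacement is exactly what needs checking):
>
>   // In TargetLibraryInfoImpl's initialize(), for the NVPTX triple:
>   TLI.setAvailableWithName(LibFunc::sinf, "__nv_sinf");
>   TLI.setAvailableWithName(LibFunc::logf, "__nv_logf");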
>
>  -Hal
>
>>
>>  * (At least some) llvm intrinsics sort of work, but e.g.
>> llvm.sin.f32 gets lowered to the PTX instruction sin.approx.f32,
>> while ::sinf from the CUDA headers only does this transformation if
>> fast-math is enabled.  Maybe this is sound if we only emit
>> llvm.sin.f32 when fast-math is enabled; I dunno.
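>>
>> Here's the promised repro sketch for the builtins problem; the exact
>> diagnostic depends on the toolchain, but the call has nowhere to
>> resolve to:
>>
>>   #include <cmath>
>>
>>   __device__ float f(float x) {
>>     // libstdc++'s std::sin(float) forwards to __builtin_sinf, which
>>     // clang emits as a call to sinf; no such symbol exists in device
>>     // code, so the call fails to resolve.
>>     return std::sin(x);
>>   }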
>>
>> My question for the list is about the right way to fix these
>> problems.  It seems to me that since the optimizer explicitly uses
>> knowledge of the various math functions, we shouldn't define inline
>> versions of them.  Instead, we should leave them as plain calls to
>> e.g. sinf until we lower to nvvm in llvm.  We have a header baked
>> into clang that already disables some CUDA headers; in theory we
>> should be able to disable CUDA's math_functions.h using the same
>> mechanism.
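>>
>> A sketch of that mechanism (the guard macro's name is my assumption
>> and would need to match NVIDIA's actual include guard):
>>
>>   // In clang's pre-included CUDA wrapper header: define CUDA's own
>>   // include guard up front, so math_functions.h becomes a no-op and
>>   // sinf et al. remain plain, unresolved calls for llvm to lower.
>>   #define __MATH_FUNCTIONS_H__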
>>
>> When lowering to nvvm, we can make the determination as to whether we
>> want to call a function defined in libdevice, invoke an nvvm
>> intrinsic, or whatever.  If we call into libdevice we'll want to
>> rerun some optimization passes, but I presume we can order these
>> passes appropriately.
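>>
>> The lowering step could be as simple as a late IR pass that renames
>> leftover libm declarations (a sketch; FastMath stands in for however
>> we end up plumbing -ffast-math through):
>>
>>   // Rewrite declarations so calls resolve against libdevice once
>>   // it's linked in.
>>   for (Function &F : M)
>>     if (F.isDeclaration() && F.getName() == "sinf")
>>       F.setName(FastMath ? "__nv_fast_sinf" : "__nv_sinf");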
>>
>> In order to do this, we'll need libdevice to be available to llvm.
>> It's not clear to me whether it is at the moment; clang sees it, but
>> I'm not sure if clang passes both its generated IR and all of
>> libdevice to llvm, or if it just copies the relevant definitions from
>> libdevice into the IR it sends to llvm.  If it's the latter, we could
>> always copy *all* of libdevice into the generated IR.  But I hope we
>> could do better.
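>>
>> If it turns out we need all of libdevice visible, linking it into
>> the module looks something like this (a sketch; the Linker API and
>> the LinkOnlyNeeded flag may differ across llvm versions):
>>
>>   // M is our module; LibDevice is the loaded libdevice bitcode.
>>   Linker::linkModules(M, std::move(LibDevice),
>>                       Linker::Flags::LinkOnlyNeeded);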
>>
>> I think this proposal lets us solve all three problems above, but
>> again I may well be missing something, or there may be a more
>> canonical way to do this.  Any feedback is appreciated!
>>
>> -Justin
>>
>> [1]
>> http://docs.nvidia.com/cuda/libdevice-users-guide/__nv_sin.html#__nv_sin
>> [2]
>> https://github.com/llvm-mirror/clang/blob/e2636ac0bad65451c3eb6272d7ab3abbba96da17/lib/CodeGen/CGBuiltin.cpp#L1971
>> [3]
>> https://github.com/llvm-mirror/llvm/blob/b3bc79d5556108307026be07e7eaa644cce041ab/lib/Transforms/Utils/SimplifyLibCalls.cpp#L1339
>> _______________________________________________
>> cfe-dev mailing list
>> cfe-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
>>
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory


