[clang] [Clang] Support fp16 in libdevice for CUDA 13.3 (PR #174005)

Yonah Goldberg via cfe-commits cfe-commits at lists.llvm.org
Wed Jan 7 13:44:09 PST 2026


YonahGoldberg wrote:

Let me give a little more context here. Currently, as you pointed out, the small FP types in CUDA are just integer wrappers: `__half` is just `struct { unsigned short }` and `__half2` is just `struct { unsigned short; unsigned short }`. `cuda_fp16.hpp` implements arithmetic on these types (mostly) by casting to unsigned short/unsigned int and then calling inline PTX.

There are a few reasons for moving the implementation of this arithmetic to libdevice:
1. float and double arithmetic already lives there, so this makes the implementation more uniform
2. MLIR code targeting LLVM can call into the fp16 functions
3. We'd like to move the implementations away from inline PTX in the future to native LLVM half + intrinsics. Unfortunately, as you pointed out, CUDA doesn't have native fp16 support, but we can link with libdevice, which can have native fp16 implementations.

As you can see, the builtins take in unsigned short/unsigned int to stay compatible with the CUDA layer, but cast to half/2 x half internally, because we want the libdevice API to operate on those types.

> It would be great if we could just build libdevice from source (or incorporate those sources into clang headers, if they were released under acceptable license). That would be an improvement over the current binary IR blob + handwritten wrappers/declarations mess. We've had typos, we've had missing functions, we've had missed optimizations, the list goes on...

Yeah I think I agree with this. IP-wise I don't think there's anything stopping us from distributing the source. I can ask about that.

https://github.com/llvm/llvm-project/pull/174005

