[llvm] [NVPTX] Select bfloat16 add/mul/sub as fma on SM80 (PR #121065)

Artem Belevich via llvm-commits llvm-commits at lists.llvm.org
Thu Jan 9 15:48:35 PST 2025


Artem-B wrote:

> The spec says
> 
> > `add{.rnd}.bf16` and `add{.rnd}.bf16x2` requires `sm_90` or higher.
> 
> I don't see any suggestion that that only applies for specific PTX versions.

`sm_90` is only supported by PTX 7.8 and newer:
![image](https://github.com/user-attachments/assets/c859abe8-af63-4d8e-97b9-eb7ec5ba3fe5)

FMA instruction for bf16 types requires PTX 7.0 and sm_80: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#half-precision-floating-point-instructions-fma

FADD/FMUL for bf16 require PTX 7.8 and sm_90: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#half-precision-floating-point-instructions-mul

So, the only cases where we can benefit from this patch are:
* PTX >= 7.0 and GPU >= sm_80 (otherwise, there's no BF16 FMA support)
* PTX < 7.8 (otherwise native FADD/FMUL is available, and we don't need the patch).

sm_90 and newer GPUs require PTX 7.8 or newer, which already provides native FADD/FMUL, so they do not benefit from the patch.
What's left is sm_80 and PTX versions 7.0 through 7.7.
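
To make that remaining window concrete, here is a minimal IR sketch (the function name and llc flags are illustrative, not taken from the patch). On sm_80 with PTX 7.0 through 7.7 there is no `add.rn.bf16` to select, so with this patch the `fadd` below would be selected as `fma.rn.bf16` (conceptually `a * 1.0 + b`) instead of being legalized some other way (e.g. via f32):

```llvm
; A minimal sketch; build with something like:
;   llc -mtriple=nvptx64-nvidia-cuda -mcpu=sm_80 -mattr=+ptx71 bf16-add.ll -o -
; With the patch, the fadd should be selected as fma.rn.bf16 (a * 1.0 + b),
; since add.rn.bf16 itself requires PTX 7.8 / sm_90.
define bfloat @bf16_add(bfloat %a, bfloat %b) {
  %sum = fadd bfloat %a, %b
  ret bfloat %sum
}
```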

What am I missing?



https://github.com/llvm/llvm-project/pull/121065

