https://github.com/Artem-B approved this pull request.

LGTM. I've checked how far back ptxas supports FP operations on .b32/.b64 registers, and it appears to work in CUDA versions as old as 9.1: https://godbolt.org/z/nbvPe57dc

https://github.com/llvm/llvm-project/pull/140487