[llvm] [NVPTX] Stop using 16-bit CAS instructions from PTX (PR #120220)

Wed Jan 22 11:25:11 PST 2025

akshayrdeodhar wrote:

Discussed this with @gonzalobg.

Whether a YIELD is required to guarantee forward progress on some GPUs depends on the algorithm the code is executing, and ptxas inserts these YIELDs as needed. A single isolated 16-bit atomic CAS needs no YIELD. ptxas is aware of this and avoids the YIELD when using atom.cas.b16, but currently does not avoid it when the operation is emulated using atom.cas.b32. This PR therefore introduces a slight regression for this synthetic case. We'll be improving this in a future PR (it requires some more changes on our end, but we are working to get there).

In practice, CAS operations are not performed in isolation (e.g. they are very common as part of CAS loops). When a CAS loop uses atom.cas.b16, a YIELD _is_ generated, same as with atom.cas.b32. However, since LLVM generates an emulation loop, using 32-bit CAS keeps it as a single emulation loop, while using atom.cas.b16 within the CAS loop results in two nested loops. So, for this common use case, performing the emulation directly in LLVM with 32-bit CAS results in better codegen than generating these CAS loops using atom.cas.b16.

For example, if LLVM expands atomicrmw add i16 with 16-bit CAS, the SASS will contain nested loops, because ptxas expands atom.cas.b16 into a loop that uses atom.cas.b32. If it is expanded using 32-bit CAS instead, we avoid the nesting and end up with much better SASS (even though the PTX looks worse).
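To illustrate the single-loop shape, here is a minimal host-side C++ sketch of a 16-bit atomic add built from a 32-bit CAS on the containing word. It is not the code LLVM or ptxas actually emits: the function name `fetch_add_u16_via_cas32`, the `lane` parameter, and the chosen memory orders are assumptions made for illustration only.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper (illustration only): atomically adds `value` to one
// 16-bit half of a 32-bit word using only 32-bit CAS.
// `lane` selects the half: 0 = low 16 bits, 1 = high 16 bits.
// Returns the previous 16-bit value of that half.
inline std::uint16_t fetch_add_u16_via_cas32(std::atomic<std::uint32_t>& word,
                                             unsigned lane,
                                             std::uint16_t value) {
    const unsigned shift = (lane & 1u) * 16u;
    const std::uint32_t mask = std::uint32_t{0xFFFF} << shift;

    std::uint32_t old_word = word.load(std::memory_order_relaxed);
    for (;;) {
        // Extract the 16-bit half, apply the add (wrapping modulo 2^16),
        // and splice the result back without touching the other half.
        const std::uint16_t old_half =
            static_cast<std::uint16_t>((old_word & mask) >> shift);
        const std::uint16_t new_half =
            static_cast<std::uint16_t>(old_half + value);
        const std::uint32_t new_word =
            (old_word & ~mask) | (std::uint32_t{new_half} << shift);

        // One loop serves both purposes: retrying the read-modify-write and
        // emulating the narrow access. On failure, old_word is refreshed
        // with the current contents of the word.
        if (word.compare_exchange_weak(old_word, new_word,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
            return old_half;
        }
    }
}
```

If the inner step were instead a 16-bit CAS that is itself implemented as a retry loop over a 32-bit CAS, the same operation would contain two nested loops, which is the situation described above for atom.cas.b16.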

https://github.com/llvm/llvm-project/pull/120220