[llvm] [NVPTX] Stop using 16-bit CAS instructions from PTX (PR #120220)

Wed Jan 15 14:14:16 PST 2025

akshayrdeodhar wrote:

> > On the flipside, this degrades codegen for 16-bit cmpxchg: https://godbolt.org/z/rMbfc33dz
> 
> Interesting. One thing I notice is that the 32-bit cmpxchg loop issues a `YIELD` on each iteration, while `atom.cas.b16` does not. Is `YIELD` required to guarantee forward progress for all threads in a warp?
> 
> E.g. if the atomic var is constantly changed by some other CM, and we're unlucky enough to have cmpxchg failing all the time, will the 32-bit version allow other threads in the warp to progress, while the 16-bit one keeps them stuck?

The yield is an eccentricity of the assembler: it is emitted because the generated PTX contains a loop. I believe it isn't necessary for correctness here. I will follow up with a change that gets rid of the per-iteration yield (it will probably take a while, because it depends on PTX features). There is no yield for `atom.cas.b16` because there is no emulation loop in the PTX.
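
For readers unfamiliar with the "emulation loop" mentioned above, here is a rough host-side C++ sketch of the general technique: a 16-bit compare-and-swap built from a 32-bit CAS on the containing word. It is not the actual NVPTX lowering. The helper name `cas16_via_cas32` and its `half` parameter are hypothetical, and `std::atomic` stands in for the device-side atomics.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only, not LLVM's generated code):
// emulate a 16-bit CAS with a 32-bit CAS on the containing word.
// `half` selects the low (0) or high (1) 16 bits of `word`.
// Returns the 16-bit value observed before the operation.
inline uint16_t cas16_via_cas32(std::atomic<uint32_t>& word, unsigned half,
                                uint16_t expected, uint16_t desired) {
    assert(half < 2);
    const unsigned shift = half * 16;
    const uint32_t mask = 0xFFFFu << shift;

    uint32_t old_word = word.load(std::memory_order_relaxed);
    for (;;) {
        const uint16_t old_half =
            static_cast<uint16_t>((old_word & mask) >> shift);
        if (old_half != expected)
            return old_half;  // genuine failure: our half does not match

        // Splice the desired value into our half, keep the neighbouring half.
        const uint32_t new_word =
            (old_word & ~mask) | (static_cast<uint32_t>(desired) << shift);

        if (word.compare_exchange_weak(old_word, new_word,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed))
            return old_half;  // success

        // The 32-bit CAS failed (the word changed, possibly only in the
        // neighbouring half, or the failure was spurious). `old_word` now
        // holds the fresh value, so retry. This retry is the loop that a
        // native 16-bit CAS does not need.
    }
}
```

The point of the sketch is only that the 32-bit path has to retry when the neighbouring half changes, so it needs a loop, whereas a single native 16-bit CAS instruction does not.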

https://github.com/llvm/llvm-project/pull/120220