Artem-B wrote: It looks like, under the hood, 16-bit atomic ops end up being done via a 32-bit CAS: https://godbolt.org/z/8zPdGxfWM I wish the PTX docs came with more explicit info on what the instructions actually do on a given GPU variant. https://github.com/llvm/llvm-project/pull/120220
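
For readers unfamiliar with the pattern, here is a rough host-side C++ sketch of how a 16-bit atomic add can be emulated with a 32-bit CAS on the containing word. It is only an illustration of the general technique, not the code the compiler or the GPU actually produces; the helper name `fetch_add_u16_via_cas32` and the `lane` parameter are made up for this example.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical helper (not from LLVM or CUDA): atomically add `val` to one
// 16-bit half of a 32-bit word using only a 32-bit compare-and-swap.
// lane 0 selects the low half, lane 1 the high half.
// Returns the previous value of the selected half.
std::uint16_t fetch_add_u16_via_cas32(std::atomic<std::uint32_t>& word,
                                      unsigned lane,
                                      std::uint16_t val) {
    const unsigned shift = (lane & 1u) * 16u;
    const std::uint32_t mask = std::uint32_t{0xFFFF} << shift;

    std::uint32_t old_word = word.load();
    for (;;) {
        // Extract the 16-bit half we care about and compute its new value.
        const std::uint16_t old_half =
            static_cast<std::uint16_t>((old_word & mask) >> shift);
        const std::uint16_t new_half =
            static_cast<std::uint16_t>(old_half + val);  // wraps at 16 bits

        // Splice the new half back in, leaving the neighbouring half untouched.
        const std::uint32_t new_word =
            (old_word & ~mask) | (static_cast<std::uint32_t>(new_half) << shift);

        // On failure, compare_exchange_weak reloads old_word, so we just retry.
        if (word.compare_exchange_weak(old_word, new_word)) {
            return old_half;
        }
    }
}
```

The loop retries whenever any part of the 32-bit word changes, including the neighbouring 16-bit half, which is one reason the emulated op can be slower under contention than a native 16-bit atomic would be.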