[PATCH] D157437: AMDGPU: Expand remaining system atomic operations

Wed Aug 16 05:26:24 PDT 2023

arsenm added a comment.

In D157437#4572665 <https://reviews.llvm.org/D157437#4572665>, @JonChesterfield wrote:

> In D157437#4571594 <https://reviews.llvm.org/D157437#4571594>, @yaxunl wrote:
>
>> If users make sure they only use certain memories, then they can specify suitable options for their usecase:
>>
>> corse-grained memory only: -unsafe-fp-atomics -unsafe-pcie-integer-atomics
>> fine-grained memory with PCIE: 
>> fine-grained memory with XGMI: -unsafe-pcie-integer-atomics
>
> I'm not totally sure I'm following the setup here. It sounds like someone writes fetch_or to a global in addrspace(1) and gets it expanded to CAS unless they also pass a scary command line flag which will miscompile other atomic operations which do actually need the expansion.

We kind of need to have a parameter or set of atomic functions which correspond to the various restrictions, that end up emitting some kind of qualifier on regular atomicrmw. The current unsafe-fp-atomics flag is bogus, it came to mean too many things (both ignore memory safety, and ignore FP mode mismatches) and the core issue has absolutely nothing to do with floating point (there is a dimension of floating point issues, which is also underspecified) but we can ignore that part for the moment. The corresponding attribute also breaks on inlining, such that unsafe code can infect safe code.

> We could use 'system' scope to mean 'flat CAS instruction' but as pointed out above that'll kill performance on non-pcie systems, plus it's really easy to emit system scope instructions unintentionally since that's the default.

Yes, that is what we must do. Doesn't matter what the performance, an undecorated atomic should always work no matter what.

> Is the root problem that addrspace annotation is inaccurate? It seems that architecture + which memory contains the variable is exactly sufficient to pick the instruction, at which point I don't see what the command line flags would be for.

This isn't covered by address space, it's possibly should be a few new scopes. If not a scope, some other per-instruction metadata.

> @jhuber6 I think libc uses system scope everywhere because the only intrinsics that give access to device scope are opencl ones with constraints on the right type of atomic type argument. If it expands fetch_or to CAS we're losing performance.

These just need to thread a choice through. I think the openmp device RTL isn't threading through the appropriate scope either

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D157437/new/

https://reviews.llvm.org/D157437