[clang] [llvm] [AMDGPU] Introduce asyncmark/wait intrinsics (PR #173259)

Tue Jan 6 07:45:24 PST 2026

================
@@ -6594,12 +6594,18 @@ operations.
 ``buffer/global/flat_load/store/atomic`` instructions to global memory are
 termed vector memory operations.
 
+.. _amdgpu-lds-dma-bits:
+
 ``global_load_lds`` or ``buffer/global_load`` instructions with the `lds` flag
 are LDS DMA loads. They interact with caches as if the loaded data were
 being loaded to registers and not to LDS, and so therefore support the same
-cache modifiers. They cannot be performed atomically. They implement volatile
-(via aux/cpol bit 31) and nontemporal (via metadata) as if they were loads
-from the global address space.
+cache modifiers. They cannot be performed atomically. They can be performed with
+asynchronous, volatile and nontemporal semantics as if they were loads
+from the global address space:
+
+- asynchronous: aux bit u0x40000000
+- volatile: aux bit u0x20
----------------
jayfoad wrote:

Nit: I'd strongly recommend using standard `0x` instead of IR's weird `u0x` prefix in documentation.

https://github.com/llvm/llvm-project/pull/173259