[clang] [llvm] [AMDGPU] Track tensor load/store DMAs with asyncmark (PR #200775)

Thu Jun 4 03:25:43 PDT 2026

================
@@ -615,3 +615,38 @@ scale variant.
   matrix A or B data can be reused from a previous WMMA instruction.
 }];
 }
+
+//===----------------------------------------------------------------------===//
+// Tensor DMA Builtins
+//===----------------------------------------------------------------------===//
+
+def DocCatTensorDMA : DocumentationCategory<"Tensor DMA Builtins"> {
+  let Content = [{
+Asynchronous tensor DMA transfers between global memory and LDS, tracked by
+the ``TENSOR_CNT`` hardware counter. The caller must order the transfer
+against later LDS accesses, either via ``s_wait_tensorcnt`` or via
+``__builtin_amdgcn_wait_asyncmark``. See the LLVM ``AMDGPUAsyncOperations``
+document for the async-operation model.
+}];
----------------
ssahasra wrote:

I don't think we should mention ``TENSOR_CNT`` at all, but I am okay if we do. Instead of specifying any particular "later accesses", just say "track completion using ``TENSOR_CNT`` or *asyncmarks*". If this is RST, then *asyncmarks* can be linked to the correct doc using `:ref:`.

https://github.com/llvm/llvm-project/pull/200775