[clang] [llvm] [AMDGPU] Track tensor load/store DMAs with asyncmark (PR #200775)

Thu Jun 4 03:25:43 PDT 2026

================
@@ -50,6 +50,19 @@ memory and LDS memory.
   void @llvm.amdgcn.global.store.async.from.lds.type(ptr %dst, ptr %src)
   void @llvm.amdgcn.cluster.load.async.to.lds.type(ptr %dst, ptr %src)
 
+**GFX1250 Tensor DMA Instructions**
+
+.. code-block:: llvm
+
+  void @llvm.amdgcn.tensor.load.to.lds(...)
+  void @llvm.amdgcn.tensor.store.from.lds(...)
+
+These intrinsics are asynchronous despite the absence of ``async`` in their
+names. They are tracked by the ``TENSOR_CNT`` hardware counter and participate
+in the ``asyncmark`` / ``wait.asyncmark`` framework just like the intrinsics
+above. Equivalently, the caller may issue an explicit ``s_wait_tensorcnt``
+instead of using ``asyncmark`` / ``wait.asyncmark``.
----------------
ssahasra wrote:

Remove this whole paragraph. Too much information. The whole point of `asyncmark` is to abstract away details like `TENSOR_CNT`. If users need the old way of doing things, they will have to go read the ISA doc for that.

https://github.com/llvm/llvm-project/pull/200775