[Mlir-commits] [mlir] [MLIR][NVVM] Move the docs to a markdown file (PR #168375)
Guray Ozen
llvmlistbot at llvm.org
Mon Nov 17 06:17:49 PST 2025
https://github.com/grypp created https://github.com/llvm/llvm-project/pull/168375
None
From bbf17fdd1569106509f62db2002be3979529f4c5 Mon Sep 17 00:00:00 2001
From: Guray Ozen <gozen at nvidia.com>
Date: Fri, 14 Nov 2025 14:38:39 +0100
Subject: [PATCH 1/3] [MLIR][NVVM] Document memory spaces in the dialect description
---
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td | 38 +++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 1cc5b74a3cb67..a7d0d05dd8192 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -79,6 +79,44 @@ def NVVM_Dialect : Dialect {
sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
embed PTX inline as a last-resort escape hatch, with explicit operands and
results.
+
+
+ **Memory Spaces:** The NVVM dialect introduces the following memory spaces,
+ each with distinct scopes and lifetimes:
+
+ | Memory Space | Scope | Lifetime |
+ |-------------------|----------------------|-------------------|
+ | `generic` | All threads | Context-dependent |
+ | `global` | All threads (device) | Application |
+ | `shared` | Thread block (CTA) | Kernel execution |
+ | `constant` | All threads (RO) | Application |
+ | `local` | Single thread | Kernel execution |
+ | `tensor` | Thread block (CTA) | Kernel execution |
+ | `shared_cluster` | Thread block cluster | Kernel execution |
+
+ **Memory Space Details:**
+ - **generic**: Can point to any memory space; requires runtime resolution of
+ actual address space. Use when pointer origin is unknown at compile time.
+ Performance varies based on the underlying memory space.
+ - **global**: Accessible by all threads across all blocks; persists across
+ kernel launches. Highest latency but largest capacity (device memory). Best
+ for large data and inter-kernel communication.
+ - **shared**: Shared within a thread block (CTA); very fast on-chip memory for
+ cooperation between threads in the same block. Limited capacity. Ideal for
+ block-level collaboration, caching, and reducing global memory traffic.
+ - **constant**: Read-only memory cached per SM; optimized for broadcast
+ patterns where all threads access the same location. Fast access when cached.
+ Size typically limited to 64KB. Best for read-only data and uniform values
+ accessed by all threads.
+ - **local**: Private to each thread; used for stack frames and register spills.
+ Actually resides in global memory but cached in L1. Use for per-thread
+ private data and automatic variables that don't fit in registers.
+ - **tensor**: Special memory space for Tensor Memory Accelerator (TMA)
+ operations on SM 80+ architectures; used with async tensor operations and
+ wgmma instructions. Provides very fast access for matrix operations.
+ - **shared_cluster**: Shared across thread blocks within a cluster (SM 90+);
+ enables collaboration beyond single-block scope with distributed shared
+ memory. Fast access across cluster threads.
}];
let name = "nvvm";
From 891547d701dc50de08d77a5fe673bf0e8bc1dc0f Mon Sep 17 00:00:00 2001
From: Guray Ozen <gozen at nvidia.com>
Date: Mon, 17 Nov 2025 15:07:06 +0100
Subject: [PATCH 2/3] Fix memory space table and details
---
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td | 40 ++++++++++-----------
1 file changed, 18 insertions(+), 22 deletions(-)
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index a7d0d05dd8192..119c8caa32c31 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -84,15 +84,15 @@ def NVVM_Dialect : Dialect {
**Memory Spaces:** The NVVM dialect introduces the following memory spaces,
each with distinct scopes and lifetimes:
- | Memory Space | Scope | Lifetime |
- |-------------------|----------------------|-------------------|
- | `generic` | All threads | Context-dependent |
- | `global` | All threads (device) | Application |
- | `shared` | Thread block (CTA) | Kernel execution |
- | `constant` | All threads (RO) | Application |
- | `local` | Single thread | Kernel execution |
- | `tensor` | Thread block (CTA) | Kernel execution |
- | `shared_cluster` | Thread block cluster | Kernel execution |
+ | Memory Space | Address Space | Scope | Lifetime |
+ |-------------------|---------------|----------------------|-------------------|
+ | `generic` | 0 | All threads | Context-dependent |
+ | `global` | 1 | All threads (device) | Application |
+ | `shared` | 3 | Thread block (CTA) | Kernel execution |
+ | `constant` | 4 | All threads (RO) | Application |
+ | `local` | 5 | Single thread | Kernel execution |
+ | `tensor` | 6 | Thread block (CTA) | Kernel execution |
+ | `shared_cluster` | 7 | Thread block cluster | Kernel execution |
**Memory Space Details:**
- **generic**: Can point to any memory space; requires runtime resolution of
@@ -104,19 +104,15 @@ def NVVM_Dialect : Dialect {
- **shared**: Shared within a thread block (CTA); very fast on-chip memory for
cooperation between threads in the same block. Limited capacity. Ideal for
block-level collaboration, caching, and reducing global memory traffic.
- - **constant**: Read-only memory cached per SM; optimized for broadcast
- patterns where all threads access the same location. Fast access when cached.
- Size typically limited to 64KB. Best for read-only data and uniform values
- accessed by all threads.
- - **local**: Private to each thread; used for stack frames and register spills.
- Actually resides in global memory but cached in L1. Use for per-thread
- private data and automatic variables that don't fit in registers.
- - **tensor**: Special memory space for Tensor Memory Accelerator (TMA)
- operations on SM 80+ architectures; used with async tensor operations and
- wgmma instructions. Provides very fast access for matrix operations.
- - **shared_cluster**: Shared across thread blocks within a cluster (SM 90+);
- enables collaboration beyond single-block scope with distributed shared
- memory. Fast access across cluster threads.
+ - **constant**: Read-only memory cached per SM. Size typically limited to
+ 64KB. Best for read-only data and uniform values accessed by all threads.
+ - **local**: Private to each thread. Use for per-thread private data and
+ automatic variables that don't fit in registers.
+ - **tensor**: Special memory space for tensor core operations. Used by
+ `tcgen05` instructions on SM 100+ for tensor input/output operations.
+ - **shared_cluster**: Distributed shared memory across thread blocks within
+ a cluster (SM 90+). Enables collaboration beyond single-block scope with
+ fast access across cluster threads.
}];
let name = "nvvm";
From 8fa3c707c7c7cf4d1ffd5c861c248387dfd8d9c5 Mon Sep 17 00:00:00 2001
From: Guray Ozen <gozen at nvidia.com>
Date: Mon, 17 Nov 2025 15:17:10 +0100
Subject: [PATCH 3/3] [MLIR][NVVM] Move the docs to a markdown file
---
mlir/docs/Dialects/NVVM/_index.md | 84 +++++++++++++++++++++
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td | 77 -------------------
2 files changed, 84 insertions(+), 77 deletions(-)
create mode 100644 mlir/docs/Dialects/NVVM/_index.md
diff --git a/mlir/docs/Dialects/NVVM/_index.md b/mlir/docs/Dialects/NVVM/_index.md
new file mode 100644
index 0000000000000..d8b64cda6054b
--- /dev/null
+++ b/mlir/docs/Dialects/NVVM/_index.md
@@ -0,0 +1,84 @@
+# NVVM Dialect
+
+The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
+models NVVM intrinsics and public ISA functionality and introduces NVIDIA
+extensions to the MLIR/LLVM type system and address spaces (e.g., global,
+shared, and cluster memory), enabling faithful lowering of GPU kernels to the
+NVPTX toolchain. While an NVVM op usually maps to a single LLVM IR intrinsic,
+the NVVM dialect uses type polymorphism and other attributes so that a single
+NVVM op can map to different LLVM intrinsics.
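+
+For example, the `nvvm.shfl.sync` op selects among the
+`llvm.nvvm.shfl.sync.*` intrinsics based on its kind attribute and operand
+type (an illustrative sketch; the SSA names are placeholders):
+
+```mlir
+// Butterfly shuffle of an i32 value: lowers to llvm.nvvm.shfl.sync.bfly.i32.
+// The same op with f32 operands lowers to llvm.nvvm.shfl.sync.bfly.f32.
+%r = nvvm.shfl.sync bfly %mask, %val, %offset, %clamp : i32 -> i32
+```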
+
+## Scope and Capabilities
+
+The dialect covers core GPU features such as thread/block builtins, barriers
+and atomics, warp-level collectives (e.g., shuffle/vote), matrix/tensor core
+operations (e.g., `mma.sync`, `wgmma`), tensor memory accelerator (TMA)
+operations, asynchronous copies (`cp.async`, bulk/tensor variants) with memory
+barriers, cache and prefetch controls, and NVVM-specific attributes and enums
+(e.g., FP rounding modes, memory scopes, and MMA types/layouts).
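+
+A few of these ops in MLIR assembly (a minimal sketch; the SSA names are
+placeholders):
+
+```mlir
+// Thread and block coordinates map to PTX special-register reads.
+%tid   = nvvm.read.ptx.sreg.tid.x : i32
+%ctaid = nvvm.read.ptx.sreg.ctaid.x : i32
+// Block-level barrier, equivalent to CUDA's __syncthreads().
+nvvm.barrier0
+```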
+
+## Placement in the Lowering Pipeline
+
+NVVM sits below target-agnostic dialects like `gpu` and NVIDIA's `nvgpu`.
+Typical pipelines convert `gpu`/`nvgpu` ops into NVVM using
+`-convert-gpu-to-nvvm` and `-convert-nvgpu-to-nvvm`, then translate to LLVM IR
+for final code generation via the NVPTX backend.
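+
+For instance, a minimal invocation could look like the following (illustrative
+only; real pipelines typically run additional passes such as kernel
+outlining):
+
+```sh
+mlir-opt kernel.mlir \
+  -convert-nvgpu-to-nvvm \
+  -convert-gpu-to-nvvm \
+  -gpu-module-to-binary
+```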
+
+## Target Configuration and Serialization
+
+NVVM provides a `#nvvm.target` attribute to describe the GPU target (SM,
+features, and flags). In conjunction with `gpu` serialization (e.g.,
+`gpu-module-to-binary`), this enables producing architecture-specific GPU
+binaries (such as CUBIN) from nested GPU modules.
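+
+For example, a GPU module can carry the target attribute directly (a sketch;
+the chip and feature strings are placeholders):
+
+```mlir
+// Serialize the contents of this module for an sm_90 target.
+gpu.module @kernels [#nvvm.target<O = 3, chip = "sm_90", features = "+ptx80">] {
+  // ... kernel functions ...
+}
+```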
+
+## Inline PTX
+
+When an intrinsic is unavailable or a performance-critical sequence must be
+expressed directly, NVVM provides an `nvvm.inline_ptx` op to embed PTX inline
+as a last-resort escape hatch, with explicit operands and results.
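+
+A hypothetical sketch (the exact assembly syntax is defined by the op; the
+PTX string and operand types here are assumptions for illustration):
+
+```mlir
+// Embed one PTX instruction with an explicit result ($0) and operand ($1).
+%r = nvvm.inline_ptx "mul.lo.s32 $0, $1, $1;" (%x) : (i32) -> i32
+```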
+
+## Memory Spaces
+
+The NVVM dialect introduces the following memory spaces, each with distinct
+scopes and lifetimes:
+
+| Memory Space | Address Space | Scope | Lifetime |
+|-------------------|---------------|----------------------|-------------------|
+| `generic` | 0 | All threads | Context-dependent |
+| `global` | 1 | All threads (device) | Application |
+| `shared` | 3 | Thread block (CTA) | Kernel execution |
+| `constant` | 4 | All threads (RO) | Application |
+| `local` | 5 | Single thread | Kernel execution |
+| `tensor` | 6 | Thread block (CTA) | Kernel execution |
+| `shared_cluster` | 7 | Thread block cluster | Kernel execution |
+
+### Memory Space Details
+
+- **generic**: Can point to any memory space; requires runtime resolution of
+ actual address space. Use when pointer origin is unknown at compile time.
+ Performance varies based on the underlying memory space.
+- **global**: Accessible by all threads across all blocks; persists across
+ kernel launches. Highest latency but largest capacity (device memory). Best
+ for large data and inter-kernel communication.
+- **shared**: Shared within a thread block (CTA); very fast on-chip memory for
+ cooperation between threads in the same block. Limited capacity. Ideal for
+ block-level collaboration, caching, and reducing global memory traffic.
+- **constant**: Read-only memory cached per SM. Size typically limited to 64KB.
+ Best for read-only data and uniform values accessed by all threads.
+- **local**: Private to each thread. Use for per-thread private data and
+ automatic variables that don't fit in registers.
+- **tensor**: Special memory space for tensor core operations. Used by
+ `tcgen05` instructions on SM 100+ for tensor input/output operations.
+- **shared_cluster**: Distributed shared memory across thread blocks within a
+ cluster (SM 90+). Enables collaboration beyond single-block scope with fast
+ access across cluster threads.
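+
+These spaces surface in the LLVM dialect as numbered pointer address spaces.
+A minimal sketch (the SSA names are placeholders):
+
+```mlir
+// %shared points into shared memory (address space 3); casting to the
+// generic space lets the pointer be used by address-space-agnostic code.
+%generic = llvm.addrspacecast %shared : !llvm.ptr<3> to !llvm.ptr
+```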
+
+## Non-Goals
+
+NVVM is not a place for convenience or "wrapper" ops. It is not intended to
+introduce high-level ops that expand into multiple unrelated NVVM intrinsics or
+that lower to no intrinsic at all. Such abstractions belong in higher-level
+dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects). The design
+intent is a thin, predictable, low-level surface with near-mechanical lowering
+to NVVM/LLVM IR.
\ No newline at end of file
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 119c8caa32c31..627a6bfe21eef 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -38,83 +38,6 @@ def LLVM_PointerSharedCluster : LLVM_PointerInAddressSpace<7>;
def NVVM_Dialect : Dialect {
let summary = "The NVVM dialect that models NVIDIA's public ISA";
-
- let description = [{
- The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
- models NVVM intrinsics and public ISA functionality and introduces NVIDIA
- extensions to the MLIR/LLVM type system and address spaces (e.g., global,
- shared, and cluster memory), enabling faithful lowering of GPU kernels to the
- NVPTX toolchain. While a NVVM op usually maps to a single LLVM IR intrinsic,
- the NVVM dialect uses type polymorphism and other attributes so that a single
- NVVM op can map to different LLVM intrinsics.
-
- **Scope and capabilities:** The dialect covers core GPU features such as
- thread/block builtins, barriers and atomics, warp-level collectives (e.g.,
- shuffle/vote), matrix/tensor core operations (e.g., `mma.sync`, `wgmma`),
- tensor memory accelerator (TMA) operations, asynchronous copies (`cp.async`,
- bulk/tensor variants) with memory barriers, cache and prefetch controls, and
- NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes,
- and MMA types/layouts).
-
- **Non-goals:** NVVM is not a place for convenience or “wrapper” ops. It is
- not intended to introduce high-level ops that expand into multiple unrelated
- NVVM intrinsics or that lower to no intrinsic at all. Such abstractions belong
- in higher-level dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects).
- The design intent is a thin, predictable, low-level surface with
- near-mechanical lowering to NVVM/LLVM IR.
-
- **Placement in the lowering pipeline:** NVVM sits below target-agnostic
- dialects like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert
- `gpu`/`nvgpu` ops into NVVM using `-convert-gpu-to-nvvm` and
- `-convert-nvgpu-to-nvvm`, then translate into LLVM for final code
- generation via NVPTX backend.
-
- **Target configuration and serialization:** NVVM provides a `#nvvm.target`
- attribute to describe the GPU target (SM, features, and flags). In
- conjunction with `gpu` serialization (e.g., `gpu-module-to-binary`), this
- enables producing architecture-specific GPU binaries (such as CUBIN) from
- nested GPU modules.
-
- **Inline PTX:** When an intrinsic is unavailable or a performance-critical
- sequence must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to
- embed PTX inline as a last-resort escape hatch, with explicit operands and
- results.
-
-
- **Memory Spaces:** The NVVM dialect introduces the following memory spaces,
- each with distinct scopes and lifetimes:
-
- | Memory Space | Address Space | Scope | Lifetime |
- |-------------------|---------------|----------------------|-------------------|
- | `generic` | 0 | All threads | Context-dependent |
- | `global` | 1 | All threads (device) | Application |
- | `shared` | 3 | Thread block (CTA) | Kernel execution |
- | `constant` | 4 | All threads (RO) | Application |
- | `local` | 5 | Single thread | Kernel execution |
- | `tensor` | 6 | Thread block (CTA) | Kernel execution |
- | `shared_cluster` | 7 | Thread block cluster | Kernel execution |
-
- **Memory Space Details:**
- - **generic**: Can point to any memory space; requires runtime resolution of
- actual address space. Use when pointer origin is unknown at compile time.
- Performance varies based on the underlying memory space.
- - **global**: Accessible by all threads across all blocks; persists across
- kernel launches. Highest latency but largest capacity (device memory). Best
- for large data and inter-kernel communication.
- - **shared**: Shared within a thread block (CTA); very fast on-chip memory for
- cooperation between threads in the same block. Limited capacity. Ideal for
- block-level collaboration, caching, and reducing global memory traffic.
- - **constant**: Read-only memory cached per SM. Size typically limited to
- 64KB. Best for read-only data and uniform values accessed by all threads.
- - **local**: Private to each thread. Use for per-thread private data and
- automatic variables that don't fit in registers.
- - **tensor**: Special memory space for tensor core operations. Used by
- `tcgen05` instructions on SM 100+ for tensor input/output operations.
- - **shared_cluster**: Distributed shared memory across thread blocks within
- a cluster (SM 90+). Enables collaboration beyond single-block scope with
- fast access across cluster threads.
- }];
-
let name = "nvvm";
let cppNamespace = "::mlir::NVVM";
let dependentDialects = ["LLVM::LLVMDialect"];