[Mlir-commits] [mlir] [MLIR][NVVM] Add definition for nvvm dialect (PR #156807)

llvmlistbot at llvm.org llvmlistbot at llvm.org
Wed Sep 3 23:05:03 PDT 2025


llvmbot wrote:


<!--LLVM PR SUMMARY COMMENT-->

@llvm/pr-subscribers-mlir-llvm

Author: Guray Ozen (grypp)

<details>
<summary>Changes</summary>



---
Full diff: https://github.com/llvm/llvm-project/pull/156807.diff


1 File Affected:

- (modified) mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td (+42-2) 


``````````diff
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 8537c7030aa8f..3dd6baa058d55 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -35,6 +35,46 @@ def LLVM_PointerSharedCluster : LLVM_PointerInAddressSpace<7>;
 //===----------------------------------------------------------------------===//
 
 def NVVM_Dialect : Dialect {
+  let summary = "The NVVM dialect that models NVIDIA's public ISA";
+
+  let description = [{
+  The NVVM dialect is MLIR's LLVM-IR-based, NVIDIA-specific backend dialect. It
+  models NVVM intrinsics and public ISA functionality and introduces NVIDIA
+  extensions to the LLVM type system and address spaces (e.g., global, shared,
+  and cluster memory), enabling faithful lowering of GPU kernels to the NVPTX
+  toolchain. Many ops map one-to-many onto NVVM/PTX: a single overloaded op can
+  lower to several intrinsics, with the one emitted selected by its operand types.
+
+  **Scope and capabilities:** The dialect covers core GPU features such as
+  thread/block builtins, barriers and atomics, warp-level collectives (e.g.,
+  shuffle/vote), matrix/tensor core operations (e.g., `mma.sync`, `wgmma`),
+  tensor memory accelerator (TMA) operations, asynchronous copies (`cp.async`,
+  bulk/tensor variants) with memory barriers, cache and prefetch controls, and
+  NVVM-specific attributes and enums (e.g., FP rounding modes, memory scopes, and
+  MMA types/layouts).
+
+  **Non-goals:** NVVM is not a place for convenience or "wrapper" ops. It is not
+  intended to introduce high-level ops that expand into multiple unrelated NVVM
+  intrinsics or that lower to no intrinsic at all. Such abstractions belong in
+  higher-level dialects (e.g., `nvgpu`, `gpu`, or project-specific dialects). The
+  design intent is a thin, predictable, low-level surface with near-mechanical
+  lowering to NVVM/LLVM IR.
+
+  **Placement in the lowering pipeline:** NVVM sits below target-agnostic dialects
+  like `gpu` and NVIDIA's `nvgpu`. Typical pipelines convert `gpu`/`nvgpu` ops
+  into NVVM using `-convert-gpu-to-nvvm` and `-convert-nvgpu-to-nvvm`, then lower
+  to the LLVM dialect for final code generation via LLVM's NVPTX backend.
+
+  **Target configuration and serialization:** NVVM provides a `#nvvm.target`
+  attribute to describe the GPU target (SM, features, and flags). In conjunction
+  with `gpu` serialization (e.g., `gpu-module-to-binary`), this enables producing
+  architecture-specific GPU binaries (such as CUBIN) from nested GPU modules.
+
+  **Inline PTX:** When an intrinsic is unavailable or a performance-critical sequence
+  must be expressed directly, NVVM provides an `nvvm.inline_ptx` op to embed PTX
+  inline as a last-resort escape hatch, with explicit operands and results.
+  }];
+
   let name = "nvvm";
   let cppNamespace = "::mlir::NVVM";
   let dependentDialects = ["LLVM::LLVMDialect"];
@@ -976,7 +1016,7 @@ def NVVM_ShflOp :
   let description = [{
     The `shfl.sync` Op implements data shuffle within threads of a warp.
     The `thread_mask` denotes the threads participating in the Op where
-    the bit position corresponds to a particular thread’s laneid.
+    the bit position corresponds to a particular thread's laneid.
     The `offset` specifies a source lane or source lane offset
     (depending on `kind`). The `val` is the input value to be copied from
     the source. The `mask_and_clamp` contains two packed values specifying
@@ -1031,7 +1071,7 @@ def NVVM_VoteSyncOp
     - `ballot`: In the ballot form, the destination result is a 32 bit integer.
       In this form, the predicate from each thread in membermask are copied into
       the corresponding bit position of the result, where the bit position
-      corresponds to the thread’s lane id.
+      corresponds to the thread's lane id.
 
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#parallel-synchronization-and-communication-instructions-vote-sync)
   }];

``````````
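The target-configuration paragraph in the new description can be illustrated with a short sketch. This is a hedged example, not part of the PR: the `chip`/`features` values are illustrative, and the attribute parameters are assumed from the description of `#nvvm.target`:

```mlir
// Hedged sketch: a GPU module annotated with an #nvvm.target attribute so
// that gpu serialization (e.g., gpu-module-to-binary) can produce an
// architecture-specific binary. The chip/features values are illustrative.
gpu.module @kernels [#nvvm.target<chip = "sm_90", features = "+ptx80">] {
  // kernel functions would live here
}
```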
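For the `shfl.sync` op whose description the diff touches, a hedged usage sketch may help; the assembly format and operand names below are assumed from the op description, not taken from this PR:

```mlir
// Hedged sketch of a warp-level butterfly shuffle. Per the description:
// %mask selects the participating lanes (one bit per lane id), %offset is
// the source-lane offset for the bfly kind, and %clamp packs the two
// segment-bound values.
%res = nvvm.shfl.sync bfly %mask, %val, %offset, %clamp : i32 -> i32
```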

</details>


https://github.com/llvm/llvm-project/pull/156807


More information about the Mlir-commits mailing list