[Mlir-commits] [mlir] [mlir][docs] Clarified Dialect creation tutorial + fixed typos (PR #77820)

Perry Gibson llvmlistbot at llvm.org
Thu Jan 11 11:33:16 PST 2024


https://github.com/Wheest created https://github.com/llvm/llvm-project/pull/77820

This PR clarifies the ["Creating a Dialect" tutorial](https://mlir.llvm.org/docs/Tutorials/CreatingADialect/) regarding how to register a dialect (in particular, how to register it globally).

The motivation came from adding my own dialect and hitting some friction, since it had been a while since I last did so.
[This forum post](https://discourse.llvm.org/t/how-to-register-a-dialect/66848/2) discusses the issue, but doesn't cover the global registration approach.
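
For context, the two routes look roughly as follows: globally, the dialect is added to the `registry.insert<...>()` call in `include/mlir/InitAllDialects.h`; locally, a tool registers only what it needs. Below is a minimal sketch of the local route, modelled on the standalone example's `*-opt` driver. `foo::FooDialect`, the `Foo/FooDialect.h` header, and the `foo-opt` name are placeholders, not part of this patch:

```c++
// Minimal sketch of a standalone "foo-opt" tool that registers FooDialect
// locally instead of editing include/mlir/InitAllDialects.h.
// "Foo/FooDialect.h" and foo::FooDialect are hypothetical placeholders.
#include "Foo/FooDialect.h"

#include "mlir/IR/DialectRegistry.h"
#include "mlir/InitAllDialects.h"
#include "mlir/Tools/mlir-opt/MlirOptMain.h"

int main(int argc, char **argv) {
  mlir::DialectRegistry registry;
  registry.insert<foo::FooDialect>();
  // Optionally also make the upstream dialects available, as mlir-opt does:
  mlir::registerAllDialects(registry);
  return mlir::asMainReturnCode(
      mlir::MlirOptMain(argc, argv, "foo-opt driver\n", registry));
}
```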

 [I wrote a blog post](https://gibsonic.org/blog/2024/01/11/new_mlir_dialect.html) that walks through a minimal working example with stripped-down TableGen files, but that level of detail might be overkill for this tutorial. However, I could add something along those lines if it is thought to be helpful.

 This PR also includes some spelling corrections across the mlir directory. The methodology was conservative: I ran the [codespell tool](https://github.com/codespell-project/codespell) and used my own discretion for each correction. Corrections were made only in comments and docs. To keep this PR small, only a fraction of the identified errors were corrected.

>From fe4cce6744d1e0feeac403874a5f79f907152377 Mon Sep 17 00:00:00 2001
From: Perry Gibson <perry at gibsonic.org>
Date: Thu, 11 Jan 2024 19:46:20 +0100
Subject: [PATCH 1/2] Added registration to CreatingADialect tutorial

---
 mlir/docs/DefiningDialects/_index.md    | 30 ++++++++++++-------------
 mlir/docs/Tutorials/CreatingADialect.md | 14 ++++++++----
 mlir/utils/emacs/mlir-lsp-client.el     |  4 ++--
 3 files changed, 27 insertions(+), 21 deletions(-)

diff --git a/mlir/docs/DefiningDialects/_index.md b/mlir/docs/DefiningDialects/_index.md
index 5a3993508fce88..78117a7f74358f 100644
--- a/mlir/docs/DefiningDialects/_index.md
+++ b/mlir/docs/DefiningDialects/_index.md
@@ -43,7 +43,7 @@ extends to all of the MLIR constructs, including [Interfaces](../Interfaces.md)
 
 ```tablegen
 // Include the definition of the necessary tablegen constructs for defining
-// our dialect. 
+// our dialect.
 include "mlir/IR/DialectBase.td"
 
 // Here is a simple definition of a dialect.
@@ -84,7 +84,7 @@ void MyDialect::initialize() {
 
 The `summary` and `description` fields allow for providing user documentation
 for the dialect. The `summary` field expects a simple single-line string, with the
-`description` field used for long and extensive documentation. This documentation can be 
+`description` field used for long and extensive documentation. This documentation can be
 used to generate markdown documentation for the dialect and is used by upstream
 [MLIR dialects](https://mlir.llvm.org/docs/Dialects/).
 
@@ -133,7 +133,7 @@ void MyOp::setOtherValue(StringAttr newValue);
 
 ### Dependent Dialects
 
-MLIR has a very large ecosystem, and contains dialects that server many different purposes. It
+MLIR has a very large ecosystem, and contains dialects that serve many different purposes. It
 is quite common, given the above, that dialects may want to reuse certain components from other
 dialects. This may mean generating operations from those dialects during canonicalization, reusing
 attributes or types, etc. When a dialect has a dependency on another, i.e. when it constructs and/or
@@ -230,7 +230,7 @@ is verified. The hook necessary for the dialect to implement has the form:
 /// Verify the use of the given attribute, whose name is prefixed by the namespace of this
 /// dialect, that was used on the attribute dictionary of a region entry block argument.
 /// Note: As described above, when a region entry block has a dictionary is up to the individual
-/// operation to define. 
+/// operation to define.
 LogicalResult MyDialect::verifyRegionArgAttribute(Operation *op, unsigned regionIndex,
                                                   unsigned argIndex, NamedAttribute attribute);
 ```
@@ -250,7 +250,7 @@ has the form:
 /// Generate verification for the given attribute, whose name is prefixed by the namespace
 /// of this dialect, that was used on the attribute dictionary of a region result.
 /// Note: As described above, when a region entry block has a dictionary is up to the individual
-/// operation to define. 
+/// operation to define.
 LogicalResult MyDialect::verifyRegionResultAttribute(Operation *op, unsigned regionIndex,
                                                      unsigned argIndex, NamedAttribute attribute);
 ```
@@ -258,8 +258,8 @@ LogicalResult MyDialect::verifyRegionResultAttribute(Operation *op, unsigned reg
 ### Operation Interface Fallback
 
 Some dialects have an open ecosystem and don't register all of the possible operations. In such
-cases it is still possible to provide support for implementing an `OpInterface` for these 
-operations. When an operation isn't registered or does not provide an implementation for an 
+cases it is still possible to provide support for implementing an `OpInterface` for these
+operations. When an operation isn't registered or does not provide an implementation for an
 interface, the query will fallback to the dialect itself. The `hasOperationInterfaceFallback`
 field may be used to declare this fallback for operations:
 
@@ -269,10 +269,10 @@ field may be used to declare this fallback for operations:
 void *MyDialect::getRegisteredInterfaceForOp(TypeID typeID, StringAttr opName);
 ```
 
-For a more detail description of the expected usages of this hook, view the detailed 
+For a more detail description of the expected usages of this hook, view the detailed
 [interface documentation](../Interfaces.md#dialect-fallback-for-opinterface).
 
-### Default Attribute/Type Parsers and Printers 
+### Default Attribute/Type Parsers and Printers
 
 When a dialect registers an Attribute or Type, it must also override the respective
 `Dialect::parseAttribute`/`Dialect::printAttribute` or
@@ -286,7 +286,7 @@ parser and printer of its Attributes and Types it should set these to `0` as nec
 
 ### Dialect-wide Canonicalization Patterns
 
-Generally, [canonicalization](../Canonicalization.md) patterns are specific to individual 
+Generally, [canonicalization](../Canonicalization.md) patterns are specific to individual
 operations within a dialect. There are some cases, however, that prompt canonicalization
 patterns to be added to the dialect-level. For example, if a dialect defines a canonicalization
 pattern that operates on an interface or trait, it can be beneficial to only add this pattern
@@ -514,7 +514,7 @@ AbstractOperation::VerifyInvariantsFn verifyFn = [](Operation* op) {
 AbstractOperation::ParseAssemblyFn parseFn =
     [](OpAsmParser &parser, OperationState &state) {
         // Parse the operation, given that the name is already parsed.
-        ...    
+        ...
 };
 
 // Printer function
@@ -526,14 +526,14 @@ auto printFn = [](Operation *op, OpAsmPrinter &printer) {
 
 // General folder implementation, see AbstractOperation::foldHook for more
 // information.
-auto foldHookFn = [](Operation * op, ArrayRef<Attribute> operands, 
+auto foldHookFn = [](Operation * op, ArrayRef<Attribute> operands,
                                    SmallVectorImpl<OpFoldResult> &result) {
     ...
 };
 
 // Returns any canonicalization pattern rewrites that the operation
 // supports, for use by the canonicalization pass.
-auto getCanonicalizationPatterns = 
+auto getCanonicalizationPatterns =
         [](RewritePatternSet &results, MLIRContext *context) {
     ...
 }
@@ -635,7 +635,7 @@ though overriden `parseType` methods need to add the necessary support for them.
 ```c++
 Type MyDialect::parseType(DialectAsmParser &parser) const {
     ...
-    
+
     // The type name.
     StringRef typeTag;
     if (failed(parser.parseKeyword(&typeTag)))
@@ -649,7 +649,7 @@ Type MyDialect::parseType(DialectAsmParser &parser) const {
             return dynType;
          return Type();
     }
-    
+
     ...
 }
 ```
diff --git a/mlir/docs/Tutorials/CreatingADialect.md b/mlir/docs/Tutorials/CreatingADialect.md
index af709fc46eff51..6ea56a0fcdefdd 100644
--- a/mlir/docs/Tutorials/CreatingADialect.md
+++ b/mlir/docs/Tutorials/CreatingADialect.md
@@ -10,10 +10,10 @@ Public dialects are typically separated into at least 3 directories:
 * mlir/test/Dialect/Foo           (for tests)
 
 Along with other public headers, the 'include' directory contains a
-TableGen file in the [ODS format](../DefiningDialects/Operations.md), describing the
-operations in the dialect.  This is used to generate operation
-declarations (FooOps.h.inc) and definitions (FooOps.cpp.inc) and
-operation interface declarations (FooOpsInterfaces.h.inc) and
+TableGen file in the [ODS format](../DefiningDialects/Operations.md),
+describing the operations in the dialect.  This is used to generate
+operation declarations (FooOps.h.inc) and definitions (FooOps.cpp.inc)
+and operation interface declarations (FooOpsInterfaces.h.inc) and
 definitions (FooOpsInterfaces.cpp.inc).
 
 The 'IR' directory typically contains implementations of functions for
@@ -106,6 +106,12 @@ the LINK_COMPONENTS descriptor.  This allows cmake infrastructure to
 generate new library targets with correct linkage, in particular, when
 BUILD_SHARED_LIBS=on or LLVM_LINK_LLVM_DYLIB=on are specified.
 
+Registration of the dialect can be performed globally by editing the
+file include/mlir/InitAllDialects.h.  In this example, we would add
+`foo::FooDialect` to the `registry.insert` call.  This makes the
+dialect available to all MLIR programs that initialize their registries
+with `registerAllDialects` (e.g. mlir-opt).  Alternatively, we can
+register the dialect locally, only in the programs that require it.
 
 # Dialect Conversions
 
diff --git a/mlir/utils/emacs/mlir-lsp-client.el b/mlir/utils/emacs/mlir-lsp-client.el
index 4397a55e7206ac..8e8f1f2d0c56f2 100644
--- a/mlir/utils/emacs/mlir-lsp-client.el
+++ b/mlir/utils/emacs/mlir-lsp-client.el
@@ -1,4 +1,4 @@
-;;; mlir-lsp-clinet.el --- LSP clinet for the MLIR.
+;;; mlir-lsp-client.el --- LSP client for the MLIR.
 
 ;; Copyright (C) 2022 The MLIR Authors.
 ;;
@@ -18,7 +18,7 @@
 
 ;;; Commentary:
 
-;; LSP clinet to use with `mlir-mode' that uses `mlir-lsp-server' or any
+;; LSP client to use with `mlir-mode' that uses `mlir-lsp-server' or any
 ;; user made compatible server.
 
 ;;; Code:

>From 216248c81a562d74bc6fcd5f1d696ec9f61d4ec2 Mon Sep 17 00:00:00 2001
From: Perry Gibson <perry at gibsonic.org>
Date: Thu, 11 Jan 2024 20:08:23 +0100
Subject: [PATCH 2/2] Updated spelling errors in docs/comments

---
 .../mlir/Analysis/DataFlow/DenseAnalysis.h    |   4 +-
 .../Analysis/DataFlow/IntegerRangeAnalysis.h  |   2 +-
 .../include/mlir/Analysis/DataFlowFramework.h |   4 +-
 mlir/include/mlir/Analysis/Presburger/MPInt.h |   2 +-
 .../include/mlir/Analysis/Presburger/Matrix.h |   2 +-
 .../mlir/Analysis/Presburger/PWMAFunction.h   |   4 +-
 .../Analysis/Presburger/PresburgerRelation.h  |   2 +-
 .../mlir/Analysis/Presburger/Simplex.h        |   8 +-
 .../mlir/Dialect/Affine/IR/AffineOps.h        |  16 +-
 mlir/include/mlir/Dialect/Async/Passes.td     |   6 +-
 .../Dialect/GPU/IR/CompilationInterfaces.h    |   2 +-
 mlir/include/mlir/Dialect/GPU/IR/GPUBase.td   |   4 +-
 mlir/include/mlir/Dialect/GPU/IR/GPUOps.td    |   2 +-
 .../GPU/TransformOps/GPUDeviceMappingAttr.td  |  98 ++++-----
 mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td   | 192 +++++++++---------
 .../mlir/Dialect/LLVMIR/Transforms/Passes.td  |   4 +-
 mlir/include/mlir/Dialect/Math/IR/MathOps.td  |   2 +-
 .../mlir/Dialect/MemRef/IR/MemRefOps.td       |   4 +-
 .../Dialect/MemRef/Transforms/Transforms.h    |   2 +-
 mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td   | 164 +++++++--------
 .../Dialect/NVGPU/Transforms/Transforms.h     |   2 +-
 .../mlir/Dialect/SparseTensor/IR/Enums.h      |   2 +-
 .../SparseTensor/IR/SparseTensorOps.td        |   8 +-
 .../Dialect/SparseTensor/Transforms/Passes.td |   2 +-
 .../mlir/ExecutionEngine/ExecutionEngine.h    |   2 +-
 .../mlir/ExecutionEngine/MemRefUtils.h        |   2 +-
 .../ExecutionEngine/SparseTensorRuntime.h     |   2 +-
 mlir/python/mlir/dialects/func.py             |   2 +-
 .../mlir/dialects/linalg/opdsl/lang/affine.py |   4 +-
 29 files changed, 277 insertions(+), 273 deletions(-)

diff --git a/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h b/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h
index 088b6cd7d698fc..b72d903a5edbe6 100644
--- a/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h
+++ b/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h
@@ -293,7 +293,7 @@ class DenseForwardDataFlowAnalysis
 
 /// Base class for dense backward dataflow analyses. Such analyses attach a
 /// lattice between the execution of operations and implement a transfer
-/// function from the lattice after the operation ot the lattice before it, thus
+/// function from the lattice after the operation to the lattice before it, thus
 /// propagating backward.
 ///
 /// In this implementation, a lattice attached to an operation represents the
@@ -426,7 +426,7 @@ class AbstractDenseBackwardDataFlowAnalysis : public DataFlowAnalysis {
 
 /// A dense backward dataflow analysis propagating lattices after and before the
 /// execution of every operation across the IR by implementing transfer
-/// functions for opreations.
+/// functions for operations.
 ///
 /// `LatticeT` is expected to be a subclass of `AbstractDenseLattice`.
 template <typename LatticeT>
diff --git a/mlir/include/mlir/Analysis/DataFlow/IntegerRangeAnalysis.h b/mlir/include/mlir/Analysis/DataFlow/IntegerRangeAnalysis.h
index 8bd7cf880c6afb..811ecf2835b268 100644
--- a/mlir/include/mlir/Analysis/DataFlow/IntegerRangeAnalysis.h
+++ b/mlir/include/mlir/Analysis/DataFlow/IntegerRangeAnalysis.h
@@ -92,7 +92,7 @@ class IntegerRangeAnalysis
 public:
   using SparseForwardDataFlowAnalysis::SparseForwardDataFlowAnalysis;
 
-  /// At an entry point, we cannot reason about interger value ranges.
+  /// At an entry point, we cannot reason about integer value ranges.
   void setToEntryState(IntegerValueRangeLattice *lattice) override {
     propagateIfChanged(lattice, lattice->join(IntegerValueRange::getMaxRange(
                                     lattice->getPoint())));
diff --git a/mlir/include/mlir/Analysis/DataFlowFramework.h b/mlir/include/mlir/Analysis/DataFlowFramework.h
index c76cfac07fc77a..1c594068fc8246 100644
--- a/mlir/include/mlir/Analysis/DataFlowFramework.h
+++ b/mlir/include/mlir/Analysis/DataFlowFramework.h
@@ -305,7 +305,7 @@ class DataFlowSolver {
 /// these requirements.
 ///
 /// 1. Querying the state of a program point prior to visiting that point
-///    results in uninitialized state. Analyses must be aware of unintialized
+///    results in uninitialized state. Analyses must be aware of uninitialized
 ///    states.
 /// 2. Analysis states can reach fixpoints, where subsequent updates will never
 ///    trigger a change in the state.
@@ -462,7 +462,7 @@ class DataFlowAnalysis {
   const DataFlowConfig &getSolverConfig() const { return solver.getConfig(); }
 
 #if LLVM_ENABLE_ABI_BREAKING_CHECKS
-  /// When compiling with debugging, keep a name for the analyis.
+  /// When compiling with debugging, keep a name for the analysis.
   StringRef debugName;
 #endif // LLVM_ENABLE_ABI_BREAKING_CHECKS
 
diff --git a/mlir/include/mlir/Analysis/Presburger/MPInt.h b/mlir/include/mlir/Analysis/Presburger/MPInt.h
index 12ab0598d10d99..fb95b6c19c487d 100644
--- a/mlir/include/mlir/Analysis/Presburger/MPInt.h
+++ b/mlir/include/mlir/Analysis/Presburger/MPInt.h
@@ -29,7 +29,7 @@ namespace presburger {
 /// identically-named functions that operate on MPInts, which would otherwie
 /// become the only candidates of overload resolution when calling e.g. ceilDiv
 /// from the mlir::presburger namespace. So to access the 64-bit overloads, an
-/// explict call to mlir::ceilDiv would be required. These using declarations
+/// explicit call to mlir::ceilDiv would be required. These using declarations
 /// allow overload resolution to transparently call the right function.
 using ::mlir::ceilDiv;
 using ::mlir::floorDiv;
diff --git a/mlir/include/mlir/Analysis/Presburger/Matrix.h b/mlir/include/mlir/Analysis/Presburger/Matrix.h
index 38fac50c13536e..1d454a598c6d43 100644
--- a/mlir/include/mlir/Analysis/Presburger/Matrix.h
+++ b/mlir/include/mlir/Analysis/Presburger/Matrix.h
@@ -46,7 +46,7 @@ class Matrix {
 
   /// Construct a matrix with the specified number of rows and columns.
   /// The number of reserved rows and columns will be at least the number
-  /// specified, and will always be sufficient to accomodate the number of rows
+  /// specified, and will always be sufficient to accommodate the number of rows
   /// and columns specified.
   ///
   /// Initially, the entries are initialized to ero.
diff --git a/mlir/include/mlir/Analysis/Presburger/PWMAFunction.h b/mlir/include/mlir/Analysis/Presburger/PWMAFunction.h
index 236cc90ad66acd..6699f5469e4d16 100644
--- a/mlir/include/mlir/Analysis/Presburger/PWMAFunction.h
+++ b/mlir/include/mlir/Analysis/Presburger/PWMAFunction.h
@@ -117,7 +117,7 @@ class MultiAffineFunction {
   /// The space of this function. The domain variables are considered as the
   /// input variables of the function. The range variables are considered as
   /// the outputs. The symbols parametrize the function and locals are used to
-  /// represent divisions. Each local variable has a corressponding division
+  /// represent divisions. Each local variable has a corresponding division
   /// representation stored in `divs`.
   PresburgerSpace space;
 
@@ -239,7 +239,7 @@ class PWMAFunction {
 
   /// The space of this function. The domain variables are considered as the
   /// input variables of the function. The range variables are considered as
-  /// the outputs. The symbols paramterize the function.
+  /// the outputs. The symbols parameterize the function.
   PresburgerSpace space;
 
   // The pieces of the PWMAFunction.
diff --git a/mlir/include/mlir/Analysis/Presburger/PresburgerRelation.h b/mlir/include/mlir/Analysis/Presburger/PresburgerRelation.h
index c6b00eca90733a..8a64e1e5ac799c 100644
--- a/mlir/include/mlir/Analysis/Presburger/PresburgerRelation.h
+++ b/mlir/include/mlir/Analysis/Presburger/PresburgerRelation.h
@@ -250,7 +250,7 @@ class PresburgerSet : public PresburgerRelation {
   explicit PresburgerSet(const PresburgerRelation &set);
 
   /// These operations are the same as the ones in PresburgeRelation, they just
-  /// forward the arguement and return the result as a set instead of a
+  /// forward the argument and return the result as a set instead of a
   /// relation.
   PresburgerSet unionSet(const PresburgerRelation &set) const;
   PresburgerSet intersect(const PresburgerRelation &set) const;
diff --git a/mlir/include/mlir/Analysis/Presburger/Simplex.h b/mlir/include/mlir/Analysis/Presburger/Simplex.h
index 9482f69b31cd66..6cb961057e1999 100644
--- a/mlir/include/mlir/Analysis/Presburger/Simplex.h
+++ b/mlir/include/mlir/Analysis/Presburger/Simplex.h
@@ -348,7 +348,7 @@ class SimplexBase {
   SmallVector<UndoLogEntry, 8> undoLog;
 
   /// Holds a vector of bases. The ith saved basis is the basis that should be
-  /// restored when processing the ith occurrance of UndoLogEntry::RestoreBasis
+  /// restored when processing the ith occurrence of UndoLogEntry::RestoreBasis
   /// in undoLog. This is used by getSnapshotBasis.
   SmallVector<SmallVector<int, 8>, 8> savedBases;
 
@@ -371,7 +371,7 @@ class SimplexBase {
 ///
 /// This does not directly support negative-valued variables, so it uses the big
 /// M parameter trick to make all the variables non-negative. Basically we
-/// introduce an artifical variable M that is considered to have a value of
+/// introduce an artificial variable M that is considered to have a value of
 /// +infinity and instead of the variables x, y, z, we internally use variables
 /// M + x, M + y, M + z, which are now guaranteed to be non-negative. See the
 /// documentation for SimplexBase for more details. M is also considered to be
@@ -565,7 +565,7 @@ struct SymbolicLexOpt {
 /// negative for all values in the symbol domain, the row needs to be pivoted
 /// irrespective of the precise value of the symbols. To answer queries like
 /// "Is this symbolic sample always negative in the symbol domain?", we maintain
-/// a `LexSimplex domainSimplex` correponding to the symbol domain.
+/// a `LexSimplex domainSimplex` corresponding to the symbol domain.
 ///
 /// In other cases, it may be that the symbolic sample is violated at some
 /// values in the symbol domain and not violated at others. In this case,
@@ -756,7 +756,7 @@ class Simplex : public SimplexBase {
   /// the ones marked redundant because we scan from left to right. Thus, when
   /// there is some preference among the constraints as to which should be
   /// marked redundant with priority when there are multiple possibilities, this
-  /// could be accomplished by succesive calls to detectRedundant(offset,
+  /// could be accomplished by successive calls to detectRedundant(offset,
   /// count).
   void detectRedundant(unsigned offset, unsigned count);
   void detectRedundant(unsigned offset) {
diff --git a/mlir/include/mlir/Dialect/Affine/IR/AffineOps.h b/mlir/include/mlir/Dialect/Affine/IR/AffineOps.h
index f070d048861906..34ce59bf870f00 100644
--- a/mlir/include/mlir/Dialect/Affine/IR/AffineOps.h
+++ b/mlir/include/mlir/Dialect/Affine/IR/AffineOps.h
@@ -117,7 +117,8 @@ class AffineDmaStartOp
   /// Returns the affine map used to access the source memref.
   AffineMap getSrcMap() { return getSrcMapAttr().getValue(); }
   AffineMapAttr getSrcMapAttr() {
-    return cast<AffineMapAttr>(*(*this)->getInherentAttr(getSrcMapAttrStrName()));
+    return cast<AffineMapAttr>(
+        *(*this)->getInherentAttr(getSrcMapAttrStrName()));
   }
 
   /// Returns the source memref affine map indices for this DMA operation.
@@ -156,7 +157,8 @@ class AffineDmaStartOp
   /// Returns the affine map used to access the destination memref.
   AffineMap getDstMap() { return getDstMapAttr().getValue(); }
   AffineMapAttr getDstMapAttr() {
-    return cast<AffineMapAttr>(*(*this)->getInherentAttr(getDstMapAttrStrName()));
+    return cast<AffineMapAttr>(
+        *(*this)->getInherentAttr(getDstMapAttrStrName()));
   }
 
   /// Returns the destination memref indices for this DMA operation.
@@ -185,7 +187,8 @@ class AffineDmaStartOp
   /// Returns the affine map used to access the tag memref.
   AffineMap getTagMap() { return getTagMapAttr().getValue(); }
   AffineMapAttr getTagMapAttr() {
-    return cast<AffineMapAttr>(*(*this)->getInherentAttr(getTagMapAttrStrName()));
+    return cast<AffineMapAttr>(
+        *(*this)->getInherentAttr(getTagMapAttrStrName()));
   }
 
   /// Returns the tag memref indices for this DMA operation.
@@ -201,7 +204,7 @@ class AffineDmaStartOp
                       getTagMap().getNumInputs());
   }
 
-  /// Impelements the AffineMapAccessInterface.
+  /// Implements the AffineMapAccessInterface.
   /// Returns the AffineMapAttr associated with 'memref'.
   NamedAttribute getAffineMapAttrForMemRef(Value memref) {
     if (memref == getSrcMemRef())
@@ -307,7 +310,8 @@ class AffineDmaWaitOp
   /// Returns the affine map used to access the tag memref.
   AffineMap getTagMap() { return getTagMapAttr().getValue(); }
   AffineMapAttr getTagMapAttr() {
-    return cast<AffineMapAttr>(*(*this)->getInherentAttr(getTagMapAttrStrName()));
+    return cast<AffineMapAttr>(
+        *(*this)->getInherentAttr(getTagMapAttrStrName()));
   }
 
   /// Returns the tag memref index for this DMA operation.
@@ -321,7 +325,7 @@ class AffineDmaWaitOp
     return cast<MemRefType>(getTagMemRef().getType()).getRank();
   }
 
-  /// Impelements the AffineMapAccessInterface. Returns the AffineMapAttr
+  /// Implements the AffineMapAccessInterface. Returns the AffineMapAttr
   /// associated with 'memref'.
   NamedAttribute getAffineMapAttrForMemRef(Value memref) {
     assert(memref == getTagMemRef());
diff --git a/mlir/include/mlir/Dialect/Async/Passes.td b/mlir/include/mlir/Dialect/Async/Passes.td
index f0ef83ca3fd4f1..d4fbbf51907bc6 100644
--- a/mlir/include/mlir/Dialect/Async/Passes.td
+++ b/mlir/include/mlir/Dialect/Async/Passes.td
@@ -58,7 +58,7 @@ def AsyncFuncToAsyncRuntime : Pass<"async-func-to-async-runtime", "ModuleOp"> {
 def AsyncRuntimeRefCounting : Pass<"async-runtime-ref-counting"> {
   let summary = "Automatic reference counting for Async runtime operations";
   let description = [{
-    This pass works at the async runtime abtraction level, after all
+    This pass works at the async runtime abstraction level, after all
     `async.execute` and `async.await` operations are lowered to the async
     runtime API calls, and async coroutine operations.
 
@@ -83,7 +83,7 @@ def AsyncRuntimePolicyBasedRefCounting
     : Pass<"async-runtime-policy-based-ref-counting"> {
   let summary = "Policy based reference counting for Async runtime operations";
   let description = [{
-    This pass works at the async runtime abtraction level, after all
+    This pass works at the async runtime abstraction level, after all
     `async.execute` and `async.await` operations are lowered to the async
     runtime API calls, and async coroutine operations.
 
@@ -103,7 +103,7 @@ def AsyncRuntimePolicyBasedRefCounting
          (this is the last operation in the coroutine resume function).
       3. After `async.runtime.load` operation for async values.
 
-    This pass introduces significanly less runtime overhead compared to the
+    This pass introduces significantly less runtime overhead compared to the
     automatic reference counting.
   }];
 
diff --git a/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h b/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
index 6d7cb5ca7a7f81..7a787ef7fcee3d 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
+++ b/mlir/include/mlir/Dialect/GPU/IR/CompilationInterfaces.h
@@ -85,7 +85,7 @@ class TargetOptions {
 
 protected:
   /// Derived classes must use this constructor to initialize `typeID` to the
-  /// appropiate value: ie. `TargetOptions(TypeID::get<DerivedClass>())`.
+  /// appropriate value: ie. `TargetOptions(TypeID::get<DerivedClass>())`.
   TargetOptions(
       TypeID typeID, StringRef toolkitPath = {},
       ArrayRef<std::string> linkFiles = {}, StringRef cmdOptions = {},
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
index 7b9d46fda12f51..7414f70c99cdf3 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUBase.td
@@ -102,7 +102,7 @@ def GPU_AsyncToken : DialectType<
   GPU_Dialect, CPred<"::llvm::isa<::mlir::gpu::AsyncTokenType>($_self)">, "async token type">,
              BuildableType<"mlir::gpu::AsyncTokenType::get($_builder.getContext())">;
 
-// Predicat to check if type is gpu::MMAMatrixType.
+// Predicate to check if type is gpu::MMAMatrixType.
 def IsMMAMatrixTypePred : CPred<"::llvm::isa<::mlir::gpu::MMAMatrixType>($_self)">;
 
 def GPU_MMAMatrix : DialectType<
@@ -139,7 +139,7 @@ def GPU_AsyncOpInterface : OpInterface<"AsyncOpInterface"> {
     GPU operations implementing this interface take a list of dependencies
     as `gpu.async.token` arguments and optionally return a `gpu.async.token`.
 
-    The op doesn't start executing until all depent ops producing the async
+    The op doesn't start executing until all dependent ops producing the async
     dependency tokens have finished executing.
 
     If the op returns a token, the op merely schedules the execution on the
diff --git a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
index 8d4a110ee801f0..4cf517502629a0 100644
--- a/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
+++ b/mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
@@ -1682,7 +1682,7 @@ def GPU_SubgroupMmaConstantMatrixOp : GPU_Op<"subgroup_mma_constant_matrix",
 
     The operation takes a scalar input and return a `!gpu.mma_matrix` where
     each element of is equal to the operand constant. The destination
-    mma_matrix type must have elememt type equal to the constant type. Since
+    mma_matrix type must have element type equal to the constant type. Since
     the layout of `!gpu.mma_matrix` is opaque this only support setting all the
     elements to the same value.
 
diff --git a/mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td b/mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td
index 6e0f6f1d78eda7..6503e2512e2f24 100644
--- a/mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td
+++ b/mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMappingAttr.td
@@ -36,7 +36,7 @@ def LinearDim9 : I64EnumAttrCase<"LinearDim9", 12, "linear_dim_9">;
 // so we currently embed the 2 modes in the same enum.
 def MappingIdEnum : I64EnumAttr<"MappingId", "Mapping ids for loop mapping", [
     DimX, DimY, DimZ,
-    LinearDim0, LinearDim1, LinearDim2, LinearDim3, LinearDim4, 
+    LinearDim0, LinearDim1, LinearDim2, LinearDim3, LinearDim4,
     LinearDim5, LinearDim6, LinearDim7, LinearDim8, LinearDim9]> {
   let cppNamespace = "::mlir::gpu";
 }
@@ -50,7 +50,7 @@ def GPUBlockMappingAttr : GPU_Attr<"GPUBlockMapping", "block", [
   let description = [{
     An attribute that allows defining thread block parallelism for GPU devices.
 
-    Thread blocks (aka workgroup) are grouped into a grid described by a 
+    Thread blocks (aka workgroup) are grouped into a grid described by a
     3-dimensional rectangle.
     This attribute indicates that thread block parallelism is desired.
     It can be consumed by lowering to generate GPU code.
@@ -58,13 +58,13 @@ def GPUBlockMappingAttr : GPU_Attr<"GPUBlockMapping", "block", [
 
     #### 3D mapping mode
 
-    The 3D block id is simply the 3D index of the block `(bidx, bidy, bidz)`. 
-    If required, predication occurs on a per-dimension basis. This allows 
+    The 3D block id is simply the 3D index of the block `(bidx, bidy, bidz)`.
+    If required, predication occurs on a per-dimension basis. This allows
     specifying predication on a 3D sub-rectangle of the grid.
 
     #### Linear mapping mode
 
-    The linear block id is obtained by linearizing the index of the block. 
+    The linear block id is obtained by linearizing the index of the block.
     If required, predication occurs on the linear id. This allows specifying
     predication on a 1D subset of the (linearized) grid.
 
@@ -72,15 +72,15 @@ def GPUBlockMappingAttr : GPU_Attr<"GPUBlockMapping", "block", [
     denoted by (bx, by, bz), the block id is:
       `linear_id = bx + by * GX + bz * GX * GBY)`.
     The linear block id is fixed for the duration of a GPU kernel.
-    
+
     This linear id mapping attribute indicates a different linearization relation
-    is applied locally to a loop nest. 
-    
-    For instance, if the new basis is denoted as (LBD0, LBD1, LBD2, LBD3) the 
+    is applied locally to a loop nest.
+
+    For instance, if the new basis is denoted as (LBD0, LBD1, LBD2, LBD3) the
     block id in the new basis is:
-      ```(linear_id mod LBD0 , 
-          (linear_id / LBD0) mod * LBD1, 
-          (linear_id / (LBD0 * LBD1)) mod LBD2, 
+      ```(linear_id mod LBD0 ,
+          (linear_id / LBD0) mod * LBD1,
+          (linear_id / (LBD0 * LBD1)) mod LBD2,
           (linear_id / (LBD0 * LBD1 * LBD2)) mod LBD3)```.
     This reinterpretation is only fixed for the duration of a loop nest.
   }];
@@ -96,41 +96,41 @@ def GPUWarpgroupMappingAttr
   let description = [{
     An attribute that allows defining warpgroup parallelism for GPU devices.
 
-    Threads of proper granularity (e.g. multiple of 
+    Threads of proper granularity (e.g. multiple of
     "kNumWarpsPerGroup * kWarpSize" on CUDA devices) can be grouped into
-    warpgroups described by a 3-dimensional rectangle. 
-    This attribute indicates that warpgroup parallelism is desired. 
+    warpgroups described by a 3-dimensional rectangle.
+    This attribute indicates that warpgroup parallelism is desired.
     It can be consumed by lowering to generate GPU code.
     2 modes are supported: (1) 3D mapping mode and (2) linear mapping mode.
 
     #### 3D mapping mode
 
-    The 3D warpgroup id is simply the adjusted 3D index of the thread 
+    The 3D warpgroup id is simply the adjusted 3D index of the thread
     `(tidx / (kNumWarpsPerGroup * kWarpSize), tidy, tidz)`.
-    If required, predication occurs on a per-dimension basis. This allows 
+    If required, predication occurs on a per-dimension basis. This allows
     specifying predication on a 3D sub-rectangle of the warpgroups.
 
     #### Linear mapping mode
 
     The linear warpgroup id is obtained by linearizing the index of the warpgroup.
     If required, predication occurs on the linear id. This allows specifying
-    predication on a 1D "kNumWarpsPerGroup * kWarpSize"-aligned subset of the 
+    predication on a 1D "kNumWarpsPerGroup * kWarpSize"-aligned subset of the
     (linearized) block.
 
     For instance, if the basis is denoted as (BX, BY, BZ) and the thread id is
     id is denoted by (tx, ty, tz), the linear warpgroup id is:
-      ```linear_id = (tx + ty * BX + tz * BX * BY) 
+      ```linear_id = (tx + ty * BX + tz * BX * BY)
                  / (kNumWarpsPerGroup * kWarpSize)```.
     The linear warpgroup id is fixed for the duration of a GPU kernel.
-    
+
     This linear id mapping attribute indicates a different linearization relation
-    is applied locally to a loop nest. 
-    
-    For instance, if the new basis is denoted as (LWGD0, LWGD1, LWGD2, LWGD3) the 
+    is applied locally to a loop nest.
+
+    For instance, if the new basis is denoted as (LWGD0, LWGD1, LWGD2, LWGD3) the
     warpgroup id in the new basis is:
-      ```(linear_id mod LWGD0 , 
-          (linear_id / LWGD0) mod * LWGD1, 
-          (linear_id / (LWGD0 * LWGD1)) mod LWGD2, 
+      ```(linear_id mod LWGD0 ,
+          (linear_id / LWGD0) mod * LWGD1,
+          (linear_id / (LWGD0 * LWGD1)) mod LWGD2,
           (linear_id / (LWGD0 * LWGD1 * LWGD2)) mod LWGD3)```.
     This reinterpretation is only fixed for the duration of a loop nest.
   }];
@@ -146,17 +146,17 @@ def GPUWarpMappingAttr
   let description = [{
     An attribute that allows defining warp parallelism for GPU devices.
 
-    Threads of proper granularity (e.g. multiple of "warp size" on CUDA devices) 
-    can be grouped into warps described by a 3-dimensional rectangle. 
+    Threads of proper granularity (e.g. multiple of "warp size" on CUDA devices)
+    can be grouped into warps described by a 3-dimensional rectangle.
     This attribute indicates that warp parallelism is desired.
     It can be consumed by lowering to generate GPU code.
     2 modes are supported: (1) 3D mapping mode and (2) linear mapping mode.
 
     #### 3D mapping mode
 
-    The 3D warp id is simply the adjusted 3D index of the thread 
+    The 3D warp id is simply the adjusted 3D index of the thread
     `(tidx / kWarpSize, tidy, tidz)`.
-    If required, predication occurs on a per-dimension basis. This allows 
+    If required, predication occurs on a per-dimension basis. This allows
     specifying predication on a 3D sub-rectangle of the warpgroups.
 
     #### Linear mapping mode
@@ -169,15 +169,15 @@ def GPUWarpMappingAttr
     id is denoted by (tx, ty, tz), the linear warp id is:
       `linear_id = (tx + ty * BX + tz * BX * BY) / kWarpSize`.
     The linear warp id is fixed for the duration of a GPU kernel.
-    
+
     This linear id mapping attribute indicates a different linearization relation
-    is applied locally to a loop nest. 
-    
-    For instance, if the new basis is denoted as (LWD0, LWD1, LWD2, LWD3) the 
+    is applied locally to a loop nest.
+
+    For instance, if the new basis is denoted as (LWD0, LWD1, LWD2, LWD3) the
     warp id in the new basis is:
-      ```(linear_id mod LWD0 , 
-          (linear_id / LWD0) mod * LWD1, 
-          (linear_id / (LWD0 * LWD1)) mod LWD2, 
+      ```(linear_id mod LWD0 ,
+          (linear_id / LWD0) mod * LWD1,
+          (linear_id / (LWD0 * LWD1)) mod LWD2,
           (linear_id / (LWD0 * LWD1 * LWD2)) mod LWD3)```.
     This reinterpretation is only fixed for the duration of a loop nest.
   }];
@@ -193,20 +193,20 @@ def GPUThreadMappingAttr
   let description = [{
     An attribute that allows defining thread parallelism for GPU devices.
 
-    Thread (aka work item) are grouped into a thread blocks described by a 
+    Thread (aka work item) are grouped into a thread blocks described by a
     3-dimensional rectangle.
     This attribute indicates that thread parallelism is desired.
     It can be consumed by lowering to generate GPU.
 
     #### 3D mapping mode
 
-    The 3D thread id is simply the 3D index of the thread `(tidx, tidy, tidz)`. 
-    If required, predication occurs on a per-dimension basis. This allows 
+    The 3D thread id is simply the 3D index of the thread `(tidx, tidy, tidz)`.
+    If required, predication occurs on a per-dimension basis. This allows
     specifying predication on a 3D sub-rectangle of the block.
 
     #### Linear mapping mode
 
-    The linear thread id is obtained by linearizing the index of the thread. 
+    The linear thread id is obtained by linearizing the index of the thread.
     If required, predication occurs on the linear id. This allows specifying
     predication on a 1D subset of the (linearized) block.
 
@@ -214,15 +214,15 @@ def GPUThreadMappingAttr
     id is denoted by (tx, ty, tz), the linear thread id is:
       ```linear_id = (tx + ty * BX + tz * BX * BY)```.
     The linear thread id is fixed for the duration of a GPU kernel.
-    
+
     This linear id mapping attribute indicates a different linearization relation
-    is applied locally to a loop nest. 
-    
-    For instance, if the new basis is denoted as (LTD0, LTD1, LTD2, LTD3) the 
+    is applied locally to a loop nest.
+
+    For instance, if the new basis is denoted as (LTD0, LTD1, LTD2, LTD3) the
     thread id in the new basis is:
-      ```(linear_id mod LTD0 , 
-          (linear_id / LTD0) mod * LTD1, 
-          (linear_id / (LTD0 * LTD1)) mod LTD2, 
+      ```(linear_id mod LTD0 ,
+          (linear_id / LTD0) mod * LTD1,
+          (linear_id / (LTD0 * LTD1)) mod LTD2,
           (linear_id / (LTD0 * LTD1 * LTD2)) mod LTD3)```.
     This reinterpretation is only fixed for the duration of a loop nest.
   }];
@@ -240,7 +240,7 @@ def GPUMemorySpaceMappingAttr : GPU_Attr<"GPUMemorySpaceMapping", "memory_space"
     GPU Memory has three memory space, global, workgroup, and private. The global memory
     is visible to all workitems and workgroups, the workgroup memory is only available for workitems
     within a workgroup, and private memory is only visible to a single workitem. This attribute indicates
-    that using memory hiearchy is desired. It can be consumed by lowering to
+    that using memory hierarchy is desired. It can be consumed by lowering to
     move data to a specific address space in GPU code.
   }];
 }
diff --git a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
index 1941c4dece1b86..d2f2c6e49a2f96 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
@@ -72,8 +72,8 @@ class NVVM_Op<string mnemonic, list<Trait> traits = []> :
   LLVM_OpBase<NVVM_Dialect, mnemonic, traits> {
 }
 
-/// Base class that defines BasicPtxBuilderOpInterface. 
-class NVVM_PTXBuilder_Op<string mnemonic, 
+/// Base class that defines BasicPtxBuilderOpInterface.
+class NVVM_PTXBuilder_Op<string mnemonic,
   list<Trait> traits = [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]> :
   LLVM_OpBase<NVVM_Dialect, mnemonic, traits> {
 }
@@ -183,11 +183,11 @@ def ReduxKindMin  : I32EnumAttrCase<"MIN", 4, "min">;
 def ReduxKindOr   : I32EnumAttrCase<"OR", 5, "or">;
 def ReduxKindUmax : I32EnumAttrCase<"UMAX", 6, "umax">;
 def ReduxKindUmin : I32EnumAttrCase<"UMIN", 7, "umin">;
-def ReduxKindXor  : I32EnumAttrCase<"XOR", 8, "xor">; 
+def ReduxKindXor  : I32EnumAttrCase<"XOR", 8, "xor">;
 
 /// Enum attribute of the different kinds.
 def ReduxKind : I32EnumAttr<"ReduxKind", "NVVM redux kind",
-  [ReduxKindAdd, ReduxKindAnd, ReduxKindMax, ReduxKindMin, ReduxKindOr, 
+  [ReduxKindAdd, ReduxKindAnd, ReduxKindMax, ReduxKindMin, ReduxKindOr,
     ReduxKindUmax, ReduxKindUmin, ReduxKindXor]> {
   let genSpecializedAttr = 0;
   let cppNamespace = "::mlir::NVVM";
@@ -207,7 +207,7 @@ def NVVM_ReduxOp :
   }];
   let assemblyFormat = [{
     $kind $val `,` $mask_and_clamp  attr-dict `:` type($val) `->` type($res)
-   }];   
+   }];
 }
 
 //===----------------------------------------------------------------------===//
@@ -294,7 +294,7 @@ def NVVM_MBarrierArriveNocompleteSharedOp : NVVM_Op<"mbarrier.arrive.nocomplete.
   let assemblyFormat = "$addr `,` $count attr-dict `:` type(operands) `->` type($res)";
 }
 
-def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_tx">,  
+def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_tx">,
   Arguments<(ins LLVM_AnyPointer:$addr, I32:$txcount, PtxPredicate:$predicate)> {
   let assemblyFormat = "$addr `,` $txcount (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands)";
   let extraClassDefinition = [{
@@ -302,16 +302,16 @@ def NVVM_MBarrierArriveExpectTxOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_t
   }];
 }
 
-def NVVM_MBarrierArriveExpectTxSharedOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_tx.shared">,  
-  Arguments<(ins LLVM_PointerShared:$addr, I32:$txcount, PtxPredicate:$predicate)> {    
+def NVVM_MBarrierArriveExpectTxSharedOp : NVVM_PTXBuilder_Op<"mbarrier.arrive.expect_tx.shared">,
+  Arguments<(ins LLVM_PointerShared:$addr, I32:$txcount, PtxPredicate:$predicate)> {
   let assemblyFormat = "$addr `,` $txcount (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands)";
   let extraClassDefinition = [{
     std::string $cppClass::getPtx() { return std::string("mbarrier.arrive.expect_tx.shared.b64 _, [%0], %1;"); }
   }];
 }
 
-def NVVM_MBarrierTryWaitParityOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity">,  
-  Arguments<(ins LLVM_AnyPointer:$addr, I32:$phase, I32:$ticks)> {  
+def NVVM_MBarrierTryWaitParityOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity">,
+  Arguments<(ins LLVM_AnyPointer:$addr, I32:$phase, I32:$ticks)> {
   let assemblyFormat = "$addr `,` $phase `,` $ticks attr-dict `:` type(operands)";
   let extraClassDefinition = [{
     std::string $cppClass::getPtx() {
@@ -324,13 +324,13 @@ def NVVM_MBarrierTryWaitParityOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity"
         "bra.uni     LAB_WAIT; \n\t"
         "DONE: \n\t"
         "}"
-      ); 
+      );
     }
   }];
 }
 
-def NVVM_MBarrierTryWaitParitySharedOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity.shared">,  
-  Arguments<(ins LLVM_PointerShared:$addr, I32:$phase, I32:$ticks)> {  
+def NVVM_MBarrierTryWaitParitySharedOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.parity.shared">,
+  Arguments<(ins LLVM_PointerShared:$addr, I32:$phase, I32:$ticks)> {
   let assemblyFormat = "$addr `,` $phase `,` $ticks attr-dict `:` type(operands)";
   let extraClassDefinition = [{
     std::string $cppClass::getPtx() {
@@ -343,7 +343,7 @@ def NVVM_MBarrierTryWaitParitySharedOp : NVVM_PTXBuilder_Op<"mbarrier.try_wait.p
         "bra.uni     LAB_WAIT; \n\t"
         "DONE: \n\t"
         "}"
-      ); 
+      );
     }
   }];
 }
@@ -439,7 +439,7 @@ def NVVM_FenceProxyOp : NVVM_PTXBuilder_Op<"fence.proxy">,
     [For more information, see PTX ISA]
     (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar)
   }];
-  
+
   let assemblyFormat = "attr-dict";
   let extraClassDefinition = [{
     std::string $cppClass::getPtx() {
@@ -483,9 +483,9 @@ def NVVM_FenceMbarrierInitOp : NVVM_PTXBuilder_Op<"fence.mbarrier.init"> {
     [For more information, see PTX ISA]
     (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#parallel-synchronization-and-communication-instructions-membar)
   }];
-  
+
   let assemblyFormat = "attr-dict";
-  let extraClassDefinition = [{        
+  let extraClassDefinition = [{
     std::string $cppClass::getPtx() {
       return std::string("fence.mbarrier_init.release.cluster;");
     }
@@ -548,13 +548,13 @@ def NVVM_SyncWarpOp :
 }
 
 
-def NVVM_ElectSyncOp : NVVM_Op<"elect.sync", 
+def NVVM_ElectSyncOp : NVVM_Op<"elect.sync",
                   [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>]>
-{  
+{
   let results = (outs I1:$pred);
-  let assemblyFormat = "attr-dict `->` type(results)";  
-  let extraClassDefinition = [{        
-    std::string $cppClass::getPtx() { 
+  let assemblyFormat = "attr-dict `->` type(results)";
+  let extraClassDefinition = [{
+    std::string $cppClass::getPtx() {
       return std::string(
         "{                                  \n"
         ".reg .u32 rx;                      \n"
@@ -563,7 +563,7 @@ def NVVM_ElectSyncOp : NVVM_Op<"elect.sync",
         "    elect.sync rx | px, 0xFFFFFFFF;\n"
         "@px mov.pred %0, 1;                \n"
         "}\n"
-      ); 
+      );
     }
   }];
 }
@@ -575,16 +575,16 @@ def LoadCacheModifierLU : I32EnumAttrCase<"LU", 3, "lu">;
 def LoadCacheModifierCV : I32EnumAttrCase<"CV", 4, "cv">;
 
 /// Enum attribute of the different kinds.
-def LoadCacheModifierKind : I32EnumAttr<"LoadCacheModifierKind", 
+def LoadCacheModifierKind : I32EnumAttr<"LoadCacheModifierKind",
                                 "NVVM load cache modifier kind",
-  [LoadCacheModifierCA, LoadCacheModifierCG, LoadCacheModifierCS, 
+  [LoadCacheModifierCA, LoadCacheModifierCG, LoadCacheModifierCS,
     LoadCacheModifierLU, LoadCacheModifierCV]> {
   let genSpecializedAttr = 0;
   let cppNamespace = "::mlir::NVVM";
   let description = [{
     Enum attribute of the different kinds of cache operators for load instructions.
 
-    [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#id62)    
+    [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#id62)
   }];
 }
 
@@ -610,7 +610,7 @@ def NVVM_CpAsyncOp : NVVM_PTXBuilder_Op<"cp.async.shared.global">,
             id = llvm::Intrinsic::nvvm_cp_async_cg_shared_global_16;
           else if($modifier == NVVM::LoadCacheModifierKind::CA)
             id = llvm::Intrinsic::nvvm_cp_async_ca_shared_global_16;
-          else 
+          else
             llvm_unreachable("unsupported cache modifier");
           break;
         default:
@@ -623,21 +623,21 @@ def NVVM_CpAsyncOp : NVVM_PTXBuilder_Op<"cp.async.shared.global">,
   let extraClassDeclaration = [{
     bool hasIntrinsic() { if(getCpSize()) return false; return true; }
 
-    void getAsmValues(RewriterBase &rewriter, 
+    void getAsmValues(RewriterBase &rewriter,
         llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues) {
       asmValues.push_back({getDst(), PTXRegisterMod::Read});
       asmValues.push_back({getSrc(), PTXRegisterMod::Read});
       asmValues.push_back({makeConstantI32(rewriter, getSize()), PTXRegisterMod::Read});
       asmValues.push_back({getCpSize(), PTXRegisterMod::Read});
-    }        
+    }
   }];
-  let extraClassDefinition = [{        
-    std::string $cppClass::getPtx() { 
+  let extraClassDefinition = [{
+    std::string $cppClass::getPtx() {
       if(getModifier() == NVVM::LoadCacheModifierKind::CG)
         return std::string("cp.async.cg.shared.global [%0], [%1], %2, %3;\n");
       if(getModifier() == NVVM::LoadCacheModifierKind::CA)
         return std::string("cp.async.ca.shared.global [%0], [%1], %2, %3;\n");
-      llvm_unreachable("unsupported cache modifier");      
+      llvm_unreachable("unsupported cache modifier");
     }
   }];
 }
@@ -1143,8 +1143,8 @@ def NVVM_WMMALoadOp: NVVM_Op<"wmma.load">,
   let summary = "Warp synchronous matrix load";
 
   // Since LLVM intrinsic IDs are enum that cannot be dynamically generated in
-  // C++ we instanciate a function in tablegen to map the valide configuration
-  // to the corresponsding intrinsic ID.
+  // C++ we instantiate a function in tablegen to map the valid configuration
+  // to the corresponding intrinsic ID.
   // Because we want a single source of truth, this mean the source of truth
   // about valid combinations needs to be in tablgen, therefore we generate
   // extra helpers to query valid configurations based on the shapes of
@@ -1325,9 +1325,9 @@ def NVVM_WMMAMmaOp : NVVM_Op<"wmma.mma">,
   let hasVerifier = 1;
 }
 
-def NVVM_StMatrixOp: NVVM_PTXBuilder_Op<"stmatrix">, 
-  Arguments<(ins LLVM_PointerShared:$ptr, 
-                 Variadic<I32>:$sources, 
+def NVVM_StMatrixOp: NVVM_PTXBuilder_Op<"stmatrix">,
+  Arguments<(ins LLVM_PointerShared:$ptr,
+                 Variadic<I32>:$sources,
                  MMALayoutAttr:$layout)> {
   let summary = "cooperative matrix store";
   let description = [{
@@ -1336,7 +1336,7 @@ def NVVM_StMatrixOp: NVVM_PTXBuilder_Op<"stmatrix">,
     [For more information, see PTX ISA]
     (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-store-instruction-stmatrix)
   }];
-  
+
   let assemblyFormat = "$ptr `,` $sources attr-dict `:` type(operands)";
   let extraClassDefinition = [{
     std::string $cppClass::getPtx() {
@@ -1409,7 +1409,7 @@ def NVVM_MmaOp : NVVM_Op<"mma.sync", [AttrSizedOperandSegments]> {
     All the threads in the warp must execute the same `mma.sync` operation.
 
     For each possible multiplicand PTX data type, there are one or more possible
-    instruction shapes given as "mMnNkK". The below table describes the posssibilities
+    instruction shapes given as "mMnNkK". The below table describes the possibilities
     as well as the types required for the operands. Note that the data type for
     C (the accumulator) and D (the result) can vary independently when there are
     multiple possibilities in the "C/D Type" column.
@@ -1548,53 +1548,53 @@ def NVVM_CpAsyncBulkCommitGroupOp : NVVM_PTXBuilder_Op<"cp.async.bulk.commit.gro
 }
 
 
-def NVVM_CpAsyncBulkTensorGlobalToSharedClusterOp : 
-  NVVM_Op<"cp.async.bulk.tensor.shared.cluster.global", 
-  [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>, 
+def NVVM_CpAsyncBulkTensorGlobalToSharedClusterOp :
+  NVVM_Op<"cp.async.bulk.tensor.shared.cluster.global",
+  [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>,
   AttrSizedOperandSegments]>,
   Arguments<(ins  LLVM_PointerShared:$dstMem,
                   LLVM_AnyPointer:$tmaDescriptor,
                   Variadic<I32>:$coordinates,
-                  LLVM_PointerShared:$mbar,                  
+                  LLVM_PointerShared:$mbar,
                   Variadic<I16>:$im2colOffsets,
                   Optional<I16>:$multicastMask,
                   Optional<I64>:$l2CacheHint,
                   PtxPredicate:$predicate)> {
   let description = [{
-    Initiates an asynchronous copy operation on the tensor data from global 
-    memory to shared memory. 
+    Initiates an asynchronous copy operation on the tensor data from global
+    memory to shared memory.
 
     The Op operates has two load modes:
-    1) Tiled Mode: It's the default mode. The source multi-dimensional tensor 
-    layout is preserved at the destination. 
+    1) Tiled Mode: It's the default mode. The source multi-dimensional tensor
+    layout is preserved at the destination.
 
     2) Im2col Mode: This mode is used when `im2colOffsets` operands are present.
     the elements in the Bounding Box of the source tensor are rearranged into
-    columns at the destination. In this mode, the tensor has to be at least 
-    3-dimensional. 
+    columns at the destination. In this mode, the tensor has to be at least
+    3-dimensional.
 
     The `multicastMask` operand is optional. When it is present, the Op copies
     data from global memory to shared memory of multiple CTAs in the cluster.
-    Operand `multicastMask` specifies the destination CTAs in the cluster such 
+    Operand `multicastMask` specifies the destination CTAs in the cluster such
     that each bit position in the 16-bit `multicastMask` operand corresponds to
-    the `nvvm.read.ptx.sreg.ctaid` of the destination CTA.     
+    the `nvvm.read.ptx.sreg.ctaid` of the destination CTA.
 
-    The `l2CacheHint` operand is optional, and it is used to specify cache 
+    The `l2CacheHint` operand is optional, and it is used to specify cache
     eviction policy that may be used during the memory access.
-    
+
     [For more information, see PTX ISA]
     (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor)
   }];
 
-  let assemblyFormat = [{ 
-    $dstMem `,` 
-    $tmaDescriptor `,` 
-    $mbar `,` 
-    `box` `[`$coordinates `]` 
+  let assemblyFormat = [{
+    $dstMem `,`
+    $tmaDescriptor `,`
+    $mbar `,`
+    `box` `[`$coordinates `]`
     (`im2col` `[` $im2colOffsets^ `]` )?
     (`multicast_mask` `=` $multicastMask^ )?
     (`l2_cache_hint` `=` $l2CacheHint^ )?
-    (`predicate` `=` $predicate^)? 
+    (`predicate` `=` $predicate^)?
     attr-dict  `:` type($dstMem) `,` type($tmaDescriptor)
   }];
 
@@ -1604,16 +1604,16 @@ def NVVM_CpAsyncBulkTensorGlobalToSharedClusterOp :
       int dim = getCoordinates().size();
       std::string ptx = "cp.async.bulk.tensor.";
       ptx += std::to_string(dim) + "d.";
-      ptx += "shared::cluster.global.mbarrier::complete_tx::bytes";      
+      ptx += "shared::cluster.global.mbarrier::complete_tx::bytes";
       if(im2colDim) ptx += ".im2col";
-      if(getMulticastMask()) ptx += ".multicast::cluster";      
+      if(getMulticastMask()) ptx += ".multicast::cluster";
       if(getL2CacheHint()) ptx += ".L2::cache_hint";
-      
+
       auto preg = [](int r) { return "%" + std::to_string(r); };
 
       // Build Registers
       ptx += " [%0], [%1, {";
-      int r = 2;      
+      int r = 2;
       for(int i = 0; i < dim; i++) ptx += preg(r+i) + ",";
       ptx.pop_back(); r += dim;
       ptx += "} ], [%" + std::to_string(r++) + "]";
@@ -1632,19 +1632,19 @@ def NVVM_CpAsyncBulkTensorGlobalToSharedClusterOp :
   let hasVerifier = 1;
 }
 
-def NVVM_CpAsyncBulkTensorSharedCTAToGlobalOp : 
-  NVVM_Op<"cp.async.bulk.tensor.global.shared.cta", 
-  [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>, 
+def NVVM_CpAsyncBulkTensorSharedCTAToGlobalOp :
+  NVVM_Op<"cp.async.bulk.tensor.global.shared.cta",
+  [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>,
   AttrSizedOperandSegments]>,
   Arguments<(ins  LLVM_AnyPointer:$tmaDescriptor,
                   LLVM_PointerShared:$srcMem,
                   Variadic<I32>:$coordinates,
                   PtxPredicate:$predicate)> {
-  let assemblyFormat = [{ 
-    $tmaDescriptor `,` 
-    $srcMem `,` 
-    `box` `[`$coordinates `]` 
-    (`,` `predicate` `=` $predicate^)?  
+  let assemblyFormat = [{
+    $tmaDescriptor `,`
+    $srcMem `,`
+    `box` `[`$coordinates `]`
+    (`,` `predicate` `=` $predicate^)?
     attr-dict  `:` type(operands)
   }];
   let extraClassDefinition = [{
@@ -1669,7 +1669,7 @@ def NVVM_PrefetchTensorMapOp : NVVM_Op<"prefetch.tensormap",
   Arguments<(ins LLVM_AnyPointer:$tmaDescriptor, PtxPredicate:$predicate)> {
   let assemblyFormat = "$tmaDescriptor (`,` `predicate` `=` $predicate^)? attr-dict `:` type(operands)";
   let extraClassDefinition = [{
-    std::string $cppClass::getPtx() { 
+    std::string $cppClass::getPtx() {
       return std::string("prefetch.tensormap [%0];");
     }
   }];
@@ -1682,9 +1682,9 @@ def NVVM_PrefetchTensorMapOp : NVVM_Op<"prefetch.tensormap",
 def NVVM_WgmmaFenceAlignedOp : NVVM_PTXBuilder_Op<"wgmma.fence.aligned"> {
   let arguments = (ins);
   let description = [{
-    Enforce an ordering of register accesses between warpgroup level matrix 
-    multiplication and other operations. 
-    
+    Enforce an ordering of register accesses between warpgroup level matrix
+    multiplication and other operations.
+
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-fence)
   }];
   let assemblyFormat = "attr-dict";
@@ -1698,7 +1698,7 @@ def NVVM_WgmmaGroupSyncAlignedOp : NVVM_PTXBuilder_Op<"wgmma.commit.group.sync.a
   let assemblyFormat = "attr-dict";
   let description = [{
     Commits all prior uncommitted warpgroup level matrix multiplication operations.
-    
+
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-commit-group)
   }];
   let extraClassDefinition = [{
@@ -1711,7 +1711,7 @@ def NVVM_WgmmaWaitGroupSyncOp : NVVM_PTXBuilder_Op<"wgmma.wait.group.sync.aligne
   let assemblyFormat = "attr-dict $group";
   let description = [{
     Signal the completion of a preceding warpgroup operation.
-    
+
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions-wgmma-wait-group)
   }];
   let extraClassDefinition = [{
@@ -1755,7 +1755,7 @@ def WGMMATypeF8E5M2 : I32EnumAttrCase<"e5m2", 7>;
 def WGMMATypes : I32EnumAttr<"WGMMATypes", "NVVM WGMMA types",
   [WGMMATypeF16, WGMMATypeTF32,
     WGMMATypeU8, WGMMATypeS8,
-    WGMMATypeB1, WGMMATypeBF16, WGMMATypeF8E4M3, 
+    WGMMATypeB1, WGMMATypeBF16, WGMMATypeF8E4M3,
     WGMMATypeF8E5M2]> {
   let genSpecializedAttr = 0;
   let cppNamespace = "::mlir::NVVM";
@@ -1765,43 +1765,43 @@ def WGMMATypesAttr : EnumAttr<NVVM_Dialect, WGMMATypes, "wgmma_type"> {
 }
 
 
-def NVVM_WgmmaMmaAsyncOp : NVVM_Op<"wgmma.mma_async", 
+def NVVM_WgmmaMmaAsyncOp : NVVM_Op<"wgmma.mma_async",
               [DeclareOpInterfaceMethods<BasicPtxBuilderOpInterface>,
                 PredOpTrait<"input struct and result struct must be the same type",
-                  TCresIsSameAsOpBase<0, 0>>,]> 
+                  TCresIsSameAsOpBase<0, 0>>,]>
 {
   let results = (outs LLVM_AnyStruct:$results);
-  let arguments = (ins 
+  let arguments = (ins
     LLVM_AnyStruct:$inouts,
-    I64:$descriptorA, 
-    I64:$descriptorB, 
+    I64:$descriptorA,
+    I64:$descriptorB,
     NVVM_MMAShapeAttr:$shape,
     WGMMATypesAttr:$typeA,
     WGMMATypesAttr:$typeB,
     WGMMAScaleOutAttr:$scaleD,
     WGMMAScaleInAttr:$scaleA,
-    WGMMAScaleInAttr:$scaleB, 
+    WGMMAScaleInAttr:$scaleB,
     MMALayoutAttr:$layoutA,
     MMALayoutAttr:$layoutB,
     OptionalAttr<MMAIntOverflowAttr>:$satfinite
-  );  
-  
-   let assemblyFormat = [{ 
-      $descriptorA `,` $descriptorB `,` $shape `,` 
+  );
+
+   let assemblyFormat = [{
+      $descriptorA `,` $descriptorB `,` $shape `,`
       `D` `[` $inouts `,` $scaleD (`,` $satfinite^)? `]` `,`
-      `A` `[` $typeA `,` $scaleA `,` $layoutA `]` `,` 
+      `A` `[` $typeA `,` $scaleA `,` $layoutA `]` `,`
       `B` `[` $typeB `,` $scaleB `,` $layoutB `]`
-      attr-dict `:` 
+      attr-dict `:`
       type($inouts) `->` type($results)
     }];
-  
+
   let description = [{
-    The warpgroup (128 threads) level matrix multiply and accumulate operation 
+    The warpgroup (128 threads) level matrix multiply and accumulate operation
     has either of the following forms, where matrix D is called accumulator:
       D = A * B + D
       D = A * B, where the input from accumulator D is disabled.
 
-    Supported shapes:  
+    Supported shapes:
     ```
     |--------------|--------------|------------|--------------|---------------|
     |              |              |            |              |f16+=e4m3*e4m3 |
@@ -1849,14 +1849,14 @@ def NVVM_WgmmaMmaAsyncOp : NVVM_Op<"wgmma.mma_async",
     |--------------|--------------|------------|--------------|---------------|
     ```
 
-    
+
     [For more information, see PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-instructions)
   }];
-  
+
   let hasVerifier = 1;
 
   let extraClassDeclaration = [{
-    void getAsmValues(RewriterBase &rewriter, 
+    void getAsmValues(RewriterBase &rewriter,
         llvm::SmallVectorImpl<std::pair<mlir::Value, mlir::NVVM::PTXRegisterMod>> &asmValues);
   }];
 }
diff --git a/mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.td b/mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.td
index 0242cfd9abb7d5..4aac81b169a8cf 100644
--- a/mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.td
+++ b/mlir/include/mlir/Dialect/LLVMIR/Transforms/Passes.td
@@ -73,8 +73,8 @@ def DIScopeForLLVMFuncOp : Pass<"ensure-debug-info-scope-on-llvm-func", "::mlir:
     emitting line tables from MLIR FileLocCol locations.
 
     This is not intended to be a proper replacement for frontends to emit
-    complete debug informations, however it is a convenient way to get line
-    tables for debugging purposes. This allow to step trough in a debugger
+    complete debug information; however, it is a convenient way to get line
+    tables for debugging purposes. This allows one to step through in a debugger
     line-by-line or get a backtrace with line numbers.
   }];
 
diff --git a/mlir/include/mlir/Dialect/Math/IR/MathOps.td b/mlir/include/mlir/Dialect/Math/IR/MathOps.td
index 3f6d2d2e44783f..1b5b61afbb88d3 100644
--- a/mlir/include/mlir/Dialect/Math/IR/MathOps.td
+++ b/mlir/include/mlir/Dialect/Math/IR/MathOps.td
@@ -276,7 +276,7 @@ def Math_Atan2Op : Math_FloatBinaryOp<"atan2">{
     (i.e., scalar, tensor or vector).
 
     The 2-argument arcus tangent `atan2(y, x)` returns the angle in the
-    Euclidian plane between the positive x-axis and the ray through the point
+    Euclidean plane between the positive x-axis and the ray through the point
     (x, y).  It is a generalization of the 1-argument arcus tangent which
     returns the angle on the basis of the ratio y/x.
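
To make the quadrant point concrete, a small standalone C++ check (illustration only; the values are arbitrary): `std::atan2` sees the signs of both arguments, while the one-argument form collapses (1, -1) and (-1, 1) onto the same ratio.

```c++
#include <cmath>
#include <cstdio>

int main() {
  double y = 1.0, x = -1.0; // the point (-1, 1), second quadrant
  // atan2 keeps both signs and reports the correct angle, 3*pi/4.
  std::printf("atan2(y, x) = %f\n", std::atan2(y, x)); // 2.356194
  // The ratio y/x == -1 loses the quadrant; atan reports -pi/4 instead.
  std::printf("atan(y / x) = %f\n", std::atan(y / x)); // -0.785398
  return 0;
}
```
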
 
diff --git a/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td b/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td
index c71517666b609c..1a82a7e7a359c6 100644
--- a/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td
+++ b/mlir/include/mlir/Dialect/MemRef/IR/MemRefOps.td
@@ -245,7 +245,7 @@ def MemRef_ReallocOp : MemRef_Op<"realloc"> {
     ```
 
     If the result memref has a dynamic shape, a result dimension operand is
-    needed to spefify its dynamic dimension. In the example below, the ssa value
+    needed to specify its dynamic dimension. In the example below, the ssa value
     '%d' specifies the unknown dimension of the result memref.
 
     ```mlir
@@ -641,7 +641,7 @@ def MemRef_DmaStartOp : MemRef_Op<"dma_start"> {
   let summary = "non-blocking DMA operation that starts a transfer";
   let description = [{
     Syntax:
-    
+
     ```
     operation ::= `memref.dma_start` ssa-use`[`ssa-use-list`]` `,`
                    ssa-use`[`ssa-use-list`]` `,` ssa-use `,`
diff --git a/mlir/include/mlir/Dialect/MemRef/Transforms/Transforms.h b/mlir/include/mlir/Dialect/MemRef/Transforms/Transforms.h
index a918f62cbc8db8..1efbbf652ba8f9 100644
--- a/mlir/include/mlir/Dialect/MemRef/Transforms/Transforms.h
+++ b/mlir/include/mlir/Dialect/MemRef/Transforms/Transforms.h
@@ -96,7 +96,7 @@ void populateMemRefNarrowTypeEmulationConversions(
 /// It returns the new allocation if the original allocation was multi-buffered
 /// and returns failure() otherwise.
 /// When `skipOverrideAnalysis`, the pass will apply the transformation
-/// without checking thwt the buffer is overrided at the beginning of each
+/// without checking that the buffer is overridden at the beginning of each
 /// iteration. This implies that user knows that there is no data carried across
 /// loop iterations. Example:
 /// ```
diff --git a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
index 7e139663d74b47..ccc58a4e76871f 100644
--- a/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
+++ b/mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
@@ -12,7 +12,7 @@
 // dialects and lower level NVVM dialect. This allow representing PTX specific
 // operations while using MLIR high level concepts like memref and 2-D vector.
 //
-// Ops semantic are going to be based on vendor specific PTX defintion:
+// Op semantics are going to be based on the vendor-specific PTX definition:
 // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
 //
 //===----------------------------------------------------------------------===//
@@ -38,7 +38,7 @@ def NVGPU_Dialect : Dialect {
 
   let useDefaultTypePrinterParser = 1;
   let useDefaultAttributePrinterParser = 1;
-  
+
   let extraClassDeclaration = [{
     /// Return true if the given MemRefType has an integer address
     /// space that matches the NVVM shared memory address space or
@@ -70,9 +70,9 @@ def TensorMapSwizzleNone : I32EnumAttrCase<"SWIZZLE_NONE", 0, "none">;
 def TensorMapSwizzle32B  : I32EnumAttrCase<"SWIZZLE_32B", 1, "swizzle_32b">;
 def TensorMapSwizzle64B  : I32EnumAttrCase<"SWIZZLE_64B", 2, "swizzle_64b">;
 def TensorMapSwizzle128B : I32EnumAttrCase<"SWIZZLE_128B", 3, "swizzle_128b">;
-def TensorMapSwizzleKind : I32EnumAttr<"TensorMapSwizzleKind", 
+def TensorMapSwizzleKind : I32EnumAttr<"TensorMapSwizzleKind",
                                 "Tensor map swizzling mode of shared memory banks",
-  [ TensorMapSwizzleNone, TensorMapSwizzle32B, TensorMapSwizzle64B, 
+  [ TensorMapSwizzleNone, TensorMapSwizzle32B, TensorMapSwizzle64B,
     TensorMapSwizzle128B]> {
   let genSpecializedAttr = 0;
   let cppNamespace = "::mlir::nvgpu";
@@ -82,9 +82,9 @@ def TensorMapL2PromoNone : I32EnumAttrCase<"L2PROMO_NONE", 0, "none">;
 def TensorMapL2Promo64B  : I32EnumAttrCase<"L2PROMO_64B", 1, "l2promo_64b">;
 def TensorMapL2Promo128B : I32EnumAttrCase<"L2PROMO_128B", 2, "l2promo_128b">;
 def TensorMapL2Promo256B : I32EnumAttrCase<"L2PROMO_256B", 3, "l2promo_256b">;
-def TensorMapL2PromoKind : I32EnumAttr<"TensorMapL2PromoKind", 
+def TensorMapL2PromoKind : I32EnumAttr<"TensorMapL2PromoKind",
                                 "Tensor map L2 promotion type",
-  [ TensorMapL2PromoNone, TensorMapL2Promo64B, TensorMapL2Promo128B, 
+  [ TensorMapL2PromoNone, TensorMapL2Promo64B, TensorMapL2Promo128B,
     TensorMapL2Promo256B]> {
   let genSpecializedAttr = 0;
   let cppNamespace = "::mlir::nvgpu";
@@ -92,7 +92,7 @@ def TensorMapL2PromoKind : I32EnumAttr<"TensorMapL2PromoKind",
 
 def TensorMapOOBZero : I32EnumAttrCase<"OOB_ZERO", 0, "zero">;
 def TensorMapOOBNaN  : I32EnumAttrCase<"OOB_NAN", 1, "nan">;
-def TensorMapOOBKind : I32EnumAttr<"TensorMapOOBKind", 
+def TensorMapOOBKind : I32EnumAttr<"TensorMapOOBKind",
                                 "Tensor map out-of-bounds fill type",
   [ TensorMapOOBZero, TensorMapOOBNaN]> {
   let genSpecializedAttr = 0;
@@ -102,7 +102,7 @@ def TensorMapOOBKind : I32EnumAttr<"TensorMapOOBKind",
 def TensorMapInterleaveNone : I32EnumAttrCase<"INTERLEAVE_NONE", 0, "none">;
 def TensorMapInterleave16B  : I32EnumAttrCase<"INTERLEAVE_16B", 1, "interleave_16b">;
 def TensorMapInterleave32B  : I32EnumAttrCase<"INTERLEAVE_32B", 2, "interleave_32b">;
-def TensorMapInterleaveKind : I32EnumAttr<"TensorMapInterleaveKind", 
+def TensorMapInterleaveKind : I32EnumAttr<"TensorMapInterleaveKind",
                                 "Tensor map interleave layout type",
   [ TensorMapInterleaveNone, TensorMapInterleave16B, TensorMapInterleave32B]> {
   let genSpecializedAttr = 0;
@@ -138,16 +138,16 @@ def NVGPU_DeviceAsyncToken : NVGPU_Type<"DeviceAsyncToken",
 def NVGPU_MBarrierGroup : NVGPU_Type<"MBarrierGroup", "mbarrier.group", []> {
   let summary = "mbarrier barrier type";
   let description = [{
-    This is the type for one or more mbarrier object in shared memory that is 
+    This is the type for one or more mbarrier objects in shared memory that are
     used to synchronize a variable number of threads.
 
     If `num_barriers` is not set, the number of mbarrier objects is 1.
 
-    A mbarrier object is 64 bit with 8 byte alignment. The mbarrier object 
+    An mbarrier object is 64 bits wide with 8-byte alignment. The mbarrier object
     can be initiated and invalidated.
 
     [See for more details in PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#size-and-alignment-of-mbarrier-object)
-  }];    
+  }];
   let parameters = (ins "Attribute":$memorySpace, DefaultValuedParameter<"unsigned", "1">:$num_barriers);
   let assemblyFormat = "`<` struct(params) `>`";
   let builders = [
@@ -168,8 +168,8 @@ def NVGPU_TensorMapDescriptor : NVGPU_Type<"TensorMapDescriptor", "tensormap.des
                         EnumParameter<TensorMapOOBKind>:$oob,
                         EnumParameter<TensorMapInterleaveKind>:$interleave);
   let description = [{
-    `nvgpu.tma.descriptor` is a type that represents a TMA descriptor. It is 
-    128-byte object either in constant space or kernel paramater.    
+    `nvgpu.tma.descriptor` is a type that represents a TMA descriptor. It is
+    a 128-byte object either in constant space or passed as a kernel parameter.
   }];
   let assemblyFormat = "`<` struct(params) `>`";
 }
@@ -177,9 +177,9 @@ def NVGPU_TensorMapDescriptor : NVGPU_Type<"TensorMapDescriptor", "tensormap.des
 def NVGPU_WarpgroupMatrixDescriptor : NVGPU_Type<"WarpgroupMatrixDescriptor", "warpgroup.descriptor", []> {
   let summary = "Warpgroup matrix descriptor type";
   let description = [{
-  The descriptor specifies the properties of the matrix in shared memory that 
-  is a multiplicand in the matrix multiply and accumulate operation. 
-  
+  The descriptor specifies the properties of the matrix in shared memory that
+  is a multiplicand in the matrix multiply and accumulate operation.
+
   The descriptor is a 64-bit value contained in a register with the following:
   ```
   +---------+-----+-----------+-----+-----------+-----+-----+-----------+-----+
@@ -190,10 +190,10 @@ def NVGPU_WarpgroupMatrixDescriptor : NVGPU_Type<"WarpgroupMatrixDescriptor", "w
   | BaseAddr|  0  | LeadingDim|  0  |   Stride  |  0  |Offst|     0     |Swzle|
   +---------+-----+-----------+-----+-----------+-----+-----+-----------+-----+
   ```
-   
-  [See for more details in PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor) 
-  
-  }];  
+
+  [See for more details in PTX ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor)
+
+  }];
   let parameters = (ins "MemRefType":$tensor);
   let assemblyFormat = "`<` struct(params) `>`";
 }
@@ -202,12 +202,12 @@ def NVGPU_WarpgroupAccumulator : NVGPU_Type<"WarpgroupAccumulator", "warpgroup.a
   let parameters = (ins "VectorType":$fragmented);
   let assemblyFormat = "`<` struct(params) `>`";
   let description = [{
-    This type represents the result matrix obtained from `nvgpu.warpgroup.mma`. 
-    The `$fragmented` type signifies the distributed or fragmented result 
-    vector that is collectively owned by all the threads in the warp-group 
+    This type represents the result matrix obtained from `nvgpu.warpgroup.mma`.
+    The `$fragmented` type signifies the distributed or fragmented result
+    vector that is collectively owned by all the threads in the warp-group
     that executed `nvgpu.warpgroup.mma`.
     [See the details of register fragment layout for accumulator matrix D]
-    (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d) 
+    (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d)
   }];
 }
 
@@ -271,7 +271,7 @@ def NVGPU_MmaSyncOp : NVGPU_MmaSyncOp<"mma.sync"> {
   let description = [{
     The `nvgpu.mma.sync` op represents the warp-level matrix-multiply-and-
     accumulate (mma) operation that is compatible with `nvvm.mma.sync`.
-    The operands and results vector sizes are thread-level onwership to
+    The operand and result vector sizes represent thread-level ownership of
     the warp-level mma operation shape. `mmaShape` attribute holds the
     warp-level matrix-multiply shape.
 
@@ -492,12 +492,12 @@ def NVGPU_DeviceAsyncWaitOp : NVGPU_Op<"device_async_wait", []> {
 def NVGPU_MBarrierCreateOp : NVGPU_Op<"mbarrier.create", []> {
   let summary = "Creates a `nvgpu.mbarrier` object.";
   let description = [{
-    The Op generates one or more `mbarrier` object, which is a barrier created in 
+    The Op generates one or more `mbarrier` objects; each is a barrier created in
     shared memory and supports various synchronization behaviors for threads.
 
     The `mbarrier` object has the following type and alignment requirements:
       Type: .b64, Alignment: 8, Memory space: .shared
-    
+
     Example:
     ```mlir
       %barrier = nvgpu.mbarrier.create -> !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>
@@ -529,7 +529,7 @@ def NVGPU_MBarrierInitOp : NVGPU_Op<"mbarrier.init", []> {
 def NVGPU_MBarrierTestWaitOp : NVGPU_Op<"mbarrier.test.wait", []> {
   let summary = "Checks if the `nvgpu.mbarrier` has completed its current phase.";
   let description = [{
-    Checks whether the mbarrier object has completed the phase. It is is a 
+    Checks whether the mbarrier object has completed the phase. It is a
     non-blocking instruction which tests for the completion of the phase.
 
     Example:
@@ -545,7 +545,7 @@ def NVGPU_MBarrierTestWaitOp : NVGPU_Op<"mbarrier.test.wait", []> {
 def NVGPU_MBarrierArriveOp : NVGPU_Op<"mbarrier.arrive", []> {
   let summary = "Performs arrive operation on the `nvgpu.mbarrier.arrive`.";
   let description = [{
-    The Op performs arrive-on operation on the `mbarrier` object and returns a 
+    The Op performs an arrive-on operation on the `mbarrier` object and returns a
     `nvgpu.mbarrier.token`.
 
     For more information, see
@@ -564,7 +564,7 @@ let assemblyFormat = "$barriers `[` $mbarId `]` attr-dict `:` type($barriers) `-
 def NVGPU_MBarrierArriveNoCompleteOp : NVGPU_Op<"mbarrier.arrive.nocomplete", []> {
   let summary = "Performs arrive operation on the `nvgpu.mbarrier.arrive.nocomplete` as non-blocking.";
   let description = [{
-    The Op performs arrive-on operation on the `mbarrier` object and returns a 
+    The Op performs an arrive-on operation on the `mbarrier` object and returns a
     `nvgpu.mbarrier.token`.
 
     The Op does not cause the `nvgpu.mbarrier` to complete its current phase.
@@ -583,13 +583,13 @@ def NVGPU_MBarrierArriveNoCompleteOp : NVGPU_Op<"mbarrier.arrive.nocomplete", []
 def NVGPU_MBarrierArriveExpectTxOp : NVGPU_Op<"mbarrier.arrive.expect_tx", []> {
   let summary = "Performs expect_tx operation on the `nvgpu.mbarrier.arrive`";
   let description = [{
-    A thread executing the Op performs an expect-tx operation on the mbarrier 
-    object at the location specified by the address operand $barrier. The 
-    expect-tx operation, with an $txcount argument, increases the tx-count of 
-    an mbarrier object by the value specified by $txcount. This makes the 
-    current phase of the mbarrier object to expect and track the completion of 
+    A thread executing the Op performs an expect-tx operation on the mbarrier
+    object at the location specified by the address operand $barrier. The
+    expect-tx operation, with an $txcount argument, increases the tx-count of
+    an mbarrier object by the value specified by $txcount. This causes the
+    current phase of the mbarrier object to expect and track the completion of
     additional asynchronous transactions.
-    
+
     The `$txCount` specifies the number of element to the expect-tx operation.
 
     Example:
@@ -604,10 +604,10 @@ def NVGPU_MBarrierArriveExpectTxOp : NVGPU_Op<"mbarrier.arrive.expect_tx", []> {
 def NVGPU_MBarrierTryWaitParityOp : NVGPU_Op<"mbarrier.try_wait.parity", []> {
   let summary = "Waits for the `nvgpu.mbarrier` to complete its current phase.";
   let description = [{
-    Checks whether the mbarrier object has completed the phase. It is is a 
-    potentially blocking instruction which tests for the completion of the 
-    phase. Suspended thread resumes execution when the specified phase completes 
-    OR before the phase completes following a system-dependent time limit. 
+    Checks whether the mbarrier object has completed the phase. It is a
+    potentially blocking instruction which tests for the completion of the
+    phase. A suspended thread resumes execution when the specified phase completes
+    OR before the phase completes following a system-dependent time limit.
 
     Example:
     ```mlir
@@ -616,13 +616,13 @@ def NVGPU_MBarrierTryWaitParityOp : NVGPU_Op<"mbarrier.try_wait.parity", []> {
 
   }];
   let arguments = (ins NVGPU_MBarrierGroup:$barriers, Index:$phase, Index:$ticks, Index:$mbarId);
-  let assemblyFormat = "$barriers `[` $mbarId `]` `,` $phase `,` $ticks attr-dict `:` type($barriers)";  
+  let assemblyFormat = "$barriers `[` $mbarId `]` `,` $phase `,` $ticks attr-dict `:` type($barriers)";
 }
 
 def NVGPU_TmaPrefetchOp : NVGPU_Op<"tma.prefetch.descriptor", []> {
   let summary = "Prefetch given `nvgpu.tensormap.descriptor` ";
   let description = [{
-    The Op brings the cache line containing the given `$tmaDescriptor` for 
+    The Op brings the cache line containing the given `$tmaDescriptor` for
     subsequent use by the `tma.async.load` instruction.
   }];
   let arguments = (ins NVGPU_TensorMapDescriptor:$tensorMapDescriptor, Optional<I1>:$predicate);
@@ -634,27 +634,27 @@ def NVGPU_TmaPrefetchOp : NVGPU_Op<"tma.prefetch.descriptor", []> {
 def NVGPU_TmaAsyncLoadOp : NVGPU_Op<"tma.async.load", [AttrSizedOperandSegments]> {
   let summary = "TMA asynchronous load";
   let description = [{
-    The Op loads a tile memory region from global memory to shared memory by 
+    The Op loads a tile memory region from global memory to shared memory by
     Tensor Memory Access (TMA).
-    
+
     `$tensorMapDescriptor` is tensor map descriptor which has information about
     tile shape. The descriptor is created by `nvgpu.tma.create.descriptor`
 
-    The Op uses `$barrier` mbarrier based completion mechanism. 
-  }];  
+    The Op uses the `$barrier` mbarrier-based completion mechanism.
+  }];
   let arguments = (ins  Arg<AnyMemRef, "", [MemWriteAt<0, FullEffect>]>:$dst,
                         NVGPU_MBarrierGroup:$barriers,
                         NVGPU_TensorMapDescriptor:$tensorMapDescriptor,
-                        Variadic<Index>:$coordinates, 
+                        Variadic<Index>:$coordinates,
                         Index:$mbarId,
                         Optional<I16>:$multicastMask,
                         Optional<I1>:$predicate);
   let assemblyFormat = [{
-    $tensorMapDescriptor `[` $coordinates `]` `,` $barriers `[` $mbarId `]` 
+    $tensorMapDescriptor `[` $coordinates `]` `,` $barriers `[` $mbarId `]`
       `to` $dst
       (`multicast_mask` `=` $multicastMask^ )?
       (`,` `predicate` `=` $predicate^)?
-      attr-dict `:` type($tensorMapDescriptor) `,` type($barriers) 
+      attr-dict `:` type($tensorMapDescriptor) `,` type($barriers)
       `->` type($dst)
   }];
   let hasVerifier = 1;
@@ -664,11 +664,11 @@ def NVGPU_TmaAsyncLoadOp : NVGPU_Op<"tma.async.load", [AttrSizedOperandSegments]
 def NVGPU_TmaCreateDescriptorOp : NVGPU_Op<"tma.create.descriptor", []> {
   let summary = "TMA create descriptor";
   let description = [{
-    The Op creates a tensor map descriptor object representing tiled memory 
-    region. To do that it calls CUDA Driver's `cuTensorMapEncodeTiled`. The 
+    The Op creates a tensor map descriptor object representing a tiled memory
+    region. To do that, it calls the CUDA Driver's `cuTensorMapEncodeTiled`. The
     descriptor is used by Tensor Memory Access (TMA).
 
-    The `tensor` is the source tensor to be tiled. 
+    The `tensor` is the source tensor to be tiled.
 
     The `boxDimensions` is the size of the tiled memory region in each dimension.
 
@@ -688,15 +688,15 @@ def NVGPU_TmaCreateDescriptorOp : NVGPU_Op<"tma.create.descriptor", []> {
 def NVGPU_WarpgroupGenerateDescriptorOp : NVGPU_Op<"warpgroup.generate.descriptor", []> {
   let summary = "Generate a warpgroup matrix descriptor";
   let description = [{
-  This Op builds a `nvgpu.warpgroup.descriptor` that is used by 
-  `nvgpu.warpgroup.mma` to perform warpgroup-level matrix multiply and 
+  This Op builds a `nvgpu.warpgroup.descriptor` that is used by
+  `nvgpu.warpgroup.mma` to perform warpgroup-level matrix multiply and
   accumulate.
 
-  The descriptor specifies the properties of the matrix in shared memory that 
-  is a multiplicand in the matrix multiply and accumulate operation. 
-  }];  
+  The descriptor specifies the properties of the matrix in shared memory that
+  is a multiplicand in the matrix multiply and accumulate operation.
+  }];
   let results = (outs NVGPU_WarpgroupMatrixDescriptor:$descriptor);
-  let arguments = (ins Arg<AnyMemRef, "", [MemRead]>:$tensor, 
+  let arguments = (ins Arg<AnyMemRef, "", [MemRead]>:$tensor,
                        NVGPU_TensorMapDescriptor:$tensorMap);
   let assemblyFormat = [{$tensor `,` $tensorMap attr-dict `:` type($tensor) `,` type($tensorMap) `->` type($descriptor)}];
   let hasVerifier = 1;
@@ -704,42 +704,42 @@ def NVGPU_WarpgroupGenerateDescriptorOp : NVGPU_Op<"warpgroup.generate.descripto
 
 def NVGPU_WarpgroupMmaOp : NVGPU_Op<"warpgroup.mma"> {
   let description = [{
-    The `nvgpu.warpgroup.mma` op performs the warpgroup-level (4 warps) 
-    matrix-multiply-and-accumulate (mma) operation that results in 
-    `nvvm.wgmma.mma_async`. 
-    
-    The operands are `descriptorA` and `descriptorB` that are wgmma matrix 
-    descriptors that shows the properties of the matrix in shared memory. The 
-    results are thread-level ownership to the warpgroup-level mma operation 
+    The `nvgpu.warpgroup.mma` op performs the warpgroup-level (4 warps)
+    matrix-multiply-and-accumulate (mma) operation that results in
+    `nvvm.wgmma.mma_async`.
+
+    The operands `descriptorA` and `descriptorB` are wgmma matrix
+    descriptors that show the properties of the matrix in shared memory. The
+    results represent thread-level ownership of the warpgroup-level mma operation
     shape. The shape is deduced from the descriptor types and output vector.
 
-    The Op encapsulates multiple `nvvm.wgmma.mma_async` operations to complete 
-    the given shape. As `nvvm.wgmma.async` Op, or its corresponding PTX 
-    instruction, is asynchronous, this Op groups the `nvvm.wgmma.async` and 
-    surrounds them between `wgmma.fence.aligned` and 
+    The Op encapsulates multiple `nvvm.wgmma.mma_async` operations to complete
+    the given shape. As the `nvvm.wgmma.async` Op, or its corresponding PTX
+    instruction, is asynchronous, this Op groups the `nvvm.wgmma.async` and
+    surrounds them between `wgmma.fence.aligned` and
     `wgmma.commit.group.sync.aligned`, `wgmma.wait.group.sync.aligned` Ops.
 
     Example:
     ```mlir
-      %r1,%r2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2: 
-                 !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>, 
-                 !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>, 
+      %r1,%r2 = nvgpu.warpgroup.mma %descA, %descB, %acc1, %acc2:
+                 !nvgpu.warpgroup.descriptor<tensor = memref<128x64xf16, 3>>,
+                 !nvgpu.warpgroup.descriptor<tensor = memref<64x128xf16, 3>>,
                  !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                  !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
-                 -> 
+                 ->
                  !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>,
                  !nvgpu.warpgroup.accumulator<fragmented = vector<64x128xf32>>
     ```
   }];
 
-  let arguments = (ins NVGPU_WarpgroupMatrixDescriptor:$descriptorA, 
-                       NVGPU_WarpgroupMatrixDescriptor:$descriptorB,                                               
+  let arguments = (ins NVGPU_WarpgroupMatrixDescriptor:$descriptorA,
+                       NVGPU_WarpgroupMatrixDescriptor:$descriptorB,
                        DefaultValuedOptionalAttr<I32Attr, "1">:$waitGroup,
                        OptionalAttr<UnitAttr>:$transposeA,
                        OptionalAttr<UnitAttr>:$transposeB,
                        NVGPU_WarpgroupAccumulator:$matrixC);
   let results = (outs NVGPU_WarpgroupAccumulator:$matrixD);
-  let assemblyFormat = [{    
+  let assemblyFormat = [{
     $descriptorA`,` $descriptorB`,` $matrixC attr-dict
     `:` type($descriptorA) `,` type($descriptorB) `,` type($matrixC) `->` type($matrixD)
   }];
@@ -748,29 +748,29 @@ def NVGPU_WarpgroupMmaOp : NVGPU_Op<"warpgroup.mma"> {
 
 def NVGPU_WarpgroupMmaStoreOp : NVGPU_Op<"warpgroup.mma.store"> {
   let description = [{
-    The `nvgpu.warpgroup.mma.store` op performs the store of fragmented result 
-    in $matrixD to given memref. 
+    The `nvgpu.warpgroup.mma.store` op performs the store of the fragmented result
+    in $matrixD to the given memref.
 
     [See the details of register fragment layout for accumulator matrix D]
-    (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d) 
+    (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d)
 
     Note that, the op must be run with warp group.
   }];
 
   let arguments = (ins NVGPU_WarpgroupAccumulator:$matrixD,
                        Arg<AnyMemRef, "", [MemWrite]>:$dstMemref);
-  
+
   let assemblyFormat = [{
     $matrixD `,` $dstMemref attr-dict `:` type($matrixD) `to` type($dstMemref)
   }];
   let hasVerifier = 1;
 }
 
-def NVGPU_WarpgroupMmaInitAccumulatorOp : NVGPU_Op<"warpgroup.mma.init.accumulator"> {  
+def NVGPU_WarpgroupMmaInitAccumulatorOp : NVGPU_Op<"warpgroup.mma.init.accumulator"> {
   let summary = "Initializes the accumulator matrix";
 
   let description = [{
-    This Op generates and initializes the accumulator matrix for 
+    This Op generates and initializes the accumulator matrix for
     `nvgpu.warpgroup.mma` op to perform matrix-multiply-and-accumulate.
   }];
   let results = (outs NVGPU_WarpgroupAccumulator:$matrixC);
diff --git a/mlir/include/mlir/Dialect/NVGPU/Transforms/Transforms.h b/mlir/include/mlir/Dialect/NVGPU/Transforms/Transforms.h
index 4bac8986567ea8..d3825c4246c177 100644
--- a/mlir/include/mlir/Dialect/NVGPU/Transforms/Transforms.h
+++ b/mlir/include/mlir/Dialect/NVGPU/Transforms/Transforms.h
@@ -64,7 +64,7 @@ enum class MmaSyncF32Lowering { TF32 = 0, TF32x3 = 1, Unkown = 2 };
 /// Typically, tf32 tensor core acceleration comes at a cost
 /// of accuracy from missing precision bits. While f32 has 23 precision
 /// bits, tf32 has only 10 precision bits. tf32x3 aims to recover the
-/// precision bits by spliting each operand into two tf32 values
+/// precision bits by splitting each operand into two tf32 values
 /// and issue three mma.sync tensor core operations.
 void populateMmaSyncF32ToTF32Patterns(
     RewritePatternSet &patterns,
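
For intuition, a rough scalar sketch of the tf32x3 idea (illustration only; `toTf32` is a hypothetical helper, and the real lowering issues three `mma.sync` tensor core ops rather than scalar multiplies).

```c++
#include <cstdint>
#include <cstdio>
#include <cstring>

// Hypothetical helper: truncate an f32 toward zero to tf32 precision
// (10 explicit mantissa bits) by clearing the low 13 mantissa bits.
static float toTf32(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  bits &= 0xFFFFE000u;
  float y;
  std::memcpy(&y, &bits, sizeof(y));
  return y;
}

int main() {
  float a = 1.234567f, b = 7.654321f;
  // Split each operand into a "big" tf32 part and a small residual.
  float aBig = toTf32(a), aSmall = a - aBig;
  float bBig = toTf32(b), bSmall = b - bBig;
  // Three partial products recover most of the precision lost by the naive
  // single tf32 product (this mirrors the three mma.sync ops mentioned above).
  float approx = aBig * bBig + aBig * bSmall + aSmall * bBig;
  std::printf("exact  %.9f\napprox %.9f\nnaive  %.9f\n",
              a * b, approx, aBig * bBig);
  return 0;
}
```
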
diff --git a/mlir/include/mlir/Dialect/SparseTensor/IR/Enums.h b/mlir/include/mlir/Dialect/SparseTensor/IR/Enums.h
index ac91bfa5ae622d..bb9b0371b048a9 100644
--- a/mlir/include/mlir/Dialect/SparseTensor/IR/Enums.h
+++ b/mlir/include/mlir/Dialect/SparseTensor/IR/Enums.h
@@ -319,7 +319,7 @@ constexpr std::optional<LevelType> buildLevelType(LevelFormat lf, bool ordered,
 }
 
 //
-// Ensure the above methods work as indended.
+// Ensure the above methods work as intended.
 //
 
 static_assert(
diff --git a/mlir/include/mlir/Dialect/SparseTensor/IR/SparseTensorOps.td b/mlir/include/mlir/Dialect/SparseTensor/IR/SparseTensorOps.td
index 3127cf1b1bcf69..2dadadc4e8b6bb 100644
--- a/mlir/include/mlir/Dialect/SparseTensor/IR/SparseTensorOps.td
+++ b/mlir/include/mlir/Dialect/SparseTensor/IR/SparseTensorOps.td
@@ -39,7 +39,7 @@ def SparseTensor_NewOp : SparseTensor_Op<"new", [Pure]>,
     is kept deliberately very general to allow for alternative implementations
     in the future, such as pointers to buffers or runnable initialization
     code. The operation is provided as an anchor that materializes a properly
-    typed sparse tensor with inital contents into a computation.
+    typed sparse tensor with initial contents into a computation.
 
     Reading in a symmetric matrix will result in just the lower/upper triangular
     part of the matrix (so that only relevant information is stored). Proper
@@ -61,7 +61,7 @@ def SparseTensor_AssembleOp : SparseTensor_Op<"assemble", [Pure]>,
   let summary = "Returns a sparse tensor assembled from the given values and levels";
 
   let description = [{
-    Assembles the values and per-level coordinate or postion arrays into a sparse tensor.
+    Assembles the values and per-level coordinate or position arrays into a sparse tensor.
     The order and types of provided levels must be consistent with the actual storage
     layout of the returned sparse tensor described below.
 
@@ -69,7 +69,7 @@ def SparseTensor_AssembleOp : SparseTensor_Op<"assemble", [Pure]>,
       supplies the value for each stored element in the sparse tensor.
     - `levels: [tensor<? x iType>, ...]`
       each supplies the sparse tensor coordinates scheme in the sparse tensor for
-      the corresponding level as specifed by `sparse_tensor::StorageLayout`.
+      the corresponding level as specified by `sparse_tensor::StorageLayout`.
 
     This operation can be used to assemble a sparse tensor from external
     sources; e.g., when passing two numpy arrays from Python.
@@ -675,7 +675,7 @@ def SparseTensor_InsertOp : SparseTensor_Op<"insert",
     Inserts the value into the underlying storage of the tensor at the
     given level-coordinates. The arity of `lvlCoords` must match the
     level-rank of the tensor. This operation can only be applied when
-    the tensor materializes unintialized from a `tensor.empty` operation
+    the tensor materializes uninitialized from a `tensor.empty` operation
     and the final tensor is constructed with a `load` operation which
     has the `hasInserts` attribute set.
 
diff --git a/mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td b/mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
index f38779ed9ed2b8..0e8a50e4508f34 100644
--- a/mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
+++ b/mlir/include/mlir/Dialect/SparseTensor/Transforms/Passes.td
@@ -386,7 +386,7 @@ def SparseGPUCodegen : Pass<"sparse-gpu-codegen", "ModuleOp"> {
 }
 
 def StorageSpecifierToLLVM : Pass<"sparse-storage-specifier-to-llvm", "ModuleOp"> {
-  let summary = "Lower sparse storage specifer to llvm structure";
+  let summary = "Lower sparse storage specifier to llvm structure";
   let description = [{
      This pass rewrites sparse tensor storage specifier-related operations into
      LLVMDialect, and converts sparse tensor storage specifier into an llvm.struct.
diff --git a/mlir/include/mlir/ExecutionEngine/ExecutionEngine.h b/mlir/include/mlir/ExecutionEngine/ExecutionEngine.h
index 66f49c787c1905..7f8d7d310c6ad9 100644
--- a/mlir/include/mlir/ExecutionEngine/ExecutionEngine.h
+++ b/mlir/include/mlir/ExecutionEngine/ExecutionEngine.h
@@ -151,7 +151,7 @@ class ExecutionEngine {
   llvm::Expected<void (*)(void **)> lookupPacked(StringRef name) const;
 
   /// Looks up the original function with the given name and returns a
-  /// pointer to it. This is not necesarily a packed function. Propagates
+  /// pointer to it. This is not necessarily a packed function. Propagates
   /// errors in case of failure.
   llvm::Expected<void *> lookup(StringRef name) const;
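
A hedged usage sketch for the two lookups above (the engine, module, and symbol name `foo` are hypothetical; only the `lookup`/`lookupPacked` signatures come from this header).

```c++
#include "mlir/ExecutionEngine/ExecutionEngine.h"
#include "llvm/Support/Error.h"
#include <cstdint>

// Sketch: call a JIT-compiled function "foo" through the ExecutionEngine.
llvm::Error callFoo(mlir::ExecutionEngine &engine, int32_t arg) {
  // lookup() returns the original (non-packed) symbol; the caller must know
  // the function's actual C signature and cast accordingly.
  llvm::Expected<void *> raw = engine.lookup("foo");
  if (!raw)
    return raw.takeError();
  auto *fn = reinterpret_cast<int32_t (*)(int32_t)>(*raw);
  int32_t result = fn(arg);
  (void)result;
  // lookupPacked("foo") would instead return the void(void **) wrapper that
  // receives pointers to the arguments and results, as declared above.
  return llvm::Error::success();
}
```
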
 
diff --git a/mlir/include/mlir/ExecutionEngine/MemRefUtils.h b/mlir/include/mlir/ExecutionEngine/MemRefUtils.h
index 918647d9feac34..3267d48a497375 100644
--- a/mlir/include/mlir/ExecutionEngine/MemRefUtils.h
+++ b/mlir/include/mlir/ExecutionEngine/MemRefUtils.h
@@ -125,7 +125,7 @@ allocAligned(size_t nElements, AllocFunType allocFun = &::malloc,
 
 /// Convenient callback to "visit" a memref element by element.
 /// This takes a reference to an individual element as well as the coordinates.
-/// It can be used in conjuction with a StridedMemrefIterator.
+/// It can be used in conjunction with a StridedMemrefIterator.
 template <typename T>
 using ElementWiseVisitor = llvm::function_ref<void(T &ptr, ArrayRef<int64_t>)>;
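
A minimal sketch of a callback matching this alias (the fill rule and the surrounding setup are hypothetical).

```c++
#include "mlir/ExecutionEngine/MemRefUtils.h"
#include "llvm/ADT/ArrayRef.h"
#include <cstdint>

// Sketch of a visitor compatible with ElementWiseVisitor<float>: it receives
// a mutable reference to one element plus that element's coordinates.
// The fill rule below is arbitrary (row * 100 + column for a 2-D memref).
void fillWithCoords(float &elem, llvm::ArrayRef<int64_t> coords) {
  elem = static_cast<float>(coords[0] * 100 + coords[1]);
}
// A function with this signature converts implicitly to
// ElementWiseVisitor<float> wherever the MemRefUtils helpers expect one.
```
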
 
diff --git a/mlir/include/mlir/ExecutionEngine/SparseTensorRuntime.h b/mlir/include/mlir/ExecutionEngine/SparseTensorRuntime.h
index 8b0829aab0d8d0..b57de7c0410613 100644
--- a/mlir/include/mlir/ExecutionEngine/SparseTensorRuntime.h
+++ b/mlir/include/mlir/ExecutionEngine/SparseTensorRuntime.h
@@ -190,7 +190,7 @@ MLIR_CRUNNERUTILS_EXPORT void delSparseTensorReader(void *p);
 /// Only the extended FROSTT format is supported currently.
 MLIR_CRUNNERUTILS_EXPORT void *createSparseTensorWriter(char *filename);
 
-/// Finalizes the outputing of a sparse tensor to a file and releases the
+/// Finalizes the outputting of a sparse tensor to a file and releases the
 /// SparseTensorWriter.
 MLIR_CRUNNERUTILS_EXPORT void delSparseTensorWriter(void *p);
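
A minimal pairing of the two entry points above (sketch only; the calls that actually emit dimensions and elements are declared elsewhere in this header and are elided here).

```c++
#include "mlir/ExecutionEngine/SparseTensorRuntime.h"

// Sketch: open and finalize a FROSTT-format writer. The filename is
// arbitrary, and the element-emitting calls are omitted.
void writeSparseTensorSkeleton() {
  char filename[] = "tensor.tns";
  void *writer = createSparseTensorWriter(filename);
  // ... emit metadata and elements through the other writer entry points ...
  delSparseTensorWriter(writer); // finalizes the file and releases the writer
}
```
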
 
diff --git a/mlir/python/mlir/dialects/func.py b/mlir/python/mlir/dialects/func.py
index 24fdcbcd85b29f..92237ca5e0897d 100644
--- a/mlir/python/mlir/dialects/func.py
+++ b/mlir/python/mlir/dialects/func.py
@@ -148,7 +148,7 @@ def from_py_func(
           * `func_op`: The `func` op being defined.
 
         By default, the function name will be the Python function `__name__`. This
-        can be overriden by passing the `name` argument to the decorator.
+        can be overridden by passing the `name` argument to the decorator.
 
         If `results` is not specified, then the decorator will implicitly
         insert a `ReturnOp` with the `Value`'s returned from the decorated
diff --git a/mlir/python/mlir/dialects/linalg/opdsl/lang/affine.py b/mlir/python/mlir/dialects/linalg/opdsl/lang/affine.py
index 9fa626dfa78b1b..35cd7b099994e8 100644
--- a/mlir/python/mlir/dialects/linalg/opdsl/lang/affine.py
+++ b/mlir/python/mlir/dialects/linalg/opdsl/lang/affine.py
@@ -103,7 +103,7 @@ def get_dim(self, dimname: str) -> int:
             if not self.allow_new_dims:
                 raise ValueError(
                     f"New dimensions not allowed in the current affine expression: "
-                    f"Requested '{dimname}', Availble: {self.all_dims}"
+                    f"Requested '{dimname}', Available: {self.all_dims}"
                 )
             pos = len(self.all_dims)
             self.all_dims[dimname] = pos
@@ -117,7 +117,7 @@ def get_symbol(self, symname: str) -> int:
             if not self.allow_new_symbols:
                 raise ValueError(
                     f"New symbols not allowed in the current affine expression: "
-                    f"Requested '{symname}', Availble: {self.all_symbols}"
+                    f"Requested '{symname}', Available: {self.all_symbols}"
                 )
             pos = len(self.all_symbols)
             self.all_symbols[symname] = pos


